Building the Harness: The Files That Turn a Coding Agent Into an Expert

June 17, 2026    8 min read

Part one made the argument: your coding agent is a people-pleaser by training, you can't retrain a frontier model, and the leverage you do have is the harness—the rules files, skills, memory, and review steps the model runs inside. That post was the why. This one is the how.

Everything below lives in a repo you can clone right now: agent-harness. I'll walk through each piece, show you what's in it, and explain the one rule that determines whether any of it actually works.


1. The rules file — and why short wins

The foundation is a single instructions file your agent reads every session. In Claude Code it's CLAUDE.md; for Cursor, Copilot, Codex, and most others it's the cross-tool standard AGENTS.md. The kit ships both, with identical rules.

Here's the spine of it:

You are an expert engineer, not a helpful assistant. Your job is to be
right, not agreeable. When you agree without reason, you teach the user
to stop trusting you.

## Interview before you recommend
- Ask about the constraint, the intent, and the risk before proposing.
- Asking is for gathering facts, not avoiding commitment. Then recommend.

## Research before you recommend, not after
- Exhaust the question first: fundamentals, failure modes, the case against.
- Never recommend first and rationalize later.

## Decide and defend — update on evidence, not pressure
- Hold the recommendation under pushback unless the user brings new information.
- "I disagree" is not new information. "Here's the benchmark" is.

## Don't become a contrarian
- Flag only real correctness or requirement problems. Don't manufacture objections.

Now the counterintuitive part: the most important property of this file is that it's short.

It's tempting to keep adding commandments until the agent is "fully governed." Don't. Past a certain length, the model starts ignoring your rules—not maliciously, but because the load-bearing lines get buried in the noise. Anthropic says it plainly in their own guidance: a bloated CLAUDE.md causes the model to ignore your actual instructions.

The test for every line: would removing it cause a mistake? If not, cut it. A page of rules the agent follows beats five pages it skims.
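If you want that test to have teeth, a small check can flag the file when it starts to sprawl. Here is a minimal sketch in Python. The 60-line budget and the function names are assumptions made up for illustration; the kit does not ship this script, and no tool documents a hard limit. Pick whatever number matches your own idea of "one page."

```python
from pathlib import Path

# Hypothetical budget: the article only says "short", so 60 is an assumed cutoff.
MAX_RULE_LINES = 60


def count_rule_lines(text: str) -> int:
    """Count non-blank lines; blank lines are spacing, not rules."""
    return sum(1 for line in text.splitlines() if line.strip())


def check_rules_file(path: Path, max_lines: int = MAX_RULE_LINES) -> tuple[bool, int]:
    """Return (within_budget, line_count) for the instructions file at `path`."""
    count = count_rule_lines(path.read_text(encoding="utf-8"))
    return count <= max_lines, count
```

Point it at CLAUDE.md or AGENTS.md in a pre-commit hook or CI step. A failing check is a prompt to cut lines, not to raise the budget.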

2. Skills: behavior you load on demand

If the rules file is the always-on constitution, skills are specialized behaviors the agent pulls in only when they're relevant. That separation is the whole point: it keeps the always-loaded context lean while still giving the agent deep procedures for specific moments.

A skill is just a markdown file with a short description that acts as a trigger. The agent reads the description, decides if the situation matches, and only then loads the full instructions. The kit ships four, one for each expert behavior:

  • interview-first — fires before a non-trivial recommendation. Walks the agent through asking about constraint, intent, and risk—then reminds it that interviewing is for gathering facts, not for dodging a commitment.
  • research-before-recommending — fires before an expensive, hard-to-reverse decision. Forces the fundamentals-failure-modes-counter-case pass before the recommendation, and for high stakes, spins up a "build the case for" pass and a "hostile skeptic" pass in parallel.
  • decide-and-defend — fires when you push back. This is the crux: it codifies the line between updating on evidence (expertise) and folding to pressure (sycophancy), with concrete examples of which is which.
  • adversarial-verify — fires after the agent produces a finding it's about to call "done." It runs a skeptical pass that tries to refute the result, so the agent doing the work isn't the one grading it.

The design rule for skills mirrors the rule for the instructions file: start small and single-purpose. A skill that does one thing well and triggers reliably is worth ten sprawling ones the agent never quite knows when to use. The description is doing the real work—write it as "use this when X happens," not as a summary.
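The "trigger, not summary" rule is easy to lint. The sketch below assumes each skill file opens with a front-matter block delimited by `---` lines that contains a single-line `description:` field; that layout, the helper names, and the accepted phrasings are assumptions for illustration, so adjust them to whatever your agent actually expects.

```python
def read_description(skill_text: str) -> str:
    """Extract the `description:` value from a leading `---` front-matter block."""
    lines = skill_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return ""  # no front matter, so no description to check
    for line in lines[1:]:
        if line.strip() == "---":  # end of front matter
            break
        key, sep, value = line.partition(":")
        if sep and key.strip() == "description":
            return value.strip()
    return ""


def is_trigger_style(description: str) -> bool:
    """True when the description is phrased as a trigger ("use this when ...")."""
    # Assumed phrasings; extend the tuple to match your own convention.
    return description.lower().startswith(("use this when", "use when"))
```

A description like "Use this when the user pushes back on a recommendation" passes; "A guide to defending decisions" does not. The parser is deliberately naive: it handles only single-line values and leaves any surrounding quotes in place.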

3. Memory: where the expertise persists

An interview-first, research-driven agent is great—until it forgets everything the moment the session ends and re-interrogates you about constraints you settled last week. Memory is what makes the expertise stick.

The kit uses the approach production agents are converging on, and it's almost aggressively simple: plain markdown files the agent reads at the start of a session and updates as it learns. No vector database.

memory/
  global/MEMORY.md     You, across every project: role, preferences, hard rules.
  project/MEMORY.md    This project: conventions, gotchas, decisions, current state.

Two scopes, two jobs. Global memory lives at the user level and loads everywhere—your role, how you like to work, your non-negotiables. Project memory lives in the repo, gets committed, and travels with the team—the conventions, the gotchas, the why behind past decisions.

Why markdown instead of an embedding store? Because for the finite, structured context of "who you are and how this project works," files win on every axis that matters: they're a single source of truth you can swap between models, they're human-readable and git-diffable, and they live in your repo instead of a vendor's store. Andrej Karpathy floated the same idea publicly—an evolving markdown knowledge base that sidesteps retrieval entirely. Save the vector database for genuinely large, unstructured recall; for this, it's overkill.

One discipline carries over from the rules file: curate, don't just accumulate. Tell the agent to append durable facts and prune wrong ones. Memory that only grows becomes noise the agent ignores—the exact same failure mode as a bloated instructions file.
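In practice the agent does the reading and curating itself, but the operations are simple enough to sketch in code. This is a minimal illustration in Python, assuming facts are stored as one `- ` bullet per line; the function names and the bullet convention are hypothetical and not part of the kit.

```python
from pathlib import Path


def load_memory(*paths: Path) -> str:
    """Concatenate whichever memory files exist, in the order given."""
    sections = []
    for path in paths:
        if path.exists():
            body = path.read_text(encoding="utf-8").strip()
            sections.append(f"# {path}\n{body}")
    return "\n\n".join(sections)


def append_fact(path: Path, fact: str) -> bool:
    """Append `fact` as a bullet unless already recorded. True if written."""
    bullet = f"- {fact.strip()}"
    existing = path.read_text(encoding="utf-8").splitlines() if path.exists() else []
    if bullet in existing:
        return False  # curate: never store the same fact twice
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(existing + [bullet]) + "\n", encoding="utf-8")
    return True


def prune_fact(path: Path, fact: str) -> bool:
    """Remove the bullet for `fact` if present. True if something was removed."""
    bullet = f"- {fact.strip()}"
    existing = path.read_text(encoding="utf-8").splitlines() if path.exists() else []
    kept = [line for line in existing if line != bullet]
    if len(kept) == len(existing):
        return False
    path.write_text("\n".join(kept) + "\n", encoding="utf-8")
    return True
```

Calling `load_memory(global_path, project_path)` at session start gives the global scope first and the project scope second, matching the two-scope layout above. The refusal to store duplicates and the explicit prune step are the code-level version of "curate, don't just accumulate."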

4. Wiring it in

The whole kit is copy-and-go:

  • Claude Code — drop CLAUDE.md in your repo root (or append it to ~/.claude/CLAUDE.md for every project), and put the skills/ folders in .claude/skills/.
  • Cursor / Copilot / Codex / Gemini CLI / others — copy AGENTS.md to your repo root; most agents read it automatically.
  • Memory — copy the memory/ templates, delete the examples, and tell your agent to read and maintain them.

That's it. Same model, different harness.
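If you would rather script the copy than do it by hand, the steps above fit in a few lines of Python. This sketch assumes the cloned kit has CLAUDE.md, AGENTS.md, skills/, and memory/ at its top level, which I have not confirmed against the repo; `install_harness` is a hypothetical helper, not something the kit provides.

```python
import shutil
from pathlib import Path


def install_harness(kit: Path, repo: Path) -> list[Path]:
    """Copy harness files from `kit` into `repo`; return the destinations written."""
    written = []
    # Rules files go to the repo root, as the steps above describe.
    for name in ("CLAUDE.md", "AGENTS.md"):
        src = kit / name
        if src.is_file():
            written.append(Path(shutil.copy2(src, repo / name)))
    # Directory targets: skills/ -> .claude/skills/, memory/ -> memory/.
    for src_name, dest_rel in (("skills", ".claude/skills"), ("memory", "memory")):
        src = kit / src_name
        if src.is_dir():
            dest = repo / dest_rel
            # dirs_exist_ok lets a re-run refresh an earlier install (Python 3.8+).
            shutil.copytree(src, dest, dirs_exist_ok=True)
            written.append(dest)
    return written
```

Note that this overwrites an existing CLAUDE.md or AGENTS.md rather than appending to it, so back up or merge by hand if your repo already has one. You still need to delete the memory examples and tell your agent to read and maintain the files.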

The failure modes (read this before you go rule-crazy)

Installing all of this badly is its own trap, so three honest warnings:

  1. Don't bloat it. Every file in this kit fights for the model's attention. Short rules, single-purpose skills, curated memory. The instant it gets noisy, the agent starts ignoring the parts that matter.
  2. Don't over-correct into a contrarian. Push "be skeptical, find problems" too hard and the agent manufactures objections to look rigorous—and you end up with over-engineered, defensive code for problems you don't have. The kit's verify skill is explicit about this: confirm real failures, treat the rest as optional.
  3. It's prompt engineering, not a guarantee. None of this is a hard technical boundary—it's the model following good instructions. It dramatically shifts behavior, but you still verify the important stuff yourself. The harness makes your agent act like an expert; it doesn't relieve you of being one.

The target was never maximum assertiveness. It's calibration: confident enough to hold a defensible position, humble enough to update on a real fact, disciplined enough not to invent problems.


The Bottom Line: The model you're given will agree with your worst idea and forget your best constraint. A few small files—short rules, sharp skills, curated memory—change that, and they cost you an afternoon to set up. Clone agent-harness, keep what fits, and stop arguing with a yes-man.