Why Your Cheapest Model Should Write the Harness

written by Stefan Christoph

June 8, 2026 - 7-minute read

TL;DR: A May 2026 paper separates two capabilities that self-improving agents usually conflate: writing harness updates and benefiting from them. Writing is flat across model tiers: a 9B open model produces updates roughly as useful as a frontier model. Benefiting is an inverted-U, where mid-tier models gain most, strong models sit near the ceiling, and weak models can’t follow the harness they’re given. The practical move: put your cheap model in the evolver seat and your expensive model in the solver seat. I reproduced the mechanism on Amazon Bedrock, where a Haiku-written skill lifted a Sonnet solver from fail to pass.

The Question I Kept Getting Wrong

I run most of my work through an agent driven by an editable harness: prompts, skills, and memories the agent loads when a task matches. When I started letting the agent improve its own harness, I defaulted to the obvious setup: use the best model I have to write the updates. Best model, best skills, right?

A new paper says that instinct is wrong, and backs it with numbers. Harness Updating Is Not Harness Benefit (Lin et al., arXiv:2605.30621, with co-authors from Penn State, UC Santa Cruz, and Amazon) [1] takes the self-evolving-agent loop apart and measures something most evaluations blur together.

Two Capabilities Everyone Conflates

In a self-evolving agent there are two distinct jobs, usually done by the same model:

Harness-updating (the evolver): read execution evidence (failures, traces, verifier feedback) and write a persistent update back into the harness.
Harness-benefit (the solver): actually use that updated harness to solve the next task.

End-to-end scores hide which one is doing the work. The paper fixes one role, varies the other, and measures each separately across seven LLMs (Claude Opus 4.6 / Sonnet 4.6 / Haiku 4.5, plus open models including Qwen3.5-9B and GPT-OSS-120B) on three agentic benchmarks (SWE-bench Verified, MCP-Atlas, SkillsBench) [1].
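A minimal sketch of that "fix one role, vary the other" design, assuming a hypothetical run_pair(evolver, solver) callable that returns a gain in percentage points. The function names and the toy numbers are illustrative only. This is not the paper's code or its data.

```python
from itertools import product

def role_sweep(evolvers, solvers, run_pair):
    """Score every (evolver, solver) pair so each role can be read in isolation."""
    return {(e, s): run_pair(e, s) for e, s in product(evolvers, solvers)}

def evolver_spread(scores, solver):
    """Fix the solver, vary the evolver: best-minus-worst gain in pp."""
    gains = [g for (e, s), g in scores.items() if s == solver]
    return max(gains) - min(gains)

def solver_gains(scores, evolver):
    """Fix the evolver, vary the solver: gain per solver in pp."""
    return {s: g for (e, s), g in scores.items() if e == evolver}

# Made-up gains in percentage points, keyed by (evolver, solver).
TOY_GAINS = {
    ("small", "mid"): 8.0, ("large", "mid"): 7.5,
    ("small", "strong"): 2.0, ("large", "strong"): 2.5,
}
scores = role_sweep(["small", "large"], ["mid", "strong"],
                    lambda e, s: TOY_GAINS[(e, s)])
print(evolver_spread(scores, "mid"))   # spread across evolvers for one solver
print(solver_gains(scores, "small"))   # gains across solvers for one evolver
```

Reading the grid by row or by column is what keeps the two capabilities apart. A single end-to-end score would collapse both into one number.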

Finding 1: Writing the Harness Is Flat in Capability

Fix the solver, swap the evolver, and the gains barely move. The spread between the best and worst evolver is at most 3.1 percentage points on any benchmark, and no model wins everywhere. The smallest model in the study, Qwen3.5-9B, posts the highest evolver gain on SkillsBench (3.8 pp), ahead of Opus 4.6 (2.3 pp) [1].

The case study is the punchline: a 9B evolver and a frontier evolver, asked to write a skill for the same task, produce skills that are procedurally isomorphic. Same steps, differing only in surface wording. Both lift the downstream agent identically.

Figure: Writing the harness is flat: a small model and a frontier model produce skills that lift the solver identically.

Finding 2: Benefiting From It Is an Inverted-U

Now fix the evolver and swap the solver. Gains do not rise with capability. On SWE-bench Verified the benefit peaks at a mid-tier model (Qwen3-235B, +19.3 pp) while the strong Opus 4.6 gains only +2.6 pp; on MCP-Atlas it peaks at GPT-OSS-120B (+7.0 pp) [1]. Strong models sit near the ceiling, with little headroom. The weak end is the interesting part: the models with the most headroom benefit the least.

The paper traces that to two failure modes, both measurable:

Activation failure: the weak model never loads the relevant skill. Skill-load rate is about 96% for strong models but 25% for Qwen3-32B [1].
Adherence failure: even when loaded, it stops following the skill as the task gets longer. Per-phase adherence drifts from load to final by −0.39 for the weak model versus −0.09 for Opus 4.6 [1].
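Both failure modes can be measured from run traces. The sketch below assumes a hypothetical Run record that carries a loaded flag and per-phase adherence scores between 0 and 1. The paper's exact definitions may differ, and the sample runs are made up.

```python
from dataclasses import dataclass, field

@dataclass
class Run:
    skill_loaded: bool                              # did the agent load the skill at all?
    adherence: list = field(default_factory=list)   # per-phase scores, load phase first

def skill_load_rate(runs):
    """Activation: share of runs in which the relevant skill was loaded."""
    return sum(r.skill_loaded for r in runs) / len(runs)

def mean_adherence_drift(runs):
    """Adherence: mean (final phase - load phase) over runs that loaded the skill."""
    drifts = [r.adherence[-1] - r.adherence[0]
              for r in runs if r.skill_loaded and r.adherence]
    return sum(drifts) / len(drifts) if drifts else None

# Made-up runs: one loads the skill and drifts, three never load it.
runs = [Run(True, [1.0, 0.75, 0.5]), Run(False), Run(False), Run(False)]
print(skill_load_rate(runs))       # low rate means activation failure
print(mean_adherence_drift(runs))  # negative drift means adherence failure
```

Tracking the two numbers separately matters because they call for different fixes: a low load rate is a skill-invocation problem, while a strongly negative drift is a long-horizon instruction-following problem.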

So “give the weak model a better harness” doesn’t rescue it. It can’t pick the skill up, and when it does, it drops it halfway.

I Ran the Mechanism on Bedrock

I wanted to feel the core claim, not just read it, so I built the smallest honest version of the loop on Amazon Bedrock: a cheap Haiku evolver writing a skill that a stronger Sonnet solver then uses. No benchmark harness, just the solve → evolve → solve loop through the Bedrock Converse API [2].

The task counts “completed” jobs from an event log, with a non-obvious convention the bare prompt doesn’t state: a job that emitted a FINISH event completed its run even if later cancelled. The solver can’t guess that. The evolver learns it from verifier-labeled feedback and writes it into a skill.
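To make the convention concrete, here is a small sketch with a made-up event log. The schema (job and event fields) and the job names are assumptions for illustration, not the log used in the run.

```python
# Hypothetical event log: one row per event, in time order.
LOG = [
    {"job": "a", "event": "START"},  {"job": "a", "event": "FINISH"},
    {"job": "b", "event": "START"},  {"job": "b", "event": "FINISH"},
    {"job": "b", "event": "CANCEL"},  # cancelled after finishing
    {"job": "c", "event": "START"},  {"job": "c", "event": "CANCEL"},
    {"job": "d", "event": "START"},  {"job": "d", "event": "FINISH"},
]

def jobs_with(log, event):
    """Distinct job ids that emitted the given event."""
    return {row["job"] for row in log if row["event"] == event}

def count_naive(log):
    """Plausible but wrong guess: finished jobs that were never cancelled."""
    return len(jobs_with(log, "FINISH") - jobs_with(log, "CANCEL"))

def count_completed(log):
    """Hidden convention: any job that emitted FINISH counts, even if cancelled later."""
    return len(jobs_with(log, "FINISH"))

print(count_naive(log=LOG), count_completed(log=LOG))
```

Job b is the one that separates the two readings: it finished and was cancelled afterwards, so only the convention-aware count includes it.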

The core solve → evolve → solve loop (Bedrock Converse):

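import boto3

# Assumed setup: a Bedrock Runtime client; region and credentials come from your AWS config.
brt = boto3.client("bedrock-runtime")

def ask(model, system, user):
    """Send one system + user turn to `model` via Converse and return the reply text."""
    r = brt.converse(
        modelId=model,
        system=[{"text": system}],
        messages=[{"role": "user", "content": [{"text": user}]}],
        inferenceConfig={"temperature": 0, "maxTokens": 512},
    )
    # The first content block of the assistant message holds the text reply.
    return r["output"]["message"]["content"][0]["text"]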

# 1) strong solver, no skill  -> fails the hidden convention
# 2) cheap evolver reads the failure + verifier labels -> writes a skill
# 3) strong solver + skill     -> passes

The run, verbatim:

solver  = us.anthropic.claude-sonnet-4-6
evolver = us.anthropic.claude-haiku-4-5-20251001
ground_truth = 6

[1] Solver (no skill)  -> 4   FAIL
[2] Skill written by CHEAP evolver (Haiku 4.5):
    "A job is COMPLETED if its event is FINISH, regardless of status.
     Count all jobs with event=FINISH."
[3] Solver (+ skill)   -> 6   PASS

RESULT: cheap evolver's skill fixed the solver

The cheap model wrote the skill that fixed the expensive one. That is the paper’s thesis in eight lines of output.

Two honest caveats, because the mechanism has sharp edges:

This is a mechanism demo, not a benchmark reproduction. It shows the loop, not the paper’s averages.
The evolver only wrote the right skill once I gave it verifier-labeled outcomes (which jobs counted). With a thinner signal it confidently wrote the wrong rule. The cheap evolver matches the expensive one when the feedback is good, so evidence quality, not model size, is the real dependency.
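The second caveat is about what the evolver gets to see. A sketch of the three steps with verifier labels passed through explicitly, assuming an ask(model, system, user) callable with the same shape as the one above. The prompts, the label format, and the fake_ask stub are illustrative assumptions, not the prompts from the run.

```python
def build_evidence(answer, expected, labels):
    """Format the failure plus per-job verifier labels for the evolver."""
    lines = [f"Solver answered {answer}; verifier expected {expected}."]
    lines += [f"job {job}: {'counts' if ok else 'does not count'}"
              for job, ok in labels.items()]
    return "\n".join(lines)

def solve_evolve_solve(ask, solver, evolver, task, expected, labels):
    """Run solve -> evolve -> solve; returns (first answer, skill, second answer)."""
    base = "Answer with a single integer."
    first = ask(solver, base, task).strip()
    if first == str(expected):
        return first, None, first          # nothing to learn from a pass
    skill = ask(evolver,
                "Write one short, general rule that explains the verifier labels.",
                build_evidence(first, expected, labels))
    second = ask(solver, base + "\nSkill:\n" + skill, task).strip()
    return first, skill, second

def fake_ask(model, system, user):
    """Offline stub with canned replies so the loop runs without Bedrock."""
    if model == "evolver":
        return "Count every job that emitted FINISH."
    return "3" if "Skill:" in system else "2"

result = solve_evolve_solve(fake_ask, "solver", "evolver",
                            "How many jobs completed?", 3,
                            {"a": True, "b": True, "c": False})
print(result)
```

Swapping fake_ask for a real Converse-backed ask leaves the loop unchanged. The only part that decides whether the evolver can find the rule is what build_evidence puts in front of it.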

What This Changes About Where You Spend

If writing is flat and benefiting peaks mid-tier, the budget logic flips:

Figure: Where the budget should go: cheap evolver, strong solver, and the savings into verifiers.

Don’t pay frontier prices to author skills. Run the evolver on a cheap model in the background.
Spend on the solver. That’s where post-evolution score actually moves.
Invest in the verifier, not the evolver. The cheap evolver only works with a real signal. A good verifier is worth more than a bigger author.
For weak deployed agents, harness updates aren’t the fix. Train (or pick) for skill invocation and long-horizon instruction-following first.

If You’re Running This on AWS

The setup above maps cleanly onto managed building blocks:

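Amazon Bedrock Converse API gives you one call across model tiers [2], so swapping a cheap evolver for a strong solver is a modelId change, not a rewrite. Set temperature=0 to make skill generation as repeatable as possible; it reduces run-to-run variation but does not guarantee identical output.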
Amazon Bedrock AgentCore Gateway turns existing REST APIs and Lambda functions into MCP-compatible tools for the solver to use [3]. That is the “tools” half of the harness, managed.
AgentCore Runtime + Memory give you somewhere to persist the evolved skills and memories between runs rather than rebuilding the harness each time [4].
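The "modelId change, not a rewrite" point can be sketched as a role-to-model map in front of a single Converse call. The model IDs are the ones from the run above. ROLE_MODELS, ask_as, and EchoClient are hypothetical names, and the client is passed in so the sketch runs offline.

```python
# Which model sits in which seat; change a seat by editing one string.
ROLE_MODELS = {
    "solver": "us.anthropic.claude-sonnet-4-6",
    "evolver": "us.anthropic.claude-haiku-4-5-20251001",
}

def ask_as(client, role, system, user):
    """Same Converse request for every role; only modelId differs."""
    r = client.converse(
        modelId=ROLE_MODELS[role],
        system=[{"text": system}],
        messages=[{"role": "user", "content": [{"text": user}]}],
        inferenceConfig={"temperature": 0, "maxTokens": 512},
    )
    return r["output"]["message"]["content"][0]["text"]

class EchoClient:
    """Offline stand-in: returns the modelId it was called with instead of calling Bedrock."""
    def converse(self, modelId, **kwargs):
        return {"output": {"message": {"content": [{"text": modelId}]}}}

print(ask_as(EchoClient(), "evolver", "You write skills.", "failure trace here"))
```

In a real deployment the client would be boto3.client("bedrock-runtime"), and the rest of the function stays as it is.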

A pragmatic pattern: schedule the evolver as a cheap, off-peak batch job that reads yesterday’s traces and verifier results and commits harness updates; keep the solver on your strong model for live, customer-facing work. You pay frontier rates only where they change the outcome.
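A rough sketch of that batch job, assuming traces arrive as a JSONL file with hypothetical task_id and verifier_passed fields and that skills live as one file per task. The write_skill callable stands in for the cheap evolver. Scheduling and the actual commit step are left out.

```python
import json
import tempfile
from pathlib import Path

def run_evolver_batch(trace_file, skills_dir, write_skill):
    """Turn verifier-failed traces into skill files; returns the paths written."""
    skills_dir = Path(skills_dir)
    skills_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for line in Path(trace_file).read_text().splitlines():
        if not line.strip():
            continue
        trace = json.loads(line)
        if trace["verifier_passed"]:
            continue                      # only failures carry something to learn
        path = skills_dir / f"{trace['task_id']}.md"
        path.write_text(write_skill(trace))
        written.append(path)
    return written

# Dry run with a stub evolver and two made-up traces.
with tempfile.TemporaryDirectory() as tmp:
    traces = Path(tmp) / "traces.jsonl"
    traces.write_text(
        json.dumps({"task_id": "count-jobs", "verifier_passed": False}) + "\n"
        + json.dumps({"task_id": "parse-dates", "verifier_passed": True}) + "\n"
    )
    written = run_evolver_batch(traces, Path(tmp) / "skills",
                                lambda t: f"Skill for {t['task_id']}")
    names = [p.name for p in written]
    contents = [p.read_text() for p in written]
print(names)
```

Keeping the evolver behind a plain callable means the batch job does not care which model tier writes the skill, which is the point of the pattern.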

The Bigger Pattern

I keep landing on the same idea from different directions: the advantage in agentic systems is moving out of the model and into the harness around it. I argued the harness is the primary control surface in On the Loop [5] and made it Pattern 5 in From Cloud-Native to AI-Native [6]; I called harness quality the new infrastructure in Code Quality as New Infrastructure [7]; and I looked at self-improvement itself in Self-Improving Models [8]. This paper adds the missing economic detail: who writes the harness matters far less than who runs it. That’s a gift, because it means the compounding asset, the harness itself, is cheap to maintain and expensive only to use well.

The harness is where your agent’s competence actually lives. You just don’t need your most expensive model to write it down.

Where are you spending today: on the model that writes the skills, or the one that runs them? And do you have a verifier good enough to trust the cheap author?

Sources

[1] Lin et al., Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents, arXiv:2605.30621 (2026). Code: https://github.com/A-EVO-Lab/a-evolve
[2] Amazon Bedrock Converse API: https://docs.aws.amazon.com/bedrock/latest/userguide/conversation-inference.html
[3] Amazon Bedrock AgentCore Gateway: https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/gateway.html
[4] Amazon Bedrock AgentCore: https://aws.amazon.com/bedrock/agentcore/
[5] On the Loop: Code Quality: https://schristoph.online/blog/on-the-loop-code-quality/
[6] From Cloud-Native to AI-Native: https://schristoph.online/blog/from-cloud-native-to-ai-native/
[7] Code Quality Is the New Infrastructure: https://schristoph.online/blog/code-quality-new-infrastructure/
[8] Self-Improving Models: https://schristoph.online/blog/self-improving-models/

About the Author

Stefan Christoph is a Principal Solutions Architect at AWS, focused on agentic AI, media & entertainment, and helping builders move from demo to production. He writes about AI architecture, developer productivity, and the future of software.

This is a personal blog. Opinions expressed here are my own and do not represent the views or positions of my employer.

Cross-posted to LinkedIn

❤️ Created with the support of AI (Kiro)