Code Quality Is the New Infrastructure

Code quality is the foundation. AI agents are the building.
Ten talks. Ten practitioners from different companies — Anthropic, Google, HashiCorp, Thoughtworks, Answer.AI, Factory, and independent creators. None of them coordinated. All of them arrived at the same conclusion.
Clean code isn’t a nice-to-have in the age of agents. It’s infrastructure.
I spent the last few weeks watching the Pragmatic Engineer podcast series on AI-assisted software engineering, plus Jeremy Howard’s deep dive on Machine Learning Street Talk and Eno Reyes’s talk at the AI Engineer Summit. These aren’t pundits speculating about the future — they’re builders shipping production software with AI agents every day. And the pattern that emerges across all of them is striking: the bottleneck for agent productivity isn’t model capability. It’s the codebase the model has to work with.
The Data: This Isn’t Anecdotal
Jeremy Howard — deep learning pioneer, Kaggle grandmaster, CEO of Answer.AI — brings the hardest data. His team studied what people actually ship with AI coding tools. The result: “a tiny uptick.” Not 10x. Not even 2x. A tiny uptick. Howard himself has roughly 90% of his code typed by LLMs, yet isn’t dramatically more productive — because typing was never the bottleneck. Fred Brooks predicted this in 1986: the essential complexity of software isn’t in the syntax.
Boris Cherny — creator of Claude Code, formerly leading code quality across all of Meta — provides the causal link. At Meta, his “Better Engineering” program proved through rigorous analysis that code quality contributes double-digit percentage points to engineering productivity. Not a feel-good metric. A measured, causal relationship. And the insight carries directly into the agentic era: clean codebases with finished migrations, consistent patterns, and clear naming make agents faster and more reliable. Poorly named functions send agents confidently in the wrong direction.
Martin Fowler — 40+ years in software, author of Refactoring — frames the theoretical shift. AI introduces non-determinism into a discipline that was fundamentally deterministic. Previous tools were predictable: same input, same output. LLMs aren’t. This requires what Fowler calls “tolerance thinking” borrowed from structural engineering — you can’t skate close to the edge because bridges will collapse. And what determines how close to the edge you are? The quality of the codebase the agent operates in.
Fowler’s characterization of working with LLMs is memorable: treat every output as a PR from “a rather dodgy collaborator who’s very productive in the lines-of-code sense of productivity — but you can’t trust a thing that they’re doing.” The speed-up is real but modest. The risk is real and compounding.
The Mechanism: Why Agents Spiral on Bad Code
I’ve written before about watching an agent waste 15 minutes chasing a bug that didn’t exist because a function called transformPayload() didn’t transform anything — it validated. The agent built three layers of transformation logic on top of it before realizing the name was a lie.
That’s not an edge case. It’s the default failure mode. And these practitioners’ talks reveal exactly why.
Naming is context. Variable and function names are the first thing both humans and agents read to build a mental model. Bad names don’t just slow agents down — they send them confidently in the wrong direction. Cherny saw this at Meta scale. I see it daily in customer codebases. The agent reads doSomeMagic(), assumes it does magic, and builds on that assumption until reality intervenes — usually several hundred tokens later.
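The failure mode fits in a few lines. A hypothetical sketch (the function names and payload shape are invented for illustration): two functions with identical behavior, one under a lying name and one under an honest one.

```python
def transformPayload(payload: dict) -> dict:
    # Despite the name, nothing is transformed: this only validates.
    # An agent reading the name will assume it mutates the payload
    # and stack transformation logic on top of that assumption.
    if "id" not in payload:
        raise ValueError("payload missing 'id'")
    return payload

def validate_payload(payload: dict) -> dict:
    # Identical behavior, but the name matches what the code does,
    # so the agent's first assumption is the correct one.
    if "id" not in payload:
        raise ValueError("payload missing 'id'")
    return payload
```

Nothing about the behavior changed between the two; only the name did. That one-word difference is the difference between an agent that builds on reality and one that builds on fiction.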
Coupling creates blast radius. When an agent changes one module and three unrelated tests break, it doesn’t understand why any better than a junior developer would. It just burns more tokens trying. Tight coupling, hidden side effects, implicit dependencies — these aren’t just “developer experience” problems. They’re the exact patterns that cause agents to spiral. Each failed attempt adds noise to the context window, degrading subsequent reasoning.
Shallow modules multiply the problem. Matt Pocock, citing John Ousterhout’s A Philosophy of Software Design, identifies a specific architectural pattern that AI produces by default: shallow modules — many small files with complex dependency graphs. Deep modules — small interface, lots of functionality — are easier to test, easier for agents to navigate, and produce better feedback loops. Pocock puts it bluntly: “Bad codebases make bad agents. If you have a garbage codebase, you’re going to get garbage out of the agent.”
Eno Reyes from Factory AI quantifies the organizational dimension: most codebases operate at 50–60% test coverage with flaky builds that humans silently tolerate. That’s fine for human developers who test manually and catch issues through intuition. It breaks agents completely. The limiter isn’t model capability — it’s your organization’s automated validation infrastructure.
The Practitioner Playbook
What’s striking about these ten talks isn’t just the diagnosis. It’s that the prescriptions converge too. Three patterns emerge repeatedly.
TDD as Trust Mechanism
Kent Beck — creator of TDD, programming for 52 years — now spends 6–10 hours a day coding with agents. His central metaphor: AI agents are genies. You make a wish, and you get something — but it’s not always what you wanted. He describes asking an agent to implement a parser, only to discover it had built a hardcoded lookup table. He erased it. An hour later, the lookup table was back.
Beck’s solution: tests as immutable constraints the agent cannot cross. His test suite runs in 300 milliseconds. It runs constantly. The human writes the expected behavior; the agent figures out how to satisfy it. “90% of my skills just went to zero dollars and 10% of my skills just went up a thousand times.” Knowing where to put brackets — zero. Having a vision, controlling complexity, maintaining design coherence — 1000x.
Simon Willison — creator of Datasette, prolific open-source developer — starts every agent session with five tokens: “use red green TDD.” The agent writes a test, watches it fail, writes the minimal implementation to pass. Tests used to be optional extra work; with agents, they’re effectively free. “Tests are no longer even remotely optional.”
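In code, that instruction amounts to the following loop. A minimal Python sketch, with the function and test invented for illustration:

```python
# Red: write the test before the implementation and watch it fail.
def test_slugify():
    assert slugify("Hello World") == "hello-world"

# Calling test_slugify() at this point raises NameError:
# slugify does not exist yet. That failure is the point.

# Green: the minimal implementation that makes the test pass,
# and nothing more.
def slugify(title: str) -> str:
    return title.lower().replace(" ", "-")

test_slugify()  # now passes
```

The human supplied the contract in the test; the agent's only job was to satisfy it. Any hardcoded shortcut that breaks the contract fails immediately, not three features later.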
Willison adds a crucial layer: manual verification on top. He tells agents to start the server and exercise the API with curl. Passing tests don’t guarantee the server boots. His new tool Showboat generates a markdown document of these manual test runs — a human-readable proof of correctness.
Deep Modules and Codebase Architecture
Pocock’s extended workshop at the AI Engineer Conference makes the architectural case concrete. Even with 1M context windows, useful reasoning degrades around 100K tokens. The extra capacity is retrieval headroom, not coding headroom. Every unnecessary file, every shallow abstraction, every indirection layer eats into that budget.
Deep modules — Ousterhout’s concept — are the antidote. Small interface, lots of functionality behind it. Easier to test, easier for agents to navigate, fewer files to load into context. Pocock ships a skill that scans codebases for deepening opportunities. The principle is simple: reduce the surface area the agent has to reason about.
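A toy contrast in Python (the config format and function names are invented): the same functionality exposed as three shallow pieces the caller must wire together, then as one deep module with a single entry point.

```python
import json

# Shallow: three tiny functions the caller (or agent) has to discover,
# load into context, and compose in the right order.
def parse_config(text):
    return json.loads(text)

def validate_config(raw):
    if "name" not in raw:
        raise ValueError("config needs a 'name'")
    return raw

def apply_defaults(raw):
    return {"name": raw["name"], "retries": int(raw.get("retries", 3))}

# Deep: one small interface, all the functionality behind it.
def load_config(text: str) -> dict:
    """Parse, validate, and normalize a JSON config in one call."""
    return apply_defaults(validate_config(parse_config(text)))
```

The caller sees one name and one contract. The agent loads one function into its context window instead of a dependency graph, which is exactly the surface-area reduction Pocock is arguing for.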
This connects to what Reyes calls the “DevX flywheel.” Better validation makes agents more effective. More effective agents can identify validation gaps and remediate them — generate missing tests, tighten linter rules, add documentation. This improves the environment, which makes agents even more effective. The real productivity gains come not from switching between coding tools that differ by 10% on benchmarks, but from investing in the environment that makes all tools succeed.
Harness Engineering
Mitchell Hashimoto — co-founder of HashiCorp, creator of Terraform — articulates the meta-principle: when an agent makes a mistake, don’t just fix it — build tooling that would have prevented or course-corrected the mistake. He calls this “harness engineering” and considers it the key shift in the agentic era: moving from building the product to building the harness for product development.
Peter Steinberger — creator of PSPDFKit (on 1B+ devices) and Clawd — implements this at the extreme. He ships code he doesn’t read. Provocative, but the system underneath is disciplined: every feature must have a way for the agent to verify its own work. He builds CLI wrappers around GUI features specifically so agents can test in tight loops. “The whole secret is closing the loop. The model always needs to be able to verify the work itself.”
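A minimal sketch of that pattern (the feature and the wrapper are hypothetical): expose the feature through a CLI so the agent can run it and check the output itself instead of needing a human to click through a UI.

```python
import argparse
import sys

def render_title(text: str) -> str:
    # The "feature" under test; in a real app this would drive the GUI.
    return text.strip().title()

def main(argv=None) -> int:
    # CLI wrapper: gives the agent a scriptable way to exercise the
    # feature and verify its own work in a tight loop.
    parser = argparse.ArgumentParser(description="agent-verifiable wrapper")
    parser.add_argument("text")
    args = parser.parse_args(argv)
    print(render_title(args.text))
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

An agent can now shell out to the wrapper, read stdout, and compare it against the expected result, closing the loop without any human in it.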
Hashimoto’s agents.md pattern — every time an agent makes a mistake, document it so it never makes that mistake again — is the simplest version of harness engineering. The compound effect over weeks is significant. The harness accumulates institutional knowledge that makes every subsequent agent session more reliable.
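A hypothetical agents.md entry in that spirit (the contents are invented for illustration):

```markdown
## Mistakes to never repeat

- Do not add transformation logic to `transformPayload()`. Despite the
  name, it only validates. Rename it before extending it.
- Integration tests need the local server running first. Without it
  they fail with misleading connection errors, not a clear message.
```

Each entry costs a sentence to write and saves every future session the tokens it would have burned rediscovering the same trap.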
Addy Osmani from Google’s Chrome team adds the review dimension: as AI increases code velocity, human review capacity doesn’t scale with it. The temptation is to let AI review AI-generated code. Osmani flags this as a slippery slope: “If AI writes and reviews, who actually knows what’s landing?” The answer isn’t to remove the bottleneck but to treat it as the quality gate it always was — and invest in the harness that reduces what needs human review.
The Organizational Dimension
This is where the conversation gets uncomfortable. Code quality has always been a team sport, but most organizations treat it as a developer preference rather than infrastructure investment.
Reyes frames it through a Google analogy: why can a new grad at Google ship a change to YouTube’s border radius without taking down a billion-user service? Not because of individual skill — because of the validation infrastructure. Coding agents are the new “zero-context new grads.” They can be equally productive if the guardrails exist.
Cherny’s experience at Meta is instructive. Zuckerberg mandated that every engineer spend 20% of time on tech debt — the “Better Engineering” program. Cherny’s team proved the ROI through causal analysis. Most organizations don’t have that mandate. They have backlogs full of features and tech debt that accumulates silently — until agents arrive and the debt becomes visible as wasted tokens, failed generations, and spiraling context windows.
Howard raises the deepest organizational concern: understanding debt. When developers stop writing code, they stop building mental models. The Anthropic study on Claude Code confirmed this — so little friction that people didn’t learn anything. Organizations that outsource cognition to LLMs erode the knowledge that makes them adaptive. The codebase degrades not because the code gets worse, but because nobody understands it well enough to know what “worse” means.
Fowler’s advice cuts through the complexity: don’t try to do more per cycle — increase cycle frequency. Do half as much in half the time. The fundamental agile insight — tight feedback loops drive learning — applies more than ever. And tight feedback loops require clean code, fast tests, and clear contracts. The infrastructure, again.
The Convergence
Here’s what strikes me most: these ten practitioners didn’t read each other’s talks. Howard didn’t cite Cherny. Beck didn’t reference Pocock. Hashimoto and Steinberger arrived at harness engineering independently. Willison and Beck both landed on TDD as the trust mechanism without coordinating.
When independent observers converge on the same conclusion from different starting points, that’s signal. And the signal is clear:
The organizations that will thrive with AI agents are the ones that treat code quality as infrastructure, not as a developer preference, not as a nice-to-have, not as something you’ll get to after the next sprint. Infrastructure. The kind you invest in before you build on top of it.
Clean naming. Deep modules. Fast tests. Automated validation. Harness engineering. These aren’t relics of the pre-AI era. They’re the foundation that makes the AI era actually work.
The agents are only as good as the codebase they operate in. Invest accordingly.
If You’re Building This with Kiro
The patterns above (TDD, deep modules, harness engineering, automated validation) are exactly what Kiro [11] was designed to operationalize. Here’s how the practitioner playbook maps to concrete Kiro features.
Spec-Driven Development as Default
When you start a feature in Kiro’s spec mode, it doesn’t jump to code generation. It creates three specification files first [12]:
- requirements.md breaks your prompt into structured requirements using the EARS format (Easy Approach to Requirements Syntax): “When the user submits a form with invalid email, the system shall display an inline error message.” Precise, testable, unambiguous.
- design.md captures architecture decisions, component interfaces, and data flows. This is where deep module boundaries get defined before any code exists.
- tasks.md decomposes the design into discrete, testable implementation tasks with explicit acceptance criteria.
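A hypothetical requirements.md fragment in EARS style (the feature is invented for illustration):

```markdown
# Requirements: form validation

1. When the user submits a form with an invalid email, the system
   shall display an inline error message.
2. While the form is submitting, the system shall disable the
   submit button.
```

Each requirement names a trigger and an observable behavior, which is what makes it directly translatable into acceptance tests.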
Each phase produces artifacts that constrain the next. The agent doesn’t generate code from a vague prompt. It generates code that satisfies acceptance criteria derived from requirements you approved. When something breaks, the spec tells you whether the code is wrong or the requirement was incomplete.
This is the “on the loop” pattern made tangible. The spec is the harness. You design it. The agent runs within it.
Steering Files: Persistent Code Quality Context
Kiro’s steering files [13] give the agent persistent knowledge about your codebase through markdown files: structure.md (architecture), tech.md (stack and patterns), and product.md (business context). These aren’t one-time prompts. They persist across sessions, so the agent always knows your naming conventions, module boundaries, and architectural constraints.
This directly addresses the “agents spiral on bad code” problem. When the agent knows that transformPayload() is actually a validation function (because structure.md says so), it doesn’t build three layers of transformation logic on top of a misleading name.
Property-Based Testing Built In
Kiro generates property-based tests as part of every spec workflow [14]. For bugfix specs, it creates tests that verify the bug exists, tests that verify the fix resolves it, and tests that verify unchanged behavior continues working. This is Beck’s TDD loop and Willison’s “5-token instruction” automated into the development flow.
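The idea behind property-based testing can be sketched with the standard library alone. The function under test and its invariants are invented for illustration, and real property-testing tools add failure shrinking and smarter input generation on top of this:

```python
import random

def dedupe(items):
    # The "fix" under test: remove duplicates while preserving order.
    seen = set()
    return [x for x in items if not (x in seen or seen.add(x))]

def check_properties(trials: int = 200) -> bool:
    rng = random.Random(0)  # fixed seed so the check is reproducible
    for _ in range(trials):
        xs = [rng.randint(0, 9) for _ in range(rng.randint(0, 20))]
        out = dedupe(xs)
        assert len(out) == len(set(out))             # no duplicates remain
        assert set(out) == set(xs)                   # no elements lost
        assert out == sorted(set(xs), key=xs.index)  # first-seen order kept
    return True

check_properties()
```

Instead of one hand-picked example, the invariants are checked across hundreds of random inputs, which is what lets these tests catch the regressions a single example-based test would miss.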
A Real-World Example
An AWS team recently built a production drug discovery agent using Kiro’s spec-driven approach [15]. Three solution architects, three weeks, from spec to production. Kiro generated over 95% of the business logic code. The key: they invested time upfront in detailed specifications and used steering files to maintain consistent architectural decisions across the entire codebase. The specs became living documentation that the agent referenced throughout development.
The pattern works because it encodes exactly what the ten practitioners in this article converge on: clean structure, explicit contracts, fast feedback loops, and persistent context. Kiro just makes it the default workflow instead of a discipline you have to maintain manually.
Sources:
- Jeremy Howard — “The Dangerous Illusion of AI Coding?” (Machine Learning Street Talk): YouTube
- Boris Cherny — “Building Claude Code” (Pragmatic Engineer): YouTube
- Martin Fowler — “How AI Will Change Software Engineering” (Pragmatic Engineer): YouTube
- Kent Beck — “TDD, AI Agents and Coding” (Pragmatic Engineer): YouTube
- Simon Willison — “Engineering Practices That Make Coding with AI Work” (Pragmatic Engineer): YouTube
- Mitchell Hashimoto — “Mitchell Hashimoto’s New Way of Writing Code” (Pragmatic Engineer): YouTube
- Peter Steinberger — “I Ship Code I Don’t Read” (Pragmatic Engineer): YouTube
- Addy Osmani — “Beyond Vibe Coding” (Pragmatic Engineer): YouTube
- Matt Pocock — “Essential Skills for AI Coding” (AI Engineer Conference): YouTube
- Eno Reyes — “Making Codebases Agent Ready” (AI Engineer Summit): YouTube
- Stefan Christoph — “On the Loop, Not In It — But Code Quality Still Matters”: schristoph.online
- [11] Kiro — Spec-Driven Agentic IDE: kiro.dev
- [12] Kiro Blog — “From chat to specs: a deep dive into AI-assisted development with Kiro”: kiro.dev
- [13] AWS — “From spec to production: a three-week drug discovery agent using Kiro”: AWS Blog
- [14] Kiro Blog — “New spec types: fix bugs and build on top of existing apps”: kiro.dev
- [15] AWS — same article as [13]
What’s your experience — do your agents perform better on clean codebases? I’d love to hear concrete examples.
#CodeQuality #AgenticCoding #SoftwareEngineering #AI #TDD #DeveloperProductivity