Cognitive Debt: The Hidden Cost of AI-Generated Code
11 minutes read
The Code Nobody Understands
Here’s a pattern I’ve seen across multiple teams: a data pipeline ships, built almost entirely by an AI coding agent. Clean architecture. Full test coverage. Passes every review gate. Two weeks later, a downstream service starts returning stale results. The on-call engineer opens the pipeline code and realizes she can’t explain why it worked in the first place.
The logic is correct. The tests are green. But the mental model, the shared understanding of why this code makes these decisions, doesn’t exist. The agent wrote it. The team approved it. Nobody internalized it.
She spends three days rebuilding her understanding of a system her team shipped a fortnight earlier. Not debugging. Not refactoring. Just reading, trying to reconstruct the intent behind code that has no human author to ask.
This is a new kind of debt. And it’s accumulating faster than most teams realize.
Three Layers of Debt
Margaret-Anne Storey’s recent paper [1] gives us the framework to talk about this precisely. She proposes three interacting layers of system health, each with its own form of debt:
Technical debt lives in the code. We know this one. Shortcuts in implementation that compromise future changeability. Tight coupling, missing abstractions, copy-pasted logic. It’s visible in the codebase if you look.
Cognitive debt lives in the people. It accumulates when shared understanding of the system erodes faster than it’s replenished. When the team can’t explain why a component behaves the way it does. When institutional knowledge walks out the door and nobody notices until something breaks.
Intent debt lives in the artifacts. It accumulates when the goals and constraints that should guide the system are poorly captured or not maintained. When the requirements doc says one thing, the code does another, and nobody remembers which is right.
These three layers interact. Technical debt makes cognitive debt worse (messy code is harder to understand). Cognitive debt makes intent debt invisible (if nobody understands the system, nobody notices when the specs drift). Intent debt makes technical debt inevitable (if you don’t know what the system should do, every implementation is a guess).
[Figure: The three layers of system health debt reinforce each other in a vicious cycle.]
AI-generated code accelerates all three. But cognitive debt is where the damage compounds fastest, because it’s the hardest to see.
Cognitive Surrender
Why does AI-generated code erode understanding so effectively? Wharton researchers Haiyang Shaw and Devin Nave offer a precise mechanism [2].
They extend Kahneman’s dual-process model (System 1 for fast intuition, System 2 for slow deliberation) with a third mode: System 3. This is AI-assisted cognition. And it introduces a failure mode they call cognitive surrender: uncritical reliance on externally generated reasoning that bypasses System 2 entirely.
The distinction matters. Cognitive offloading is strategic. You deliberately delegate a well-understood task to a tool while maintaining the ability to verify the result. You use a calculator for arithmetic. You use GPS for navigation. You could do it yourself; you choose not to.
Cognitive surrender is different. You accept the output without engaging the deliberative process at all. You don’t verify because you can’t. You’ve lost the context needed to evaluate whether the result is correct. The agent wrote the code. It looks reasonable. The tests pass. You approve the PR.
[Figure: Cognitive offloading is strategic delegation with verification. Cognitive surrender bypasses understanding entirely.]
This is what’s happening at scale in AI-assisted development. Jeremy Howard, the deep learning pioneer behind fast.ai, puts it bluntly: “Here’s a piece of code that no one understands. Am I going to bet my company’s product on it?” [3]. Anthropic’s own internal study found that engineers using AI scored 17% lower on knowledge quizzes about the software they worked with, with the biggest gap in debugging questions [4]. Correlation, not proven causation — engineers who lean hardest on AI may self-select for less engagement. But the direction is consistent with every other data point: the people shipping the code understand it less than they did before AI helped them write it.
Howard frames this through the lens of desirable difficulty, a concept from learning science. Memories don’t form unless retrieval is effortful. When you write code by hand, you build a mental model through friction: naming variables, choosing abstractions, handling edge cases. When an agent does that work, the friction disappears. And so does the learning. “If you’re not actively using your design and engineering muscles, you don’t grow,” Howard says. “You might even wither” [3].
The Evidence Is Piling Up
I’ve covered the productivity data extensively in previous posts: the METR study showing experienced developers 19% slower with AI [5], DX’s longitudinal finding of ~10% gains across 40 companies [7], Howard’s “tiny uptick” in what people actually ship [3]. The numbers are consistent and well-established. (See The Bottleneck Moved and Code Quality Is the New Infrastructure for the full analysis.)
What those posts didn’t explore is why. The cognitive debt framing gives us the mechanism.
Martin Fowler, after 40+ years in software, describes working with LLMs as collaborating with “a rather dodgy collaborator who’s very productive in the lines-of-code sense of productivity, but you can’t trust a thing that they’re doing” [6]. His advice: work in very thin slices, treat every slice as a PR from an untrustworthy junior, review everything. And then the kicker: “If they were truly a junior developer, which is how sometimes people like to say they should be characterized, I would be having some words with HR” [6].
The generation is visible. The understanding loss is not — until something breaks.
The Verification Shift
If cognitive debt is the problem, what does the organizational response look like?
Ajey Gore frames it as a structural shift [8]. If coding agents make coding free, verification becomes the expensive thing. His formulation is the most concrete version of this I’ve seen:
“Your org chart should reflect this. The team that used to have ten engineers building features now has three engineers and seven people defining acceptance criteria, designing test harnesses, and monitoring outcomes.”
Read that again. Three builders. Seven verifiers. That’s not empirical data from a specific organization — it’s a directional framing. But the direction is right. It’s not a tweak to the development process. It’s an inversion. The scarce resource isn’t someone who can write code. It’s someone who can determine whether the code is correct, complete, and aligned with what the business actually needs.
[Figure: Gore’s inversion, from a team of builders with one QA to a team of verifiers with agents generating code.]
This maps to what I’ve been writing about for months. In my piece on the “on the loop” approach [9], I argued that the human role shifts from writing code to designing the constraints that agents operate within. Gore’s framing makes the staffing implication explicit: you need more people on verification than on generation. The bottleneck moved.
Fowler sees the same shift through the lens of non-determinism [6]. Previous tools were predictable: same input, same output. LLMs aren’t. He borrows “tolerance thinking” from structural engineering: you can’t skate close to the edge because bridges will collapse. When your code generator is non-deterministic, you need more verification infrastructure, not less.
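One way to read tolerance thinking in code terms: don't accept a non-deterministic generator because it passed once, and don't accept it when it only just clears the bar. The sketch below is my interpretation, not something Fowler prescribes. `make_flaky_generator` is a hypothetical stand-in for an LLM call, and the `required` and `margin` values are assumptions you would set per system.

```python
import random
from typing import Callable


def make_flaky_generator(seed: int, error_rate: float) -> Callable[[], int]:
    """Hypothetical stand-in for an LLM call: returns 42, or -1 at roughly `error_rate`."""
    rng = random.Random(seed)
    return lambda: 42 if rng.random() >= error_rate else -1


def pass_rate(generate: Callable[[], int], check: Callable[[int], bool], runs: int) -> float:
    """Sample the generator `runs` times and return the fraction of outputs that pass `check`."""
    passed = sum(1 for _ in range(runs) if check(generate()))
    return passed / runs


def within_tolerance(rate: float, required: float, margin: float) -> bool:
    """Accept only if the observed rate clears the requirement plus a safety margin."""
    return rate >= required + margin


if __name__ == "__main__":
    generate = make_flaky_generator(seed=0, error_rate=0.08)
    rate = pass_rate(generate, check=lambda value: value == 42, runs=200)
    accepted = within_tolerance(rate, required=0.90, margin=0.05)
    print(f"pass rate {rate:.2f}, accepted: {accepted}")
```

The margin is the point. A single green run says little about a generator that can answer differently next time, so the harness samples repeatedly and refuses results that sit close to the edge.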
Spec-Driven Development as Cognitive Debt Prevention
If cognitive debt accumulates when intent isn’t captured and understanding isn’t built, the antidote is a process that forces both.
This is where spec-driven development enters. Not as a productivity technique, but as a cognitive debt prevention strategy.
The pattern: instead of prompting an agent with “build me a data pipeline,” you decompose the work into explicit requirements, design documents, and acceptance criteria before any code is generated. Each phase produces artifacts that capture intent. The agent executes against those artifacts. When something breaks, you don’t need to reverse-engineer the agent’s reasoning. You check the code against the spec.
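To make "check the code against the spec" concrete, here is a minimal sketch of acceptance criteria that carry their rationale with them. Everything in it is illustrative: the `Criterion` type, the row shape with `id` and `age_hours`, and the two example criteria are assumptions, not part of any particular spec-driven tool.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical record shape for the illustration: one row the pipeline emits.
Row = dict[str, int]


@dataclass(frozen=True)
class Criterion:
    """One acceptance criterion: the intent in words plus an executable check."""
    name: str
    rationale: str  # the "why", stored next to the check so it outlives the sprint
    check: Callable[[list[Row]], bool]


def no_duplicate_ids(rows: list[Row]) -> bool:
    ids = [row["id"] for row in rows]
    return len(ids) == len(set(ids))


def all_rows_fresh(rows: list[Row], max_age_hours: int = 24) -> bool:
    return all(row["age_hours"] <= max_age_hours for row in rows)


SPEC = [
    Criterion("unique ids", "Downstream joins assume one row per id.", no_duplicate_ids),
    Criterion("fresh data", "Consumers must never see rows older than a day.", all_rows_fresh),
]


def failed_criteria(rows: list[Row], spec: list[Criterion]) -> list[Criterion]:
    """Return the criteria the output violates, rationale included."""
    return [criterion for criterion in spec if not criterion.check(rows)]


if __name__ == "__main__":
    output = [{"id": 1, "age_hours": 2}, {"id": 2, "age_hours": 30}]
    for criterion in failed_criteria(output, SPEC):
        print(f"FAILED {criterion.name}: {criterion.rationale}")
```

When the stale-results page fires, the on-call engineer starts from a named criterion and its stated reason instead of reconstructing intent from the implementation.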
Fowler and his colleague Unmesh Joshi take this further with what they call building a ubiquitous language for LLMs [10], borrowing directly from domain-driven design. The insight: LLMs can’t learn chess from plain English game descriptions, but they can from formal chess notation. Rigorous, domain-specific vocabularies produce more reliable agent output than eloquent natural language prompts. As Joshi puts it: “The most creative act is this continual weaving of names that reveal the structure of the solution” [10].
This isn’t just about better prompts. It’s about building shared understanding between humans and agents, and among the humans on the team. When you invest in naming things precisely, in capturing constraints explicitly, in maintaining a vocabulary that maps cleanly to the domain, you’re paying down cognitive debt before it accrues. The spec becomes the institutional memory that survives individual contributors leaving, agents being swapped out, and requirements evolving.
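A shared vocabulary is easier to keep when something checks it. The sketch below flags identifiers that use words outside an agreed glossary. It is only an illustration: the `GLOSSARY` and `GENERIC` sets are invented for a hypothetical order domain, and a real check would also need to handle plurals and abbreviations.

```python
import re

# Hypothetical glossary for an order pipeline; a real one comes from the team's domain model.
GLOSSARY = {"order", "shipment", "invoice", "refund", "customer"}
# Generic programming words that are allowed alongside domain terms (also an assumption).
GENERIC = {"id", "count", "list", "is", "has", "get", "total", "by", "for"}


def words_in(identifier: str) -> list[str]:
    """Split a snake_case or camelCase identifier into lowercase words."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", identifier)
    return [word.lower() for word in spaced.split("_") if word]


def unknown_terms(identifiers: list[str]) -> dict[str, list[str]]:
    """Map each identifier to the words found in neither the glossary nor the generic list."""
    allowed = GLOSSARY | GENERIC
    report: dict[str, list[str]] = {}
    for identifier in identifiers:
        unknown = [word for word in words_in(identifier) if word not in allowed]
        if unknown:
            report[identifier] = unknown
    return report


if __name__ == "__main__":
    generated = ["orderTotal", "purchase_count", "refundList"]
    for identifier, words in unknown_terms(generated).items():
        print(f"{identifier}: not in glossary: {', '.join(words)}")
```

A flagged word is a prompt for a conversation, not an error by itself: either the agent drifted from the team's language, or the glossary is missing a concept the domain actually needs.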
IBM’s framing of “agentic engineering” [11] points in the same direction: the human becomes the architect and orchestrator; the agent handles execution. But the architect role only works if the architecture is captured in artifacts that others can read, challenge, and maintain. Otherwise you’ve just moved the cognitive debt from the code to the architect’s head — a single point of failure with a bus factor of one.
What Architects Should Do Now
Cognitive debt isn’t theoretical. It’s accumulating in every team that ships AI-generated code without investing in understanding. Here’s what I’d recommend:
1. Name the debt. Start using Storey’s three-layer model in architecture reviews. When someone says “the agent wrote it and the tests pass,” ask: “Who on this team can explain why it works?” If the answer is nobody, you have cognitive debt. Make it visible.
2. Measure understanding, not just coverage. Test coverage tells you the code runs. It doesn’t tell you anyone understands it. Add “explanation reviews” to your PR process: before merging AI-generated code, the reviewer must write a one-paragraph summary of what the code does and why. If they can’t, the PR isn’t ready.
3. Invest in specs before generation. Every hour spent on requirements and design documents before the agent runs is an hour of cognitive debt prevention. Capture intent in artifacts that outlive the sprint. Use acceptance criteria as the contract between human intent and agent execution.
4. Build your ubiquitous language. Invest in domain-specific naming conventions, glossaries, and design vocabularies. These aren’t bureaucratic overhead. They’re the interface between your team’s understanding and the agent’s execution. The more precise your language, the less cognitive debt each generation cycle creates.
5. Staff for verification. Gore’s 3:7 ratio may not be your exact number, but the direction is right. You need more people who can evaluate whether code is correct than people who can generate it. Hire for judgment, not just output.
6. Preserve desirable difficulty. Not every task should be delegated to an agent. Deliberately keep some implementation work with human developers, especially for junior engineers building their mental models. The friction of writing code by hand is how understanding forms. Remove all friction and you remove all learning. A useful heuristic: if the task touches a domain boundary, an integration contract, or error handling logic, a human should write it. These are the areas where understanding matters most and where agents make the most confidently wrong decisions.
7. Rotate agent-generated code ownership. Don’t let AI-generated modules become orphans that nobody understands. Assign human owners. Require periodic walkthroughs. Treat understanding as a perishable asset that needs active maintenance.
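Items 2, 6, and 7 can be partly enforced with a lightweight pre-merge check. What follows is a rough sketch under stated assumptions: the `PullRequest` fields, the area labels, and the 40-word threshold are all hypothetical, and a word count is only a crude proxy for whether the reviewer actually understands the change.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical labels a team might attach to a change; they mirror the heuristic in item 6.
HUMAN_ONLY_AREAS = {"domain-boundary", "integration-contract", "error-handling"}
MIN_EXPLANATION_WORDS = 40  # assumed stand-in for "one paragraph"; tune per team


@dataclass
class PullRequest:
    title: str
    ai_generated: bool
    reviewer_explanation: str = ""
    areas: set[str] = field(default_factory=set)
    owner: Optional[str] = None


def merge_blockers(pr: PullRequest) -> list[str]:
    """Return the reasons this PR is not ready to merge; an empty list means ready."""
    blockers: list[str] = []
    if not pr.ai_generated:
        return blockers
    # Item 2: the reviewer must be able to explain what the code does and why.
    if len(pr.reviewer_explanation.split()) < MIN_EXPLANATION_WORDS:
        blockers.append("reviewer has not explained what the code does and why")
    # Item 6: some areas are deliberately kept with human authors.
    touched = sorted(pr.areas & HUMAN_ONLY_AREAS)
    if touched:
        blockers.append("human-authored code expected for: " + ", ".join(touched))
    # Item 7: agent-generated modules need a named human owner.
    if pr.owner is None:
        blockers.append("no human owner assigned")
    return blockers


if __name__ == "__main__":
    pr = PullRequest("Add retry logic", ai_generated=True, areas={"error-handling", "logging"})
    for reason in merge_blockers(pr):
        print("BLOCKED:", reason)
```

A gate like this cannot measure understanding. What it can do is make the absence of an explanation or an owner visible at the moment the team is deciding whether to merge.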
The teams that will thrive aren’t the ones generating the most code. They’re the ones that maintain the deepest understanding of the code they ship, regardless of who or what wrote it.
Cognitive debt is patient. It waits until the system is under stress, the original team has moved on, and the on-call engineer is staring at code nobody can explain. By then, the interest payments are enormous.
Start paying it down now.
💬 How does your team maintain understanding of AI-generated code? What practices are working?
Sources
[1] Storey, M.-A. “Three Layers of System Health” — arxiv.org/abs/2603.22106
[2] Shaw, H. & Nave, D. “System 3: Cognitive Surrender” — Wharton — SSRN 6097646
[3] Howard, J. “The Dangerous Illusion of AI Coding?” — Machine Learning Street Talk — youtube.com
[4] Anthropic — “AI Assistance and Coding Skills” (January 2026) — anthropic.com
[5] Becker, J. “Why Agent Hype Can Fall Short of Reality” — METR / AI Engineer — youtube.com
[6] Fowler, M. “How AI Will Change Software Engineering” — The Pragmatic Engineer Podcast — youtube.com
[7] DX — “AI Productivity Gains Are 10%, Not 10x” (March 2026) — newsletter.getdx.com
[8] Gore, A. “The Expensive Thing” — ajeygore.in
[9] Christoph, S. “On the Loop, Not In It — But Code Quality Still Matters” — schristoph.online
[10] Fowler, M. & Joshi, U. “Building Abstractions with LLMs” — martinfowler.com
[11] IBM — “What is Agentic Engineering?” (March 2026) — ibm.com
Related Writing
- AI Coding Productivity: 10%, Not 10x — the empirical productivity data behind the 10% finding
- On the Loop, Not In It — the harness engineering approach to agent oversight
- The Citation Crisis — the same verification gap in a different domain