The Post-Agile Operating Model: How AI Changes How Teams Ship
The 7x Gap
Last month at the AI Engineer Summit, McKinsey presented findings from a survey of roughly 300 enterprises. The headline number was sobering: most organizations see only 5–15% productivity gains from AI coding tools. That’s it. After the licenses, the hackathons, the executive memos about “AI transformation.” A rounding error.
But buried in the same data was a different story. Top performers weren’t just doing slightly better. They were 7x more likely to have AI-native workflows spanning the entire development lifecycle, and 6x more likely to have restructured their teams around new roles. Their time to market improved 5–6x. One bank case study showed a 51% increase in code merges and a 60x increase in agent consumption after restructuring.
The difference wasn’t the tools. Both groups had access to the same models, the same copilots, the same agents. The difference was the operating model. The top performers had stopped trying to make AI fit inside Agile. They’d built something new.
What Agile Assumed
Agile was a brilliant response to the problems of 2001. Waterfall was killing projects with years of upfront planning that never survived contact with reality. The Agile Manifesto said: work in small increments, get feedback fast, adapt continuously. It worked. For two decades, it was the best framework we had.
But Agile was designed for human-only teams. Its core assumptions reflect that:
Sprint cadence assumes human velocity. Two-week sprints exist because that’s roughly how long it takes a team of humans to design, build, test, and ship a meaningful increment. When agents can produce several days’ worth of code in fifteen minutes (as Adrian Cockcroft demonstrated at QCon), the sprint boundary becomes an artificial constraint, not a useful rhythm.
Story points assume human estimation. Story points measure the effort a human team expects to spend. But effort is no longer the bottleneck. A task that’s “8 points” for a human team might take an agent ten minutes, or might take three days because the codebase is too tangled for the agent to work through. The variance isn’t about human effort anymore. It’s about codebase quality, spec clarity, and agent capability. Story points measure the wrong thing. And to be honest, nobody has a clean replacement yet. Some teams use “number of specs completed” as a proxy; others track throughput (features shipped per week). This is an open problem: leadership still needs forecasts, and continuous planning alone doesn’t solve the capacity question. A rough forecasting sketch follows these four assumptions.
Roles assume human specialization. Agile teams have product owners, scrum masters, front-end developers, back-end developers, QA engineers. Each role exists because humans specialize. Nobody can do everything well. But when agents handle implementation, the boundaries blur. At Atlassian, designers with zero prior coding experience now ship 5–6 PRs per week. At OpenAI, a PM ran a bug bash, had Codex collect feedback into Notion, file tickets, assign them, and follow up. The role boundaries that Agile formalized are dissolving.
Ceremonies assume synchronous coordination. Standups, sprint planning, retrospectives: these exist because humans need to synchronize. They need to know what others are working on, resolve blockers face-to-face, and align on priorities. When agents run asynchronously (Uber’s Minions platform kicks off tasks via Slack and delivers completed PRs hours later), the coordination model changes fundamentally. You don’t need a standup to know what an agent did overnight.
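To make the forecasting gap concrete: one stopgap some teams use is forecasting from observed throughput instead of estimates, resampling recent history to answer “when will this backlog be done?” A minimal sketch in Python; the weekly numbers and backlog size are invented:

```python
import random

# Invented history: units of value shipped per week over the last eight weeks
# (use whatever your team actually counts: specs completed, features shipped).
weekly_throughput = [3, 5, 2, 6, 4, 4, 7, 3]

def forecast_weeks(backlog_size: int, simulations: int = 10_000) -> list[int]:
    """Monte Carlo forecast: resample past weeks until the backlog is empty."""
    results = []
    for _ in range(simulations):
        remaining, weeks = backlog_size, 0
        while remaining > 0:
            remaining -= random.choice(weekly_throughput)
            weeks += 1
        results.append(weeks)
    return sorted(results)

runs = forecast_weeks(backlog_size=30)
print(f"50% confidence: done in {runs[len(runs) // 2]} weeks")
print(f"85% confidence: done in {runs[int(len(runs) * 0.85)]} weeks")
```

The simulation itself is trivial. The point is the input: measured throughput stays meaningful whether a human or an agent did the building, while estimated human effort does not.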
None of this means Agile’s values are wrong. Responding to change over following a plan is more relevant than ever. But the practices (the specific ceremonies, cadences, and role structures) were optimized for a world where humans were the only builders. That world is ending.
The Emerging Model
Across the research I’ve been tracking (McKinsey’s enterprise survey, Uber’s platform engineering, Atlassian’s metrics, Netflix’s refactoring workflows, OpenAI’s internal practices, DHH’s work at 37signals), a consistent pattern is emerging. It doesn’t have a name yet. But it has five characteristics.
Smaller Pods, Not Smaller Teams
McKinsey’s top performers moved from “two-pizza teams” (8–10 people) to “one-pizza pods” (3–5 people). But this isn’t just headcount reduction. The pod structure reflects a different division of labor: fewer humans, each with broader scope, orchestrating multiple agents.
The role McKinsey calls the “product builder” is someone with full-stack fluency who orchestrates agents rather than writing code in a single specialty. This maps to what Werner Vogels predicted as the “renaissance developer”: a modern polymath who integrates domain knowledge, business constraints, and technical judgment. At Atlassian, Rajeev Rajan describes the same convergence: PMs build prototypes with Replit, designers write code, engineers do product thinking. The Venn diagram overlaps far more than it used to.
Ryan Lopopolo at OpenAI takes this further: his team centralizes around 5–10 deep skills rather than spreading thin across many. The pod isn’t a miniature Agile team. It’s a small group of people who are very good at directing agents in a focused domain.
The obvious risk: concentrating knowledge in fewer people increases fragility. When a key person leaves a 3-person pod, you lose a third of the team’s context overnight. The counterargument is that spec-driven workflows and harness engineering embed institutional knowledge in the codebase itself: constraints, tests, lint rules, and detailed implementation plans that reduce dependence on any individual’s mental model. But that only works if the specs are genuinely maintained. Small pods need disciplined documentation precisely because they can’t afford knowledge loss.
Spec-Driven, Not Story-Driven
The most consistent pattern across every high-performing team: they invest heavily in specifications before any code is generated.
Jake Nations at Netflix describes a three-phase approach (research, plan, implement) where the first two phases are entirely about building a precise spec. The spec is so detailed that “a junior engineer could follow it paint-by-numbers.” Only then do agents execute. His team learned this the hard way: when they tried to let agents refactor Netflix’s authorization system directly, the agents spiraled on tightly coupled code. They had to do one migration by hand first, earn the understanding, and encode it into a spec before agents could handle the rest.
Lopopolo frames this as “harness engineering”: the codebase itself becomes a prompt to the model. Lint error messages include remediation steps. Structural tests enforce file length limits and package boundaries. Every constraint is a guide that shapes agent output. The spec isn’t a document you write once; it’s a living system of constraints embedded in the codebase.
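To make “the codebase as a prompt” concrete, here is a minimal sketch of what a structural test in this spirit might look like. The limits, package names, and remediation text are all assumptions for illustration, not Lopopolo’s actual rules; what matters is that the failure message tells the agent how to fix the problem, not just that one exists:

```python
from pathlib import Path

MAX_FILE_LINES = 300                  # assumed team convention
FORBIDDEN_IMPORTS = {                 # assumed package boundary
    "billing": ["internal.auth.db"],  # billing must use the public auth API
}

def test_file_length_limits():
    for path in Path("src").rglob("*.py"):
        lines = path.read_text().count("\n")
        assert lines <= MAX_FILE_LINES, (
            f"{path} has {lines} lines (limit: {MAX_FILE_LINES}). "
            "Remediation: split along class or function boundaries. "
            "Do not raise the limit."
        )

def test_package_boundaries():
    for package, banned in FORBIDDEN_IMPORTS.items():
        for path in Path("src", package).rglob("*.py"):
            source = path.read_text()
            for module in banned:
                # Naive string match; a real harness would parse the AST.
                assert f"import {module}" not in source, (
                    f"{path} imports {module} directly. "
                    "Remediation: call the public auth API instead."
                )
```

Every assertion that fires becomes feedback an agent can act on in its next attempt, which is exactly what a prose style guide cannot do.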
McKinsey’s bank case study showed the same pattern: PMs co-created acceptance criteria with agents before handoff to development, preventing the rework that plagued their earlier approach of handing agents prose-based user stories.
This is the shift from “tell the agent what to build” to “constrain the agent so it can only build the right thing.” I described this in From Cloud-Native to AI-Native as the harness pattern: guides constrain, sensors verify, failures tighten. The spec is the guide. Tests and review agents are the sensors. Every failure makes the harness better.
Continuous Planning, Not Sprint Planning
When implementation takes minutes instead of days, planning can’t wait for a two-week ceremony. McKinsey’s top performers moved from quarterly planning to continuous planning. The bottleneck shifts from “can we build it?” to “should we build it?” And that question needs answering constantly, not every two weeks.
OpenAI’s Codex team reinvents its workflow weekly, chasing shifting bottlenecks. First it was code generation. Then code review. Then understanding user needs. Each solved bottleneck reveals the next. Tibo, the Codex team lead, describes a cycle where the team identifies the current constraint, eliminates it, and immediately starts hunting the next one.
DHH at 37signals describes a similar acceleration: projects they’d never have contemplated are now feasible because the cost of exploring a hunch dropped by orders of magnitude. When you can prototype in an afternoon, planning becomes continuous experimentation rather than periodic commitment.
Role Blurring, Not Role Elimination
The fear narrative says AI replaces developers. What’s actually happening is more subtle: AI blurs the boundaries between roles.
At Atlassian, designers ship production code. At OpenAI, PMs run automated bug bashes. At 37signals, DHH processed 100 community PRs in 90 minutes, a task that would have taken days and involved multiple reviewers. At Flint, bugs spotted on a morning sales call get fixed by 11 AM.
This isn’t role elimination. It’s role expansion. The product engineer, someone who cares as much about what and why as how, becomes the default, not the exception. Drew from Stripe offers a litmus test: when someone describes their career, do they talk about what they built and why, or about the technologies they used? In a post-Agile world, the answer increasingly needs to be the former.
Grady Booch, co-creator of UML, frames this historically: every golden age of software engineering was defined by rising abstraction. Assembly programmers feared compilers. Compiler-era devs feared higher-level languages. Each time, the people who understood fundamentals thrived. The current shift is the same pattern at a higher level of abstraction, and the winners are those who design constraints, not those who write code.
Outcomes, Not Activity
McKinsey found that bottom performers weren’t even measuring speed: only 10% tracked productivity metrics. Top performers measured across four dimensions: inputs (tool and upskilling investment), outputs (velocity, developer NPS), quality (security incidents, mean time to resolve), and economic outcomes (time to revenue, cost per pod).
Uber faces the same measurement challenge. Activity metrics are all up: more diffs, more AI usage, higher NPS. But the CFO asks about revenue impact, not diff counts. They’re now instrumenting feature-level timelines (design → experiment launch) to connect engineering velocity to business outcomes.
Atlassian’s numbers are the most concrete: +89% PRs per engineer, -42% issue cycle time, 51% of security vulnerabilities fixed by agents. But it’s worth being precise about what these actually measure. PR count is a throughput metric, better than lines of code, but still a proxy for value. Cycle time is an efficiency metric that captures how fast real problems get solved. Vulnerability fixes are genuine outcome metrics that connect directly to business risk. The distinction matters: if you only track the first category, you’ll optimize for volume without knowing whether you’re shipping value.
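One way to keep those categories honest is to derive all three from the same feature-level event log, which is essentially what Uber’s design → experiment-launch instrumentation does. A minimal sketch; the records and field names are invented:

```python
from datetime import datetime
from statistics import median

# Invented feature-level event log: one record per shipped feature.
features = [
    {"design_done": "2026-03-01", "shipped": "2026-03-09", "escaped_defects": 0},
    {"design_done": "2026-03-02", "shipped": "2026-03-20", "escaped_defects": 2},
    {"design_done": "2026-03-10", "shipped": "2026-03-14", "escaped_defects": 1},
]

def days_between(start: str, end: str) -> int:
    return (datetime.fromisoformat(end) - datetime.fromisoformat(start)).days

cycle_times = [days_between(f["design_done"], f["shipped"]) for f in features]
escaped = sum(f["escaped_defects"] for f in features)

print("features shipped:", len(features))               # throughput (volume proxy)
print("median design→ship days:", median(cycle_times))  # efficiency
print("escaped defects per feature:", escaped / len(features))  # outcome
```

If the volume number rises while the outcome number worsens, you’re generating code, not value.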
The Human Side
The transition isn’t purely structural. It changes what it feels like to be an engineer.
Chip Huyen describes a senior engineer who still reviews junior team members’ AI-generated code line by line for “mentoring.” The juniors don’t read the feedback. It’s not actionable because they didn’t write the code. The entire PR review workflow is outdated. The senior should be reviewing how juniors instruct AI, not inspecting the output. Code quality becomes prompt quality. Mentoring becomes teaching people how to steer.
This is uncomfortable. It means the skills that made someone a great engineer five years ago (deep knowledge of a specific framework, fast typing, clean syntax) matter less. What matters more: product intuition, systems thinking, the ability to evaluate outcomes without reading every line. As I wrote in The Dawn of the Renaissance Developer, every major abstraction shift raised the same questions: Do I still need my prior knowledge? Is there enough work left for me? The answer has always been yes. But the work changes.
Both DHH and Atlassian’s Rajeev Rajan converge on something unexpected: coding is fun again. Agents remove the drudgery (build errors, npm issues, boilerplate, Xcode setup) and return developers to the joy of building. Rajan bought a personal laptop to code over the holidays. DHH describes the feeling as “intoxicating.” The profession is reconnecting with what drew people in originally.
But there’s a cost dimension too. Thomas Dohmke flags a problem nobody’s solved: the more productive your developers get with agents, the more tokens they burn, and your opex budget explodes. Traditional fixed-cost budgeting doesn’t account for variable token spend that scales with productivity. This creates a perverse incentive to slow down productive developers. The post-Agile operating model needs a post-Agile cost model to match.
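A back-of-envelope model makes the problem visible. Every number below is an invented assumption, but the structure is the point: token spend is roughly linear in throughput, so a fixed budget is an implicit throughput cap:

```python
DEVS = 40
TOKENS_PER_PR = 2_000_000    # assumed: agent research + generation + review loops
PRICE_PER_M_TOKENS = 5.00    # assumed blended input/output price, USD
WEEKS_PER_MONTH = 4.33

def monthly_token_spend(prs_per_dev_per_week: float) -> float:
    prs = DEVS * prs_per_dev_per_week * WEEKS_PER_MONTH
    return prs * TOKENS_PER_PR / 1_000_000 * PRICE_PER_M_TOKENS

print(f"baseline (4 PRs/dev/week): ${monthly_token_spend(4):,.0f}/month")
print(f"after +89% PR throughput:  ${monthly_token_spend(4 * 1.89):,.0f}/month")
```

If your pods hit Atlassian-style throughput gains, spend scales right along with them. Budget for the slope, not the intercept.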
What Engineering Leaders Should Do Now
If you’re leading an engineering organization, here’s where to start:
Pilot one pod. Pick a team, shrink it to 3–5 people with broad skills, and let them work spec-driven with agents for one quarter. Measure outcomes (cycle time, defect rate, features shipped), not activity (lines of code, story points completed).
Invest in specs, not stories. Train your teams to write detailed implementation plans (architecture decisions, function signatures, data flow, acceptance criteria) before any agent touches code. Nations’s three-phase approach (research → plan → implement) is a good starting framework.
Rethink your ceremonies. You probably don’t need two-week sprints if implementation takes hours. Experiment with continuous planning: daily priority reviews, weekly bottleneck hunts, monthly strategy alignment. Keep the feedback loops. Lose the fixed cadence.
Measure what matters. Track time-to-feature (from idea to production), defect escape rate, and cost per shipped feature. Stop tracking velocity in story points. It measures the wrong thing when agents do the building.
Redesign code review. The biggest bottleneck in every organization I’ve studied is review, not generation. Uber built an entire platform (Code Inbox + U-Review) to solve this. You don’t need Uber’s scale, but you need a plan for how humans review agent output without becoming the constraint. A sketch of one possible triage policy follows this list.
Budget for tokens. AI costs at Uber went up 6x since 2024. Build variable token spend into your cost model now, before the CFO asks why your cloud bill doubled.
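On the review point above: you don’t need Uber’s platform to start, just an explicit policy for which agent PRs get how much human attention. A sketch of one possible triage policy; every field name and threshold is a placeholder to tune against your own risk profile:

```python
# Hypothetical triage for agent-generated PRs: route human attention by risk.
def review_depth(pr: dict) -> str:
    if pr["touches_auth"] or pr["touches_payments"]:
        return "full line-by-line human review"        # high blast radius
    if pr["all_checks_green"] and pr["spec_linked"] and pr["files_changed"] <= 5:
        return "agent review; human approves summary"  # small, well-constrained
    return "human review of diff plus spot checks"     # the default middle tier

example = {"touches_auth": False, "touches_payments": False,
           "files_changed": 3, "all_checks_green": True, "spec_linked": True}
print(review_depth(example))  # -> agent review; human approves summary
```

The specific tiers matter less than having them written down; otherwise every PR defaults to the most expensive tier and review stays the bottleneck.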
The Through-Line
Agile was the right answer for human-only teams building software in two-week increments. It served us well for twenty years. But AI agents break its core assumptions: velocity, estimation, specialization, and coordination.
The organizations pulling ahead aren’t tweaking Agile. They’re replacing it with something new: smaller pods of renaissance developers, spec-driven workflows that constrain agent output, continuous planning that matches the speed of AI-assisted delivery, and outcome-based measurement that connects engineering velocity to business value.
The operating model is the bottleneck. The tools are ready. The question is whether your organization is.
Sources
- McKinsey, “Moving Away from Agile: What’s Next” (AI Engineer Summit, April 2026)
- Uber, “Leading Engineering Through an Agentic Shift” (Pragmatic Engineer Summit)
- Atlassian / GitHub, “Building World-Class Engineering Teams in the Age of AI” (Pragmatic Engineer Summit)
- Netflix, Jake Nations, “The Infinite Software Crisis” (AI Engineer Summit)
- OpenAI, “How AI Is Reshaping the Craft of Building Software” (Pragmatic Engineer)
- DHH, “A New Way of Writing Code” (Pragmatic Engineer)
- Chip Huyen, “Building When It Feels Like There’s Nothing Left” (Pragmatic Engineer)
- Grady Booch, “The Third Golden Age of Software Engineering” (Pragmatic Engineer)
- Ryan Lopopolo, “Harness Engineering” (AI Engineer Summit)
- Product-Minded Engineers panel (Pragmatic Engineer Summit)
- Christoph, S., “From Cloud-Native to AI-Native”
- Christoph, S., “The Dawn of the Renaissance Developer”
What’s your experience? Are you still running two-week sprints with AI agents, or have you started restructuring? I’d love to hear what’s working and what isn’t.