Intelligence Is About Time, Not Parameters
The Question Every SA Gets

Beyond a complexity threshold, larger models become less insightful — the savant regime.
“Which model should I use?”
I hear it in almost every customer conversation about generative AI. The instinct is always the same: reach for the biggest model. More parameters, more intelligence. It feels right. It’s also wrong, and now there’s a mathematical proof to explain why.
A paper by Stefano Soatto (VP, AWS Agentic AI) and Alessandro Achille (Principal Applied Scientist, AWS) makes a claim that should change how we think about AI architecture: beyond a certain complexity threshold, larger models become less insightful, not more [1]. They call this the “savant regime,” where benchmark scores keep climbing but genuine reasoning declines. The model solves problems through brute force, not understanding.
The paper was published on February 25, 2026, and the Amazon Science blog post that accompanies it is one of the clearest pieces of AI research writing I’ve read this year [2]. It deserves a close read. Here’s what it means for practitioners.
What the Savant Regime Actually Is
Think about what happens when you scale a model to extreme sizes with unlimited compute. At the theoretical limit, you arrive at what Ray Solomonoff described in 1964: execute all possible programs, average the outcomes of those that reproduce the observed data. That gives you the universally optimal answer to any problem.
It also takes forever. And it requires zero learning.
That’s the savant regime. The model can solve anything, but it does so by exhaustive search, not by understanding the structure of the problem. It’s the AI equivalent of a student who memorizes every possible exam answer instead of learning the subject.
Soatto and Achille argue that naive parameter scaling pushes models toward this regime. Performance on benchmarks improves because the model has more capacity for brute-force pattern matching. But the algorithmic insight, the ability to recombine learned methods in novel ways, actually decreases.
“If scale does not lead to intelligence, what does? We argue that the answer is time.”
The Equation That Changes Everything

The central result: log speed-up equals the algorithmic mutual information between training data and new tasks.
The paper’s central result is compact:
log speed-up = I(h : D)
Where h is the solution to a new task, D is the training dataset, and I(h : D) is the algorithmic mutual information between them.
To understand what “algorithmic mutual information” means, you need a quick detour through information theory. Claude Shannon’s 1948 framework measures information as a property of distributions: how uncertain you are about a variable’s value. But what if you only have a single data sample and no distribution? In the 1960s, Solomonoff and Kolmogorov independently proposed an alternative: algorithmic information. The information content of a binary string is the length of the shortest program that generates it on a universal Turing machine. The algorithmic mutual information between two strings measures how much shorter the program for one becomes when you have access to the other.
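For readers who want the formal version, here is the textbook algorithmic-information notation (the paper may use slightly different conventions, and the identities hold only up to logarithmic terms):

```latex
% Kolmogorov complexity: length of the shortest program p that outputs x
% on a fixed universal machine U
K(x) = \min \{\, |p| : U(p) = x \,\}

% Conditional complexity: shortest program for x when y is given as auxiliary input
K(x \mid y) = \min \{\, |p| : U(p, y) = x \,\}

% Algorithmic mutual information: how much knowing D shortens the program for h
I(h : D) = K(h) - K(h \mid D)
```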
Back to the equation. In plain language: the faster a model can solve a new problem (compared to brute force), the more genuine algorithmic knowledge it has extracted from its training data. Speed isn’t just a nice-to-have. It’s a direct measure of how much the model has actually learned.
A caveat: Kolmogorov complexity is uncomputable in the general case. You can’t directly measure I(h : D) for a real model. But the practical implication is clear and actionable: optimize for inference speed as a proxy for intelligence. That you can measure.
This means that minimizing inference time during training forces the model to encode real algorithmic structure into its weights: not just statistical patterns, but reusable methods it can recombine for problems it has never seen. The model that solves a problem in three steps has learned more than the model that solves it in three thousand.
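To make the proxy concrete, here’s a toy calculation. The numbers are invented, and “steps” is a stand-in for whatever unit of compute you can actually measure (tokens, forward passes); the paper doesn’t prescribe this exact computation:

```python
import math

def log_speedup(brute_force_steps: float, model_steps: float) -> float:
    """Log of how much faster the model is than exhaustive search.

    Per the paper's central result, this quantity tracks the algorithmic
    information the model has extracted from its training data about the
    task at hand. Larger is better.
    """
    return math.log2(brute_force_steps / model_steps)

# Hypothetical numbers: a search space of roughly 2^40 candidate programs,
# versus a 200-step reasoning chain from the trained model.
print(log_speedup(brute_force_steps=2**40, model_steps=200))  # ~32.4 bits
```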
The intellectual lineage is worth noting. Solomonoff (1964) showed that universal transductive inference is optimal but takes forever. Levin (1973), in the same paper where he introduced NP-completeness, derived a universally fast algorithm, but it’s impractical and involves no learning. Solomonoff hinted in 1986 that learning could reduce time. This paper closes the loop: it proves how learning reduces time, and shows that the reduction is proportional to the algorithmic information shared between training data and new tasks.
Why This Inverts Everything We Thought We Knew
Classical machine learning theory is built on a single fear: overfitting. The standard prescription is regularization, which means minimizing the information the model retains from training data beyond what’s needed for the loss function. Memorization is the enemy. Simplicity is the goal.
This paper says the opposite.
In the transductive inference framework Soatto and Achille describe, memorization is a virtue. The goal is to maximize the information retained from training data, because any piece of it might be useful for solving a future problem. The model isn’t trying to generalize from past to future (induction). It’s trying to reason through past data to craft solutions to new problems (transduction).
The distinction maps cleanly onto the System 1/System 2 framework from cognitive psychology. Induction is System 1: fast, reactive, automatic. You see a cat, you recognize a cat. Transduction is System 2: slow, deliberative, query-specific. You see a novel math problem, you work through it step by step.
Chain-of-thought reasoning in LLMs? That’s transductive inference. The model performs variable-length, query-specific computation: longer chains for harder problems, shorter ones for easy ones. And the paper proves that optimizing for the shortest correct chain is what produces genuine intelligence.
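Here’s what “shortest correct chain” looks like as a selection rule, assuming you can sample several chains and verify the final answer (against a test case or a reference answer). The verifier and the sampling loop are my assumptions, not something the paper specifies:

```python
def shortest_correct_chain(candidates, is_correct):
    """Among sampled reasoning chains, keep the correct ones and prefer the
    shortest: chain length is the "time" the theory says to minimize.

    candidates: list of (chain_text, final_answer) tuples
    is_correct: callable that checks a final answer against a verifier
    """
    correct = [(chain, answer) for chain, answer in candidates if is_correct(answer)]
    if not correct:
        return None  # no amount of selection helps if the knowledge isn't in the weights
    return min(correct, key=lambda pair: len(pair[0]))
```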
The Zebra and the Cost of Time

Intelligence is about calibrating your time investment to the environment.
For practitioners, the paper doesn’t prescribe a single optimal inference time. It argues that the cost of time is subjective and environment-dependent.
The authors use a vivid analogy: a zebra drinking from a pond doesn’t know how long it has before a predator spots it. Linger too long, become prey. Panic and leave, become dehydrated. Intelligence is about calibrating your time investment to the environment.
For AI systems, this means different tasks demand different time budgets. Scientific discovery operates on a time constant of centuries. Algorithmic trading operates in milliseconds. A customer support chatbot and a drug discovery pipeline shouldn’t use the same model with the same inference budget, not because of cost, but because the optimal intelligence strategy is different for each.
The paper introduces a useful concept: “traits of non-intelligence” (TONIs), behaviors whose presence negates intelligence regardless of how you define it. Taking the same non-minimal time and energy to solve repeated instances of the same task? TONI. Spending the same effort on a trivial task as on a complex one? TONI. Starting a computation known to take longer than the lifetime of the universe? Definitely a TONI. These aren’t philosophical abstractions; they’re engineering requirements. A well-designed system should allocate resources proportional to task difficulty.
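Translated into code, avoiding the first two TONIs is mostly caching plus budget scaling. A minimal sketch; the difficulty heuristic, the budget numbers, and the call_model stub are placeholders I made up:

```python
from functools import lru_cache

# Placeholder budgets: effort should scale with task difficulty (TONI #2).
MAX_TOKENS_BY_DIFFICULTY = {"trivial": 256, "moderate": 2048, "hard": 16384}

def estimate_difficulty(task: str) -> str:
    """Placeholder heuristic; in practice this would be a learned router."""
    return "hard" if len(task) > 500 else "moderate" if len(task) > 100 else "trivial"

def call_model(task: str, max_tokens: int) -> str:
    """Stub standing in for a real model invocation (e.g., a Bedrock call)."""
    return f"[answer to {task!r} within {max_tokens} tokens]"

@lru_cache(maxsize=10_000)  # TONI #1: never re-spend compute on an identical task
def solve(task: str) -> str:
    budget = MAX_TOKENS_BY_DIFFICULTY[estimate_difficulty(task)]
    return call_model(task, max_tokens=budget)
```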
The practical implication: models should be trained to predict the marginal value of one more step of computation. They should condition on a target complexity, providing answers within customer-specified cost bounds. And they should be able to spawn specialized sub-models for specific task classes, rather than routing everything through one giant model.
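The first prescription, predicting the marginal value of one more step, could look like an anytime loop that stops as soon as the expected gain no longer covers the cost of the step. Everything below is assumed: the paper argues models should be trained to provide such an estimate, it doesn’t ship one:

```python
def anytime_solve(task, step_fn, predict_marginal_value, cost_per_step, max_steps=64):
    """Run reasoning steps only while the predicted quality gain exceeds the cost.

    step_fn(state) -> next state; predict_marginal_value(state) -> expected
    quality gain from one more step, in the same units as cost_per_step.
    """
    state = task
    for _ in range(max_steps):
        if predict_marginal_value(state) <= cost_per_step:
            break  # the zebra rule: stop drinking when the risk outweighs the thirst
        state = step_fn(state)
    return state
```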
To be clear: the paper provides the theory, not the engineering. How you determine the right time budget for a specific task is still an open problem. In practice, you’d need empirical calibration: run the same task at different inference budgets, measure quality, find the knee of the curve. The framework tells you why this matters. The tooling to do it well is still being built.
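In the meantime, a calibration sweep is easy to sketch: run the task at increasing budgets, score the output, and stop at the budget where quality gains flatten out. The scoring function and the 1% threshold are my choices, not guidance from the paper:

```python
def find_knee(task, run_at_budget, score,
              budgets=(256, 512, 1024, 2048, 4096, 8192), min_relative_gain=0.01):
    """Return the smallest budget beyond which quality stops improving meaningfully.

    run_at_budget(task, budget) -> model output; score(output) -> quality in [0, 1].
    """
    results = [(budget, score(run_at_budget(task, budget))) for budget in budgets]
    best_budget, best_quality = results[0]
    for budget, quality in results[1:]:
        if best_quality > 0 and (quality - best_quality) / best_quality < min_relative_gain:
            break  # the knee: extra tokens no longer buy quality
        best_budget, best_quality = budget, quality
    return best_budget, best_quality
```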
The Jensen Huang Connection
Earlier this month, I wrote about Jensen Huang’s interview with Dwarkesh Patel [3], where he described Nvidia’s mental model: “The input is electrons, the output is tokens. In the middle is Nvidia.” Huang’s thesis is that more compute produces more intelligence, and he predicted token generation costs could drop by a factor of a billion over the next decade.
Soatto and Achille add important nuance to that picture. More compute can produce more intelligence, but only if the model is optimized for time, not just scale. Without time pressure, you get the savant regime: more compute spent on brute force, not better reasoning. The billion-fold cost reduction Huang envisions is only valuable if the models using that compute are designed to minimize inference time per unit of insight.
The savant regime is the failure mode Huang doesn’t talk about. It’s what happens when you scale compute without scaling intelligence.
The Reasoning Model Connection
This paper also provides the theoretical foundation for something I’ve been exploring in recent posts. In “When Thinking Twice Helps” [5], I tested self-reflection on Amazon Bedrock and found a clear pattern: more thinking time helps when the model already has the relevant knowledge, and does nothing when it doesn’t. Self-reflection is a reasoning amplifier, not a knowledge injector.
The Soatto-Achille framework explains why. Chain-of-thought reasoning is transductive inference, the model searching through its learned algorithmic methods to find the shortest path to a correct answer. If the relevant methods aren’t in the weights (because the knowledge is too recent or too niche), no amount of inference time helps. The model has nothing to search through.
In a forthcoming post on what reasoning actually means [4], I explore this spectrum in more depth, from pattern recall through trained reasoning to genuine abstraction. The Soatto-Achille paper sits underneath all of it as the theoretical layer. It tells us why reasoning models like o3 and Claude’s extended thinking work: they’re trading parameters for inference time, which is exactly what the math says you should do.
And it tells us when they’ll fail: when the task requires knowledge the model never learned, no amount of time-trading helps. You need retrieval, tools, or symbolic grounding, which is exactly the argument I made in “Is RAG Still Needed?” [6].
What This Means If You’re Running AI on AWS
This isn’t just theory. The authors, Soatto and Achille, are AWS scientists. This research directly informs how Bedrock is built. Here’s what it means for architecture decisions:
Right-size your models based on task complexity, not model size. The paper proves that a smaller model under time pressure can be more “intelligent” (in the algorithmic information sense) than a larger model with unlimited compute. When a customer asks “should I use Sonnet or Haiku?”, the answer isn’t always the bigger model. For latency-sensitive tasks, the smaller model forced to reason efficiently may actually produce better results.
Use inference profiles for routing. Bedrock’s inference profiles let you route different tasks to different models. The paper’s framework gives you a principled way to think about this: high-complexity tasks get more inference budget, simple tasks get less. Not just for cost, for intelligence. (There’s a minimal routing sketch after this list.)
Inference cost is an intelligence incentive. This is the most counterintuitive implication. Paying per token isn’t just a billing model. It’s a pressure that forces models toward genuine reasoning. The paper argues that imposing a cost on time should ultimately improve absolute performance on new tasks, not impair it. Your cost optimization strategy and your intelligence optimization strategy are the same strategy.
Think in terms of model portfolios, not single models. The paper envisions agents that spawn specialized sub-models for specific task classes. This maps directly to agentic architectures on Bedrock, an orchestrator that routes to specialized models based on task complexity, with each model optimized for its time budget. The paper was released alongside AI Functions, an open-source library from Amazon’s Strands Labs for building agents with natural-language function bodies governed by pre- and post-conditions. Minimizing time is at the core of the design, which makes sense now that you understand the theory behind it.
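As a concrete starting point for the inference-profile item above, here’s a minimal routing sketch using the Bedrock Converse API via boto3. The model IDs, region, and tier heuristic are placeholders; substitute the models or inference profiles available in your account:

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")  # example region

# Placeholder model choices: a small, fast model for simple tasks,
# a larger one (with a larger token budget) for complex tasks.
MODEL_BY_TIER = {
    "simple": "anthropic.claude-3-haiku-20240307-v1:0",
    "complex": "anthropic.claude-3-5-sonnet-20240620-v1:0",
}
TOKEN_BUDGET_BY_TIER = {"simple": 512, "complex": 4096}

def route_and_invoke(prompt: str, tier: str) -> str:
    """Route a task to the model and token budget matched to its complexity tier."""
    response = bedrock.converse(
        modelId=MODEL_BY_TIER[tier],
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": TOKEN_BUDGET_BY_TIER[tier]},
    )
    return response["output"]["message"]["content"][0]["text"]
```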
The Question That Matters
The old question was: “Which model is biggest?”
The new question is: “Which model is fastest for this task?”
That reframe changes how you architect AI systems. It shifts the optimization target from parameters to time, from scale to efficiency, from brute force to genuine reasoning.
The next time a customer asks me which model to use, I’ll point them to this paper. The answer isn’t the model with the most parameters. It’s the model that solves their specific problem in the fewest steps.
What’s your model selection strategy? Are you optimizing for size, or for time?
Sources
[1] Achille, A. & Soatto, S. — “AI Agents as Universal Task Solvers” (2025, revised Feb 2026). Paper: arxiv.org/abs/2510.12066
[2] Soatto, S. & Achille, A. — “Intelligence isn’t about parameter count. It’s about time.” Amazon Science, February 25, 2026. amazon.science
[3] My earlier post on Jensen Huang’s platform strategy: schristoph.online/blog/nvidia-moat-jensen-huang/
[4] Forthcoming: “What Reasoning Actually Means (and Why It Matters for Your Architecture)” — schristoph.online
[5] My experiment with inference-time self-reflection on Bedrock: schristoph.online/blog/when-thinking-twice-helps/
[6] My analysis of RAG vs long context windows: schristoph.online/blog/is-rag-still-needed/
❤️ Created with the support of AI (Kiro)
📝 Last updated: May 2, 2026 — Minor edits