What Reasoning Actually Means (and Why It Matters for Your Architecture)
It Started with a Saturday Morning Experiment
I recently ran a simple test. I asked a small language model the same questions three times, with zero, one, and three rounds of self-reflection, and published the results [9]. The pattern was clear: self-reflection helped when the model already knew the topic. It did nothing when it didn’t. And on bleeding-edge questions, more thinking just produced more confidently wrong answers.
That experiment raised a question I couldn’t shake: if “thinking harder” only works sometimes, what exactly is happening when a model reasons, and when is it just pretending?
I spent several weeks pulling on that thread. I went through nine research conversations, from François Chollet on ARC benchmarks to Apple’s GSM-Symbolic paper, from Anthropic’s interpretability work to UT Austin’s neuro-symbolic architectures. What emerged wasn’t a simple answer. It was a framework.
Reasoning Is a Spectrum, Not a Switch
The most useful framing I found comes from Swarat Chaudhuri (UT Austin / Google DeepMind). He argues that reasoning isn’t a Boolean property. It’s not something a system either has or doesn’t. It’s a quantitative spectrum, measured by performance on tasks historically associated with deeper thinking: math, programming, planning [1].
This sidesteps the endless “does it really reason?” debate. Instead of arguing about consciousness or understanding, you ask: how well does this system perform on tasks that require multi-step logic, and how stable is that performance when you change the surface details?
On one end of the spectrum: pattern recall. The model has seen “7 × 8 = 56” thousands of times and reproduces it. No reasoning required, just memory. On the other end: genuine abstraction. The system encounters a novel problem, identifies the relevant structure, and applies principles it has never seen combined in exactly this way.
Most real-world AI use cases fall somewhere in between. And that “somewhere” determines your architecture.
Why Thinking Harder Helps (Sometimes)
OpenAI’s o-models represent a genuine step forward on this spectrum. François Chollet, the creator of the ARC benchmark designed specifically to test reasoning beyond memorization, considers them “far beyond the classical deep learning model” [2]. His theory of how they work: the model performs search in chain-of-thought space, evaluating branches, backtracking when paths fail, and producing what amounts to a natural-language program that it then executes.
The telltale sign: compute scales with problem difficulty. Easy questions get short chains. Hard questions get long ones. This is fundamentally different from a static forward pass. The model allocates more resources to harder problems, which is exactly what my self-reflection experiment showed empirically.
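To make the mechanics concrete, here is a minimal, purely illustrative sketch of what “search in chain-of-thought space” could look like. It is not how the o-models actually work internally; `propose_steps`, `score_chain`, and `is_solution` are stand-ins for model calls and a learned verifier, and the toy heuristics exist only so the code runs.

```python
import heapq

def propose_steps(chain: list[str]) -> list[str]:
    """Stand-in for an LLM proposing candidate next reasoning steps."""
    return [f"step {len(chain) + 1}, option {i}" for i in range(3)]

def score_chain(chain: list[str]) -> float:
    """Stand-in for a learned verifier scoring a partial chain (toy heuristic)."""
    return -len(chain)

def is_solution(chain: list[str]) -> bool:
    """Stand-in for an answer check, e.g. the chain ends in a verified answer."""
    return len(chain) >= 4

def search_chain_of_thought(max_expansions: int = 50) -> list[str] | None:
    """Best-first search over partial reasoning chains. Backtracking falls out
    for free: weak branches stay in the frontier and are never expanded again."""
    frontier: list[tuple[float, list[str]]] = [(0.0, [])]
    for _ in range(max_expansions):
        if not frontier:
            break
        _, chain = heapq.heappop(frontier)
        if is_solution(chain):
            return chain
        for step in propose_steps(chain):
            new_chain = chain + [step]
            heapq.heappush(frontier, (-score_chain(new_chain), new_chain))
    return None  # budget exhausted without a verified answer

print(search_chain_of_thought())
```

The compute-scales-with-difficulty behavior shows up in the budget: an easy problem terminates after a few expansions, a hard one burns through the whole frontier.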
But there’s a ceiling, and it’s lower than you’d think.
Apple researcher Iman Mirzadeh demonstrated this with GSM-Symbolic [3], a variant of a standard math benchmark where the logic stays identical but names and numbers change. The results were striking: simply swapping “Amy” for “John” or “apples” for “bananas” caused significant performance variance. Adding a single irrelevant sentence to the problem caused massive accuracy drops. Even providing examples of how to ignore distractors didn’t help.
The implication: the model isn’t solving the math. It’s matching the pattern of what math solutions look like. When the surface changes, the pattern match degrades.
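You can run a crude version of this robustness check against your own system. The sketch below generates surface variants of one templated word problem (swapped names, changed numbers, an optional irrelevant sentence in the GSM-NoOp style) and scores the model against ground truth. `ask_model` is a placeholder for whatever inference call you use; the template and names are mine, not from the paper.

```python
import random

TEMPLATE = ("{name} has {a} {item}. {name} buys {b} more {item}. "
            "{noise}How many {item} does {name} have now?")

def make_variant(rng: random.Random) -> tuple[str, int]:
    """Build one surface variant of the same problem plus its true answer."""
    name = rng.choice(["Amy", "John", "Priya", "Omar"])
    item = rng.choice(["apples", "bananas", "marbles"])
    a, b = rng.randint(2, 40), rng.randint(2, 40)
    noise = rng.choice(["", "Five of them are slightly smaller than the rest. "])
    return TEMPLATE.format(name=name, a=a, b=b, item=item, noise=noise), a + b

def ask_model(question: str) -> int:
    """Placeholder: replace with your model call and answer parsing."""
    raise NotImplementedError

def robustness_check(n: int = 20, seed: int = 0) -> float:
    """Fraction of surface variants answered correctly. The logic never changes,
    so any accuracy drop here is sensitivity to surface details, not difficulty."""
    rng = random.Random(seed)
    variants = [make_variant(rng) for _ in range(n)]
    return sum(ask_model(q) == truth for q, truth in variants) / n
```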
None of this contradicts Chollet’s assessment that o-models are a genuine advance. Both things are true: o-models raised the ceiling on what pattern-based search can achieve, but the ceiling is still there. The search is better; the substrate it searches over is still learned patterns.
Trenton Bricken (Anthropic) and Sholto Douglas (Google/Gemini) converge on a related insight: chain-of-thought is unreliable as a reasoning trace [4]. Multiple papers show models producing correct answers even when their visible reasoning is garbled or ablated. The internal representations (the KV cache, the hidden states) carry information the text tokens don’t surface. The model may reach the right answer, but the “reasoning” you see isn’t necessarily the reasoning it did.
This matters for trust. If you’re building a system where the reasoning trace is part of the audit trail (compliance, medical, legal), the visible chain of thought may not reflect the actual computation. You’re reading a plausible narrative, not a faithful log.
Three Approaches to Better Reasoning
The research points to three distinct strategies, each with different cost-performance tradeoffs:
1. More Thinking Time (Test-Time Compute)
Give the model more compute at inference. This is what o-models do with chain-of-thought search, what self-reflection does with iterative revision, and what MindsAI’s test-time fine-tuning does by actually updating model weights per-problem [5].
The MindsAI result stands out: a 340M parameter model, tiny by current standards, achieved 58% on ARC by fine-tuning itself on each puzzle’s examples at inference time. That’s a 300% improvement over the base model, achieved not by scaling up but by spending compute where it matters.
Jonas Hubotter’s SIFT framework [6] takes this further with a principled approach: use Bayesian uncertainty to select maximally informative data for local fine-tuning, spending compute only where the model is uncertain. Small models with SIFT beat models 30× their size.
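Neither MindsAI’s pipeline nor SIFT is reproducible from the interviews alone, but the core loop of test-time fine-tuning is easy to sketch: for each task, clone the model, run a few gradient steps on that task’s demonstration pairs, then predict. The PyTorch sketch below uses a toy linear model purely to show the shape of the loop; it is an assumption-laden illustration, not either team’s actual method.

```python
import copy
import torch
from torch import nn

def adapt_and_predict(base_model: nn.Module,
                      demos: list[tuple[torch.Tensor, torch.Tensor]],
                      test_x: torch.Tensor,
                      steps: int = 20, lr: float = 1e-2) -> torch.Tensor:
    """Test-time fine-tuning: fit a per-task copy of the model to this task's
    demonstration pairs, then answer the held-out test input."""
    model = copy.deepcopy(base_model)  # never touch the shared base weights
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    model.train()
    for _ in range(steps):
        opt.zero_grad()
        loss = sum(loss_fn(model(x), y) for x, y in demos)
        loss.backward()
        opt.step()
    model.eval()
    with torch.no_grad():
        return model(test_x)

# Toy usage: one task with two demonstration pairs and one test input.
base = nn.Linear(2, 1)
demos = [(torch.tensor([1.0, 0.0]), torch.tensor([2.0])),
         (torch.tensor([0.0, 1.0]), torch.tensor([3.0]))]
print(adapt_and_predict(base, demos, torch.tensor([1.0, 1.0])))
```

SIFT’s contribution sits one level up: instead of fine-tuning on everything at hand, it selects the few examples the model is most uncertain about, so the adaptation compute goes where it pays off.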
When it works: The model has the right priors but needs to tune its reasoning. Well-trodden problem spaces. Tasks where the answer is within the learned distribution but requires careful multi-step logic.
When it fails: The knowledge isn’t there. No amount of thinking time helps a model reason about concepts it has never encountered, as my MCP Sampling experiment showed (3/15 across all reflection depths, with increasingly confident hallucinations).
2. Better Training (Post-Training and RL)
John Schulman (OpenAI) makes a compelling case that most of GPT-4’s improvement since launch came from post-training (data quality, annotation iteration, reinforcement learning), not pre-training changes [7]. Post-training is where the model learns how to reason, not just what to say.
The key insight: bigger models generalize more and memorize less. A larger model develops better shared representations, so training on English generalizes to Spanish, and roughly 30 examples of “I can’t send emails” were enough to generalize to all capability limitations. This means the quality of your fine-tuning data matters more than the quantity.
When it works: You have domain-specific data and can invest in a training pipeline. The reasoning patterns you need are learnable from examples.
When it fails: The task requires genuine novelty, problems that don’t resemble anything in training. And training is expensive, slow, and requires infrastructure that many teams don’t have.
3. Neuro-Symbolic Grounding (Tools and Verification)
Chaudhuri argues the term “neuro-symbolic” is becoming unnecessary because the pattern is now ubiquitous [1]: neural networks handle perception and language, symbolic tools provide grounding and verification. AlphaProof uses neural models grounded by a Lean theorem prover. Code-generating agents use Python interpreters as symbolic tools. The interesting question isn’t whether to combine neural and symbolic, but what precise form the combination takes.
This is exactly the pattern I described in LLMs Don’t Do Math [10]: the model interprets the question, a deterministic system computes the answer. The LLM thinks; the calculator calculates.
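A minimal version of that split, as a sketch: the model’s only job is to translate the question into an arithmetic expression, and a deterministic evaluator computes the result. `question_to_expression` is a placeholder for the model call (here it just returns a fixed string); the evaluator accepts plain arithmetic only, so anything malformed fails loudly instead of silently.

```python
import ast
import operator

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv,
        ast.Pow: operator.pow, ast.USub: operator.neg}

def safe_eval(expr: str) -> float:
    """Deterministically evaluate a plain arithmetic expression, nothing else."""
    def walk(node: ast.AST):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError(f"unsupported expression: {expr!r}")
    return walk(ast.parse(expr, mode="eval").body)

def question_to_expression(question: str) -> str:
    """Placeholder: in a real system the LLM produces this string."""
    return "(17 * 23) + 4"

print(safe_eval(question_to_expression("What is 17 times 23, plus 4?")))  # 395
```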
Kevin Ellis (Cornell/Basis) and Zenna Tavares extend this with a verify-then-fallback ensemble [8]: try program synthesis first (because you can verify the output against test cases), fall back to neural prediction only when synthesis fails. On ARC, the two approaches solve substantially different problems. Algorithmic tasks favor programs, perceptual tasks favor neural methods, and the ensemble outperforms either alone.
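Here is a heavily simplified version of that verify-then-fallback structure, with synthesized candidates represented as plain Python functions, 1-D inputs instead of ARC grids, and a trivial stand-in for the neural predictor. The ensemble shape is the point, not the components.

```python
from typing import Callable

Example = tuple[int, int]  # (input, expected output) pairs

def verify(program: Callable[[int], int], examples: list[Example]) -> bool:
    """A program counts as solved only if it reproduces every training pair."""
    return all(program(x) == y for x, y in examples)

def solve(candidates: list[Callable[[int], int]],
          neural_fallback: Callable[[int], int],
          examples: list[Example], test_input: int) -> int:
    """Try synthesized programs first (their outputs can be checked);
    fall back to the unverifiable neural guess only if none pass."""
    for program in candidates:
        if verify(program, examples):
            return program(test_input)
    return neural_fallback(test_input)

# Toy usage: the second candidate matches the training pairs (y = 2x + 1).
examples = [(1, 3), (4, 9)]
candidates = [lambda x: x + 2, lambda x: 2 * x + 1]
print(solve(candidates, lambda x: x, examples, test_input=10))  # 21
```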
When it works: Tasks with verifiable outputs. Math, code, data queries, anything where you can check the answer against ground truth.
When it fails: Open-ended generation where there’s no clear verification criterion. Creative writing, strategy, subtle judgment: tasks where “correct” is subjective.
What This Means for Your Architecture
Here’s the practical framework. For any AI use case, ask two questions:
1. Where does this task fall on the reasoning spectrum?
| Task Type | Spectrum Position | Example |
|---|---|---|
| Pattern recall | Low | FAQ answers, classification, simple extraction |
| Trained reasoning | Middle | Code generation, summarization, standard analysis |
| Novel reasoning | High | New domain problems, multi-step logic on unseen data |
2. What’s the failure mode you can’t tolerate?
| Failure Mode | Mitigation Strategy |
|---|---|
| Subtle errors in well-known domains | More thinking time (self-reflection, extended thinking) |
| Missing domain knowledge | Better training (fine-tuning) or retrieval (RAG) |
| Incorrect calculations or logic | Symbolic grounding (tool use, code execution, verification) |
| Hallucination on novel topics | Don’t rely on the model alone; add external knowledge sources |
The architectural implications:
- Pattern recall tasks don’t need extended thinking. A fast, cheap model with good retrieval is the right architecture. Adding reflection just adds latency and cost.
- Trained reasoning tasks benefit from thinking time and better training. This is where the cost-performance tradeoff gets interesting: a small model with self-reflection can match a large model at a fraction of the cost (the Nova Micro result from my experiment).
- Novel reasoning tasks need all three: thinking time to explore the solution space, training to build the right priors, and symbolic grounding to verify results. This is where agentic architectures with tool use earn their complexity.
But not every use case needs all three. If your task is document classification with well-defined categories, a fine-tuned small model with no reflection or tool use is the right answer. Adding an agentic pipeline would add latency and cost for zero accuracy gain. The framework’s value is in matching strategy to task, not stacking all three by default.
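If you want the matching to be explicit rather than tribal knowledge, the two tables above translate almost directly into a lookup. The category names below are mine, invented for this sketch; the mapping is just the framework restated as data, not any library’s API.

```python
SPECTRUM_STRATEGIES = {
    "pattern_recall": ["fast small model", "retrieval"],
    "trained_reasoning": ["thinking time (self-reflection)", "fine-tuning"],
    "novel_reasoning": ["thinking time", "fine-tuning", "symbolic grounding (tools)"],
}

FAILURE_MITIGATIONS = {
    "subtle_errors": "more thinking time (self-reflection, extended thinking)",
    "missing_knowledge": "fine-tuning or retrieval (RAG)",
    "wrong_calculations": "symbolic grounding (tool use, code execution, verification)",
    "novel_hallucination": "external knowledge sources, not the model alone",
}

def plan_architecture(spectrum_position: str, intolerable_failure: str) -> dict:
    """Answer the two framework questions and return the combined plan."""
    return {
        "base_strategies": SPECTRUM_STRATEGIES[spectrum_position],
        "extra_mitigation": FAILURE_MITIGATIONS[intolerable_failure],
    }

print(plan_architecture("trained_reasoning", "wrong_calculations"))
```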
The mistake I see most often: teams treating reasoning as a model selection problem. “We need better reasoning, so let’s upgrade to a bigger model.” Sometimes that works: scaling does improve reasoning, and the jump from GPT-3 to GPT-4 proved it. But often the bottleneck isn’t the model’s reasoning capacity. It’s the absence of grounding. A model that “thinks harder” about a problem it fundamentally can’t verify will just hallucinate more eloquently.
The Uncomfortable Truth
Mirzadeh’s framing stays with me: intelligence is the slope of learning, not the point on a benchmark [3]. A system that scores 90% on a standard test but collapses when you change the variable names isn’t reasoning. It’s memorizing. A system that scores 60% but handles novel variations consistently is, in a meaningful sense, more intelligent.
For practitioners, this means: stop optimizing for benchmark scores. Start testing on your edge cases, the domain-specific, novel, adversarial inputs that your users will actually throw at the system. That’s where you’ll discover whether your architecture reasons or just recalls.
The models will keep getting better. The spectrum will keep shifting. But the architectural principle won’t change: understand what kind of reasoning your use case requires, and design the system around that, not around the model.
Sources
[1] Swarat Chaudhuri — “What is Reasoning?” (MLST): youtube.com
[2] François Chollet — “OpenAI o-models and ARC” (MLST): youtube.com
[3] Iman Mirzadeh — “Moving Beyond Surface Statistics” (MLST / Apple): youtube.com
[4] Sholto Douglas & Trenton Bricken — “How LLMs Actually Think” (Dwarkesh Patel): youtube.com
[5] Muhammad Jamal — “Test-Time Adaptation” (MLST / MindsAI): youtube.com
[6] Jonas Hubotter — “Test-Time Adaptation: A New Frontier” (MLST): youtube.com
[7] John Schulman — “Reasoning, RLHF & Plan for AI” (Dwarkesh Patel): youtube.com
[8] Kevin Ellis & Zenna Tavares — “Why Program Synthesis Is Next” (MLST): youtube.com
[9] My self-reflection experiment: schristoph.online
[10] My experiment on LLMs and math: schristoph.online
❤️ Created with the support of AI (Kiro)
📝 Last updated: May 2, 2026 — Minor edits