What Reasoning Actually Means (and Why It Matters for Your Architecture)
It Started with a Saturday Morning Experiment
I recently ran a simple test. I asked a small language model the same questions three times, with zero, one, and three rounds of self-reflection, and published the results. The pattern was clear: self-reflection helped when the model already knew the topic. It did nothing when it didn’t. And on bleeding-edge questions, more thinking just produced more confidently wrong answers.
That experiment raised a question I couldn’t shake: if “thinking harder” only works sometimes, what exactly is happening when a model reasons, and when is it just pretending?
LLMs Don't Do Math — They Predict What Math Looks Like
The Invisible Error
To test this, I designed five calculations that anyone in business might ask an AI assistant: the kind of questions you’d type into ChatGPT or Claude expecting a quick, reliable answer:
- Simple arithmetic — 7 × 8 (baseline sanity check)
- A discount calculation — “What’s the final price of a €249.99 item with 15% off?” (retail, e-commerce)
- Compound interest — “How much is €10,000 worth after 7 years at 3.5%?” (investment planning)
- A mortgage payment — “What’s the monthly payment on a €250,000 loan at 3.8% over 25 years?” (the kind of number people make life decisions on)
- Standard deviation — of a 10-number dataset (basic statistics, common in reporting)
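For reference, every one of these has a deterministic ground truth that a few lines of code compute exactly. A minimal Python sketch (the ten-number dataset is illustrative — the original article doesn’t list the actual numbers):

```python
import statistics

def discounted_price(price: float, discount_pct: float) -> float:
    """Final price after a percentage discount."""
    return round(price * (1 - discount_pct / 100), 2)

def compound_value(principal: float, annual_rate: float, years: int) -> float:
    """Future value with annual compounding."""
    return round(principal * (1 + annual_rate) ** years, 2)

def monthly_mortgage_payment(principal: float, annual_rate: float, years: int) -> float:
    """Standard amortization formula: P*r / (1 - (1+r)^-n)."""
    r = annual_rate / 12   # monthly interest rate
    n = years * 12         # total number of payments
    return round(principal * r / (1 - (1 + r) ** -n), 2)

print(7 * 8)                                         # 56
print(discounted_price(249.99, 15))                  # 212.49
print(compound_value(10_000, 0.035, 7))              # 12722.79
print(monthly_mortgage_payment(250_000, 0.038, 25))  # ~1292.14

data = [12, 15, 9, 22, 18, 11, 14, 20, 16, 13]       # illustrative dataset
print(round(statistics.pstdev(data), 2))             # population std dev: 3.87
```

This is exactly the kind of computation a language model cannot perform natively: it predicts digits token by token, while the code above executes the arithmetic.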
I ran each calculation through two models on Amazon Bedrock: Amazon Nova Micro ($0.046/1M input tokens) and Claude Sonnet 4 ($3.00/1M input tokens, roughly 65x more expensive). Prices are on-demand rates at the time of writing [4]. The choice of models isn’t a judgment on either; both are excellent at what they’re designed for. The point is to show that this is a structural limitation of how language models work, not a quality issue with any specific model. A small model gets it wrong more often; a large, expensive model gets it wrong less often. But neither is computing: both are predicting. The error shrinks with scale but doesn’t disappear, because the architecture is fundamentally probabilistic.