What Reasoning Actually Means (and Why It Matters for Your Architecture)

Mon, 11 May 2026 00:00:00 +0000

It Started with a Saturday Morning Experiment

I recently ran a simple test. I asked a small language model the same questions three times, with zero, one, and three rounds of self-reflection, and published the results. The pattern was clear: self-reflection helped when the model already knew the topic. It did nothing when it didn’t. And on bleeding-edge questions, more thinking just produced more confidently wrong answers.

That experiment raised a question I couldn’t shake: if “thinking harder” only works sometimes, what exactly is happening when a model reasons, and when is it just pretending?

LLMs Don't Do Math — They Predict What Math Looks Like

Wed, 08 Apr 2026 00:00:00 +0000

The Invisible Error

To test this, I designed five calculations that anyone in business might ask an AI assistant, the kind of questions you’d type into ChatGPT or Claude expecting a quick, reliable answer:

Simple arithmetic — 7 × 8 (baseline sanity check)
A discount calculation — “What’s the final price of a €249.99 item with 15% off?” (retail, e-commerce)
Compound interest — “How much is €10,000 worth after 7 years at 3.5%?” (investment planning)
A mortgage payment — “What’s the monthly payment on a €250,000 loan at 3.8% over 25 years?” (the kind of number people make life decisions on)
Standard deviation — of a 10-number dataset (basic statistics, common in reporting)
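
For reference, here is a minimal sketch of how the five calculations could be computed deterministically in Python. Several details are assumptions, since the questions as posed leave them open: the compound interest is taken to compound once per year, the mortgage uses the standard fixed-payment annuity formula with monthly compounding, and `SAMPLE_DATA` is a made-up stand-in because the actual ten numbers are not listed here. The function names are illustrative only.

```python
from decimal import Decimal, ROUND_HALF_UP
import statistics

CENT = Decimal("0.01")


def discounted_price(price: str, percent_off: str) -> Decimal:
    """Final price after a percentage discount, rounded half-up to cents."""
    factor = (Decimal("100") - Decimal(percent_off)) / Decimal("100")
    return (Decimal(price) * factor).quantize(CENT, rounding=ROUND_HALF_UP)


def compound_value(principal: float, annual_rate: float, years: int) -> float:
    """Future value, ASSUMING interest compounds once per year."""
    return principal * (1 + annual_rate) ** years


def monthly_payment(principal: float, annual_rate: float, years: int) -> float:
    """Fixed monthly payment, ASSUMING monthly compounding at annual_rate / 12."""
    r = annual_rate / 12   # periodic (monthly) rate
    n = years * 12         # total number of payments
    return principal * r / (1 - (1 + r) ** -n)


# Hypothetical dataset: a placeholder, not the one used in the experiment.
SAMPLE_DATA = [12, 15, 11, 18, 14, 13, 17, 16, 10, 14]

if __name__ == "__main__":
    print("7 x 8:", 7 * 8)
    print("Discounted price:", discounted_price("249.99", "15"))
    print("Compound value:", round(compound_value(10_000, 0.035, 7), 2))
    print("Monthly payment:", round(monthly_payment(250_000, 0.038, 25), 2))
    print("Population std dev:", statistics.pstdev(SAMPLE_DATA))
    print("Sample std dev:", statistics.stdev(SAMPLE_DATA))
```

The discount uses `Decimal` so that rounding to cents is explicit rather than a side effect of binary floating point. Note that "standard deviation" is itself ambiguous (population versus sample), which is one more reason a reference value has to state its assumptions.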

I ran each calculation through two models on Amazon Bedrock: Amazon Nova Micro ($0.046/1M input tokens) and Claude Sonnet 4 ($3.00/1M input tokens, roughly 65x the price). Prices are on-demand rates at the time of writing [4]. The choice of models isn’t a judgment on either; both are excellent at what they’re designed for. The point is to show that this is a structural limitation of how language models work, not a quality issue with any specific model. A small model gets it wrong more often. A large, expensive model gets it wrong less often. But neither is computing; both are predicting. The error shrinks with scale but doesn’t disappear, because the architecture is fundamentally probabilistic.
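
One way to make such errors visible is to score each model answer against a deterministically computed reference. The sketch below is an assumption about how that could be done, not the harness used for this experiment: `extract_last_number` and `is_correct` are hypothetical helpers, the "take the last number in the reply" rule is a heuristic that only works if the reply ends with its result, and the regex assumes English-style number formatting (comma as thousands separator, dot as decimal point).

```python
import re
from typing import Optional

# Matches numbers such as "56", "212.49" or "1,292.14".
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_last_number(answer: str) -> Optional[float]:
    """Return the last number found in a free-text answer, or None."""
    matches = NUMBER.findall(answer)
    if not matches:
        return None
    # Drop thousands separators before converting.
    return float(matches[-1].replace(",", ""))


def is_correct(answer: str, expected: float, tolerance: float = 0.005) -> bool:
    """True if the answer's last number is within `tolerance` of `expected`."""
    value = extract_last_number(answer)
    return value is not None and abs(value - expected) <= tolerance


if __name__ == "__main__":
    # Illustrative replies, not real model output.
    print(is_correct("The final price is €212.49.", 212.49))
    print(is_correct("That comes to about €212.50", 212.49))
```

A half-cent tolerance treats an answer that is off by a single cent as wrong, which is the point: a plausible-looking near miss is exactly the kind of error that goes unnoticed without a reference value.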

schristoph.online
