LLMs Don't Do Math — They Predict What Math Looks Like
The Invisible Error
To test this, I designed five calculations that anyone in business might ask an AI assistant — the kind of questions you’d type into ChatGPT or Claude expecting a quick, reliable answer:
- Simple arithmetic — 7 × 8 (baseline sanity check)
- A discount calculation — “What’s the final price of a €249.99 item with 15% off?” (retail, e-commerce)
- Compound interest — “How much is €10,000 worth after 7 years at 3.5%?” (investment planning)
- A mortgage payment — “What’s the monthly payment on a €250,000 loan at 3.8% over 25 years?” (the kind of number people make life decisions on)
- Standard deviation — of a 10-number dataset (basic statistics, common in reporting)
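Each of these has one exact answer that a few lines of deterministic code will always produce, which is the baseline the models are being measured against. A minimal Python sketch of the ground truths (the 10-number dataset for the standard deviation isn't listed in this article, so the one below is a placeholder):

```python
import statistics

# 1. Baseline arithmetic
product = 7 * 8  # 56

# 2. Discount: 15% off a €249.99 item
final_price = round(249.99 * (1 - 0.15), 2)  # €212.49

# 3. Compound interest: €10,000 at 3.5% for 7 years, compounded annually
future_value = round(10_000 * (1 + 0.035) ** 7, 2)  # €12,722.79

# 4. Mortgage: €250,000 at a 3.8% nominal annual rate, 25 years, monthly payments
#    Standard annuity formula: P * r / (1 - (1 + r)^-n)
r, n = 0.038 / 12, 25 * 12
monthly_payment = round(250_000 * r / (1 - (1 + r) ** -n), 2)  # ≈ €1,292

# 5. Standard deviation — placeholder dataset (the article's numbers aren't shown here)
data = [12, 7, 3, 18, 9, 11, 4, 15, 8, 6]
sample_sd = statistics.stdev(data)  # sample (n-1) definition; pstdev() for population
```

Every run of this code returns the same answers to the cent. That determinism, not the difficulty of the arithmetic, is what the language models below are unable to guarantee.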
I ran each calculation through two models on Amazon Bedrock: Amazon Nova Micro ($0.046/1M input tokens) and Claude Sonnet 4 ($3.00/1M input tokens — roughly 65x more expensive). Prices are on-demand rates at the time of writing [4].

The choice of models isn’t a judgment on either — both are excellent at what they’re designed for. The point is that this is a structural limitation of how language models work, not a quality issue with any specific model. A small model gets it wrong more often; a large, expensive model gets it wrong less often. But neither is computing — both are predicting. The error shrinks with scale but doesn’t disappear, because the architecture is fundamentally probabilistic.