Is RAG Still Needed with 1M+ Token Context Windows?
The Kofferklausur, Revisited
In September 2024, a colleague asked an audience: “What is RAG?” I answered: Kofferklausur [1].
For non-German speakers: a Kofferklausur is an open-book exam. You bring your textbooks, notes, everything. The exam doesn’t test what you memorized — it tests whether you can find the right information and reason about it under pressure.
That analogy stuck with me. A foundation model is the student. RAG is the suitcase full of books. The model doesn’t need to memorize every fact — it needs to know how to find the right one and reason about it. Special-purpose tools beat the Swiss Army knife.
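The open-book idea can be sketched in a few lines: rank the notes in the suitcase by how well they match the question, then hand only the best ones to the model. This is a toy illustration, not a production retriever. Word overlap stands in for a real embedding search, and the names `tokenize`, `retrieve`, and `build_prompt` are hypothetical.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, notes: list[str], k: int = 2) -> list[str]:
    """Return the k notes that share the most words with the question.

    Assumption: word overlap is a crude stand-in for embedding similarity.
    """
    question_words = tokenize(question)
    ranked = sorted(
        notes,
        key=lambda note: len(question_words & tokenize(note)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, notes: list[str], k: int = 2) -> str:
    """Assemble a prompt that contains only the retrieved notes."""
    context = "\n".join(f"- {note}" for note in retrieve(question, notes, k))
    return f"Use only these notes:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    suitcase = [
        "RAG retrieves documents before the model answers.",
        "A Kofferklausur is an open-book exam.",
        "Bedrock hosts foundation models.",
    ]
    print(build_prompt("What is an open-book exam?", suitcase, k=1))
```

The model never sees the whole suitcase, only the pages that look relevant to the question at hand.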
When Thinking Twice Helps — And When It Doesn't
The Saturday Morning Experiment
Last Saturday, I installed a Python library, pointed it at Amazon Bedrock, and asked a model the same three questions under three settings: with zero, one, and three rounds of self-reflection.
The results surprised me.
Q  Refl  Time  Acc  Comp  Nuan  Total
1  0     3.0s  4    3     3     10
1  1     5.5s  4    2     3     9
1  3     8.8s  4    3     4     11
2  0     2.6s  4    2     2     8
2  1     5.7s  4    2     2     8
2  3     8.5s  4    2     2     8
3  0     3.1s  1    1     1     3
3  1     5.2s  1    1     1     3
3  3     8.6s  1    1     1     3
Q is the question number and Refl the number of self-reflection rounds (0 = straight answer, 1 = one revision, 3 = three revisions). Time is given in seconds. Acc, Comp, and Nuan are the judge’s scores for Accuracy, Completeness, and Nuance, each on a scale of 1 to 5, so Total is at most 15.
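For readers who want to picture the setup, a self-reflection loop can be sketched roughly as follows. This is not the library used in the experiment. It is a minimal sketch under stated assumptions: `generate` is a hypothetical callable that wraps whatever model client you use (for example a Bedrock call), and the wording of the revision prompt is invented for illustration.

```python
from typing import Callable


def answer_with_reflection(
    question: str,
    generate: Callable[[str], str],
    rounds: int = 0,
) -> str:
    """Ask once, then let the model revise its own answer `rounds` times.

    `generate` is a hypothetical wrapper: it takes a prompt string and
    returns the model's text response.
    """
    answer = generate(question)
    for _ in range(rounds):
        # Assumed revision prompt; a real library may phrase this differently.
        revision_prompt = (
            f"Question: {question}\n"
            f"Previous answer: {answer}\n"
            "Critique the previous answer, then write an improved answer."
        )
        answer = generate(revision_prompt)
    return answer


def total_score(accuracy: int, completeness: int, nuance: int) -> int:
    """Sum the three judge scores (each 1 to 5, so at most 15)."""
    return accuracy + completeness + nuance


if __name__ == "__main__":
    # Stub model so the sketch runs without any cloud credentials.
    def echo_model(prompt: str) -> str:
        return f"answer based on {len(prompt)} prompt characters"

    for rounds in (0, 1, 3):
        print(rounds, answer_with_reflection("What is RAG?", echo_model, rounds))
```

Each reflection round is one more model call, which matches the rough pattern in the Time column: more rounds, longer waits.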
Your AI Models Have an Expiry Date — A Practical Guide to Model Lifecycle Management
Introduction — The Promise I Made
In my previous article [1], I explored the maintenance trap in IT — how software systems are more like plants than stones, requiring constant care. I ended with a cliffhanger: “What is open from the article is how to specifically test and evaluate models — something to be picked up in the next article.”
This is that article.
Since publishing the first piece, something happened that made this topic very real for many of my customers. Anthropic announced the deprecation of Claude 3.5 Sonnet — a model that had become the backbone of countless production applications. Teams that had built their systems around a specific model version suddenly faced a hard deadline to migrate. Some were prepared. Most were not.