Blog
read more
When Thinking Twice Helps — And When It Doesn't
The Saturday Morning Experiment
Last Saturday, I installed a Python library, pointed it at Amazon Bedrock, and asked a model the same questions three times — with zero, one, and three rounds of self-reflection.
The results surprised me.
Q Refl Time Acc Comp Nuan Total
1 0 3.0s 4 3 3 10
1 1 5.5s 4 2 3 9
1 3 8.8s 4 3 4 11
2 0 2.6s 4 2 2 8
2 1 5.7s 4 2 2 8
2 3 8.5s 4 2 2 8
3 0 3.1s 1 1 1 3
3 1 5.2s 1 1 1 3
3 3 8.6s 1 1 1 3
Q is the question number, Refl the number of self-reflection rounds (0 = straight answer, 1 = one revision, 3 = three revisions). Acc, Comp, and Nuan are the judge’s scores for Accuracy, Completeness, and Nuance — each on a 1-5 scale, 15 max total.