Blog

When Thinking Twice Helps — And When It Doesn't

The Saturday Morning Experiment

Last Saturday, I installed a Python library, pointed it at Amazon Bedrock, and asked a model the same questions three times — with zero, one, and three rounds of self-reflection.

The results surprised me.

  Q  Refl   Time  Acc  Comp  Nuan  Total
  1     0   3.0s    4     3     3     10
  1     1   5.5s    4     2     3      9
  1     3   8.8s    4     3     4     11
  2     0   2.6s    4     2     2      8
  2     1   5.7s    4     2     2      8
  2     3   8.5s    4     2     2      8
  3     0   3.1s    1     1     1      3
  3     1   5.2s    1     1     1      3
  3     3   8.6s    1     1     1      3

Q is the question number, Refl the number of self-reflection rounds (0 = straight answer, 1 = one revision, 3 = three revisions). Acc, Comp, and Nuan are the judge’s scores for Accuracy, Completeness, and Nuance — each on a 1-5 scale, 15 max total.