Your AI Judge Needs a Judge

written by Stefan Christoph

May 22, 2026 - 11 minutes read

TL;DR: Most teams ship LLM judges without testing them against human labels. The result: judges that are confidently, consistently wrong on 30%+ of cases. Hamel Husain’s “critique shadowing” methodology (pass/fail + written critiques from a domain expert) builds trustworthy judges. Amazon Bedrock Model Evaluation handles model comparison at scale. You need both — plus periodic human review to catch drift.

You would not ship a feature without testing it. You would not deploy a model without benchmarking it. But most teams ship their LLM judges without verifying them at all.

Hamel Husain put it bluntly in a recent post: “Shipping an unverified LLM judge is the fastest way for people to lose trust in your evals (and your work in general)” [1]. Husain is not a casual observer. He is a machine learning engineer with over 20 years of experience, previously at GitHub and Airbnb, whose early LLM research was used by OpenAI for code understanding. His AI Evals course on Maven has trained over 3,000 students from companies including OpenAI, Anthropic, and Google [2]. When he says the industry has an eval problem, the industry has an eval problem.

I work with enterprise customers building AI applications on AWS. The pattern I see is consistent: teams invest heavily in model selection, prompt engineering, and RAG pipelines, then bolt on evaluation as an afterthought. The judge prompt gets written once, never tested against human labels, and becomes the single source of truth for whether the system works. Nobody asks the obvious question: does the judge actually agree with the humans it is supposed to represent?

This post bridges two worlds: Husain’s practitioner methodology for building trustworthy LLM judges, and Amazon Bedrock’s managed evaluation capabilities that automate scoring at scale. They solve different halves of the same problem. You need both.

The Problem: Judges That Agree With Themselves

Here is a number that should make you uncomfortable: 91%.

A team evaluating a Hindi voice assistant found that their LLM judge agreed with itself 91% of the time across thousands of calls. Impressive consistency. Then they compared the judge’s scores against human labels. Agreement dropped to 64% [1]. The judge was confidently, consistently wrong about more than a third of the cases.

Self-consistency is not accuracy. A judge that always says “pass” is perfectly self-consistent. It is also useless.
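To make the distinction concrete, here is a minimal sketch with invented labels (not data from the voice assistant study). A hypothetical judge that always answers "pass" agrees with itself on every run, yet it agrees with the human labels only as often as the humans said "pass".

```python
# Illustrative labels only; the values are invented for this sketch.
human_labels = ["pass", "fail", "pass", "pass", "pass", "pass", "fail", "pass"]


def always_pass_judge(_output: str) -> str:
    """A degenerate judge that ignores its input."""
    return "pass"


def agreement(a: list[str], b: list[str]) -> float:
    """Fraction of positions where two label lists match."""
    if len(a) != len(b):
        raise ValueError("label lists must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


outputs = [f"output {i}" for i in range(len(human_labels))]
run_1 = [always_pass_judge(o) for o in outputs]
run_2 = [always_pass_judge(o) for o in outputs]

self_consistency = agreement(run_1, run_2)        # judge vs. itself
human_agreement = agreement(run_1, human_labels)  # judge vs. humans

print(f"self-consistency: {self_consistency:.0%}")
print(f"human agreement:  {human_agreement:.0%}")
```

Only the second number says anything about whether the judge can be trusted.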

The root cause is that LLM judges carry systematic biases that are invisible until you test for them:

Verbosity bias — longer responses get higher scores regardless of quality
Position bias — the first or last option in a comparison is preferred
Self-enhancement bias — outputs from the same model family score higher
Anchoring — the scoring rubric’s examples influence the judge more than the actual content

These are not edge cases. They are default behaviors. And they compound. A judge with verbosity bias will systematically reward models that pad their answers, which means your “best” model might just be your most verbose one.
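One way to surface position bias is to ask the same pairwise question twice with the options swapped and count how often the winner changes. The sketch below assumes a hypothetical `compare(first, second)` callable that returns "first" or "second"; both stub judges are invented so the check has something to find.

```python
from typing import Callable

# Hypothetical interface: returns "first" or "second" for the preferred option.
Compare = Callable[[str, str], str]


def position_flip_rate(pairs: list[tuple[str, str]], compare: Compare) -> float:
    """Fraction of pairs where the winner changes when the order is swapped."""
    flips = 0
    for a, b in pairs:
        winner_ab = a if compare(a, b) == "first" else b
        winner_ba = b if compare(b, a) == "first" else a
        if winner_ab != winner_ba:
            flips += 1
    return flips / len(pairs)


def first_biased_judge(first: str, second: str) -> str:
    """Stub judge that always prefers whatever is shown first."""
    return "first"


def length_judge(first: str, second: str) -> str:
    """Stub judge that prefers the longer text, wherever it appears."""
    return "first" if len(first) >= len(second) else "second"


pairs = [("short answer", "a much longer answer"), ("alpha", "beta gamma")]
print(position_flip_rate(pairs, first_biased_judge))
print(position_flip_rate(pairs, length_judge))
```

A flip rate of zero does not mean the judge is unbiased. The length-based stub is perfectly stable under swapping and still shows pure verbosity bias, which is why each bias needs its own check.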

Husain’s Methodology: Start With Pass/Fail

Husain’s approach, which he calls “critique shadowing,” is deliberately simple [3]. It starts with a single question: did the AI achieve the desired outcome? Not “rate this on a scale of 1 to 5 across eight dimensions.” Just pass or fail.

Figure: Husain’s critique shadowing methodology (expert labels, build judge, measure agreement, iterate). Start with human pass/fail judgments, then build the judge from their reasoning.

The key insight is that the critique matters more than the score. When a domain expert writes “this summary omits the CVE number, the CVSS score, and the active exploitation status — a reader would not understand the severity,” that critique becomes a few-shot example for the LLM judge. The judge learns what “fail” looks like from the expert’s reasoning, not from an abstract rubric.
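As a rough sketch of how expert critiques can become few-shot examples, the helper below renders labeled items into a text block that could be pasted into a judge prompt. The `LabeledExample` fields and the `render_few_shot` name are assumptions made for illustration, not part of Husain’s guide.

```python
from dataclasses import dataclass


@dataclass
class LabeledExample:
    """One expert-labeled item. Field names are illustrative."""
    article: str
    summary: str
    judgment: str  # "pass" or "fail"
    critique: str


def render_few_shot(examples: list[LabeledExample]) -> str:
    """Format expert labels as numbered few-shot examples for a judge prompt."""
    blocks = []
    for i, ex in enumerate(examples, start=1):
        if ex.judgment not in ("pass", "fail"):
            raise ValueError(f"unexpected judgment: {ex.judgment!r}")
        blocks.append(
            f"### Example {i} ({ex.judgment.upper()})\n"
            f"Article: {ex.article}\n"
            f"Summary: {ex.summary}\n"
            f"Judgment: {ex.judgment}\n"
            f"Critique: {ex.critique}"
        )
    return "\n\n".join(blocks)


examples = [
    LabeledExample(
        article="A critical OpenSSL vulnerability enables remote code execution...",
        summary="A security issue was found in OpenSSL. Users should update.",
        judgment="fail",
        critique="Omits the CVE number, the CVSS score, and the active "
                 "exploitation status. A reader would not understand the severity.",
    ),
]
print(render_few_shot(examples))
```

Keeping the expert’s wording verbatim matters here: the judge is meant to imitate the expert’s reasoning, so paraphrasing the critique would dilute the signal.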

Husain is explicit about what not to do: “If someone says you need to measure 8 things on a 1-5 scale, they don’t know what they are looking for” [3]. The 1-5 scale feels rigorous. It is not. Nobody can reliably distinguish a 3 from a 4, and the resulting metrics are not actionable. Binary pass/fail with a written critique forces clarity.

His most provocative claim: “The real business value comes from looking at your data. I would go as far as saying that creating a LLM judge is a nice hack I use to trick people into carefully looking at their data” [3]. The judge is a means, not an end. The process of building it forces the team to articulate what good looks like.

Building a DIY Judge on Bedrock

Let me make this concrete. Here is a working LLM judge built on Amazon Bedrock that follows Husain’s methodology. The full code is on GitHub [4].

The setup: 10 tech news articles, each with a generated summary. 8 summaries are good (produced by Amazon Nova Lite). 2 are intentionally vague — the kind of output that looks fine at a glance but fails on closer inspection. Human labels mark which is which.

The judge prompt follows Husain’s pattern: clear pass/fail criteria, few-shot examples with critiques, and a requirement to explain the reasoning.

JUDGE_PROMPT = """\
You are an expert summary evaluator. Your job is to judge whether a 
generated summary accurately captures the key information from the 
original article.

## Evaluation Criteria
A summary PASSES if it:
- Captures the main facts and key numbers from the article
- Is factually accurate (no hallucinated details)
- Covers the most important implications or takeaways

A summary FAILS if it:
- Misses major facts or key numbers
- Is too vague or generic (could apply to any article on the topic)
- Omits critical context that changes the meaning

## Examples

### Example 1 — PASS
Article: "AWS Lambda now supports 30 GB memory and 8 vCPUs..."
Summary: "AWS Lambda doubles its resource limits to 30 GB memory 
and 8 vCPUs, with a new 20% discounted pricing tier."
Judgment: pass
Critique: Captures both key facts and the pricing change. Concise 
and accurate.

### Example 2 — FAIL
Article: "A critical OpenSSL vulnerability (CVE-2026-0001, CVSS 9.8)
enables remote code execution..."
Summary: "A security issue was found in OpenSSL. Users should update."
Judgment: fail
Critique: Omits CVE number, CVSS score, attack vector, active 
exploitation status, and affected versions. Too vague to be actionable.

## Your Task
Evaluate the following summary. Respond with ONLY a JSON object:
{"judgment": "pass" or "fail", "critique": "your detailed explanation"}
"""

The invocation uses Bedrock’s Converse API with Amazon Nova Pro as the judge:

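import json

import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Append the item under evaluation to the judge instructions. `article` and
# `summary` are the texts being judged; this layout is an assumption.
prompt = f"{JUDGE_PROMPT}\nArticle: {article}\nSummary: {summary}"

response = client.converse(
    modelId="amazon.nova-pro-v1:0",
    messages=[{"role": "user", "content": [{"text": prompt}]}],
    inferenceConfig={"maxTokens": 512, "temperature": 0.0},
)

# The judge is instructed to reply with a bare JSON object.
verdict = json.loads(
    response["output"]["message"]["content"][0]["text"]
)
# verdict = {"judgment": "pass", "critique": "The summary captures..."}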

Temperature zero is deliberate. You want deterministic judgments, not creative ones. (Strictly speaking, temperature zero means greedy decoding — hardware and batching can introduce minor variance across calls. For binary pass/fail judgments, this variance is negligible in practice.)
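Calling `json.loads` on the raw reply works only when the model returns bare JSON. Models sometimes wrap the object in a Markdown fence or add a sentence around it, so a defensive parser is worth a few lines. This is a sketch of how one might harden that step, not code from the repository; `parse_verdict` is a hypothetical helper.

```python
import json
import re


def parse_verdict(raw: str) -> dict:
    """Extract and validate a {"judgment", "critique"} object from model text."""
    # Take the span from the first "{" to the last "}" so that surrounding
    # prose or code-fence markers are ignored.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge reply")
    verdict = json.loads(match.group(0))
    judgment = str(verdict.get("judgment", "")).strip().lower()
    if judgment not in ("pass", "fail"):
        raise ValueError(f"unexpected judgment: {judgment!r}")
    return {"judgment": judgment, "critique": str(verdict.get("critique", ""))}


wrapped = 'Here is my verdict: {"judgment": "Fail", "critique": "Too vague."} Done.'
print(parse_verdict(wrapped))
```

Raising on anything other than "pass" or "fail" is deliberate: a malformed verdict that is silently counted as a pass would inflate the agreement numbers.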

The result on our 10-article dataset: 100% agreement with human labels. The judge correctly identified both vague summaries and provided detailed critiques explaining why they failed. To be clear: 10 items is a proof-of-concept, not a production validation. Husain recommends 30+ labeled examples minimum for real calibration. But the mechanics work — and the critiques are where the value lives. Here is what the judge said about a summary that read “Netflix is a streaming company that shows movies and TV shows. They are changing their technology”:

The summary is too vague and omits critical details. It fails to mention the specific architectural shift to durable execution patterns, the affected systems (content encoding pipeline, recommendation system, A/B testing infrastructure), the expected transition period of 18 months, and the anticipated 30% reduction in infrastructure costs.

That critique is actionable. A developer reading it knows exactly what the summary should have included. Compare that to a score of “2.3 out of 5 on completeness” — what do you do with that?
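The caveat about sample size can be put in numbers. The sketch below uses the Wilson score interval, a standard way to bound a proportion, to show how little a perfect score on 10 items pins down. Treating agreement as a simple binomial proportion is an assumption made for illustration.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


for agree, total in [(10, 10), (30, 30), (100, 100)]:
    low, high = wilson_interval(agree, total)
    print(f"{agree}/{total} agreement -> 95% interval [{low:.2f}, {high:.2f}]")
```

With 10 of 10 items agreeing, the lower bound is roughly 0.72, so the data are still compatible with a judge that disagrees with humans on about a quarter of cases. The bound tightens as the labeled set grows, which is the practical argument for labeling more than a handful of examples.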

What Bedrock’s Managed Evaluation Adds

Amazon Bedrock Model Evaluation, generally available since March 2025, takes a different approach [5]. Instead of writing your own judge prompt, you select from built-in metrics (correctness, completeness, faithfulness, harmfulness, and others) and let Bedrock handle the evaluation with curated judge models and optimized prompts.

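# Evaluation jobs are created through the "bedrock" control-plane client,
# not "bedrock-runtime". role_arn, dataset_uri, and output_uri are your own
# IAM role ARN and S3 URIs.
bedrock = boto3.client("bedrock", region_name="us-east-1")

bedrock.create_evaluation_job(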
    jobName="summary-eval",
    roleArn=role_arn,
    applicationType="ModelEvaluation",
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [{
                "taskType": "General",
                "dataset": {
                    "name": "my-dataset",
                    "datasetLocation": {"s3Uri": dataset_uri},
                },
                "metricNames": [
                    "Builtin.Correctness",
                    "Builtin.Completeness",
                    "Builtin.Faithfulness",
                ],
            }],
            "evaluatorModelConfig": {
                "bedrockEvaluatorModels": [
                    {"modelIdentifier": "amazon.nova-pro-v1:0"}
                ],
            },
        },
    },
    inferenceConfig={
        "models": [{
            "bedrockModel": {
                "modelIdentifier": "amazon.nova-lite-v1:0",
                "inferenceParams": json.dumps({
                    "inferenceConfig": {"maxTokens": 256}
                }),
            },
        }],
    },
    outputDataConfig={"s3Uri": output_uri},
)

Configured this way, the managed evaluation generates its own responses from the generator model, scores them, and writes per-item explanations to S3. It is designed to answer a different question than the DIY judge: not “is this specific output good?” but “how does this model perform on this type of task?”
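The job reads its prompts from a JSONL file at the S3 location passed as `dataset_uri`. The sketch below serializes one record per line; the `prompt` and `referenceResponse` keys reflect my understanding of the Bedrock prompt dataset format and should be checked against the current documentation. The record contents are placeholders.

```python
import json


def to_jsonl(records: list[dict]) -> str:
    """Serialize one JSON object per line, as JSONL datasets expect."""
    lines = []
    for rec in records:
        if not rec.get("prompt"):
            raise ValueError("each record needs a non-empty 'prompt'")
        lines.append(json.dumps(rec, ensure_ascii=False))
    return "\n".join(lines) + "\n"


# Placeholder records; key names are an assumption to verify in the docs.
records = [
    {
        "prompt": "Summarize the following article in two sentences: ...",
        "referenceResponse": "A human-written reference summary goes here.",
    },
]

print(to_jsonl(records), end="")
```

The resulting text would be written to a file and uploaded to the bucket that `dataset_uri` points at before the job is created.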

This distinction matters. When I ran both approaches on the same dataset, the managed evaluation scored all items highly — because it generated fresh summaries with Nova Lite, and Nova Lite produces good summaries. The DIY judge caught the two bad summaries because it evaluated the specific outputs I gave it, including the intentionally vague ones.

Figure: Two approaches, two questions. The DIY judge evaluates your specific outputs with pass/fail critiques; Bedrock Model Evaluation generates fresh responses and scores the model.

Neither approach replaces the other. The managed evaluation helps you pick the right model before you build. The DIY judge helps you verify your application works after you ship.

The Practical Workflow

Here is how I would combine both approaches for a production AI application:

Phase 1 — Model selection (Bedrock Model Evaluation). Before writing any application code, use the managed evaluation to compare candidate models on your task. Upload a representative prompt dataset, run evaluations against 2-3 models, compare correctness and completeness scores. This takes an afternoon and saves you from building on the wrong foundation.

Phase 2 — Application-specific judge (DIY). Once you have picked a model and built your application, follow Husain’s process. Get your domain expert. Create a dataset of real outputs. Have them make pass/fail judgments with critiques. Build a judge prompt using those critiques as few-shot examples. Measure agreement. Iterate until you trust it.

Phase 3 — Continuous monitoring. Deploy the DIY judge as part of your pipeline. Run it on a sample of production outputs daily or weekly. Track agreement with periodic human reviews. When agreement drifts, go back to Phase 2.
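A minimal sketch of the monitoring loop, under stated assumptions: requests carry a stable ID, a fixed share of them is selected by hashing that ID, and drift means agreement with human reviewers falling more than a tolerance below the calibrated baseline. The function names, the 5% rate, and the thresholds are all illustrative choices.

```python
import hashlib


def in_sample(request_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of requests by hashing the ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


def has_drifted(judge: list[str], human: list[str], baseline: float,
                tolerance: float = 0.05) -> bool:
    """True when judge/human agreement falls more than `tolerance` below baseline."""
    if not judge or len(judge) != len(human):
        raise ValueError("need two non-empty label lists of equal length")
    agreement = sum(j == h for j, h in zip(judge, human)) / len(judge)
    return agreement < baseline - tolerance


sampled = [rid for rid in (f"req-{i}" for i in range(10_000)) if in_sample(rid, 0.05)]
print(f"sampled {len(sampled)} of 10000 requests")

# Invented review batch: the judge and the human disagree on two of five items.
judge_labels = ["pass", "pass", "fail", "pass", "pass"]
human_labels = ["pass", "fail", "fail", "fail", "pass"]
print(has_drifted(judge_labels, human_labels, baseline=0.90))
```

Hashing the ID instead of drawing a random number keeps the sample reproducible, so the same request is always in or out of the sample when a batch is re-run.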

The managed evaluation tells you the model can do the job. The DIY judge tells you your application is doing the job. The human review tells you the judge is still doing its job. Each layer catches what the others miss.

When One Judge Is Not Enough

AWS published a pattern for deploying multiple LLMs as a jury — independent models evaluating the same output, with statistical agreement metrics to measure consensus [6]. Their research found inter-model agreement up to 91%, compared to human-to-model agreement of 79%. The gap is telling: models agree with each other more than they agree with humans.

This is exactly Husain’s point. High inter-model agreement is not the same as high human alignment. A jury of three models that all share verbosity bias will unanimously reward padded answers. The jury pattern is useful for catching outliers and reducing variance, but it does not eliminate the need for human calibration.

Use the jury when you need confidence that a judgment is not a model-specific artifact. Use human labels when you need confidence that the judgment is correct.
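The difference between the two kinds of confidence shows up in a small sketch. The verdicts below are invented for three hypothetical judge models and one human reviewer; the code computes a majority-vote jury verdict, the pairwise agreement between judges, and the jury’s agreement with the human.

```python
from collections import Counter
from itertools import combinations


def majority_vote(votes: list[str]) -> str:
    """Most common verdict; an odd number of judges avoids pass/fail ties."""
    return Counter(votes).most_common(1)[0][0]


def pairwise_agreement(verdicts: dict[str, list[str]]) -> dict[tuple[str, str], float]:
    """Agreement rate for every pair of judges over the same items."""
    rates = {}
    for a, b in combinations(sorted(verdicts), 2):
        pairs = list(zip(verdicts[a], verdicts[b]))
        rates[(a, b)] = sum(x == y for x, y in pairs) / len(pairs)
    return rates


# Invented verdicts from three hypothetical judge models on four items.
verdicts = {
    "judge_a": ["pass", "pass", "fail", "pass"],
    "judge_b": ["pass", "pass", "fail", "fail"],
    "judge_c": ["pass", "pass", "pass", "pass"],
}
human = ["pass", "fail", "fail", "pass"]

jury = [majority_vote(list(v)) for v in zip(*verdicts.values())]
jury_vs_human = sum(j == h for j, h in zip(jury, human)) / len(human)

print(jury)
print(pairwise_agreement(verdicts))
print(jury_vs_human)
```

On the second item all three judges say "pass" while the human says "fail". Unanimity among models tells you nothing about that item, and only the human label exposes the miss.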

Start Here

If you are building an AI application and have no evaluation system, start with Step 3 of Husain’s guide [3]: have someone who understands the domain look at 30 outputs and mark each one pass or fail with a written explanation. That is it. No framework, no infrastructure, no managed service. Just a human looking at data.

Who is the “domain expert”? It does not have to be a PhD. Your product owner, a senior support engineer, or even a power user who knows what good output looks like. The bar is: can this person explain why a specific output is wrong? If yes, they qualify.

If that sounds too simple, consider that Husain has done this across 30+ implementations and his most consistent finding is that teams skip this step. They jump straight to automated metrics, build dashboards full of numbers nobody trusts, and wonder why their AI product is not improving.

The tools exist to automate what comes after. Bedrock Model Evaluation handles model comparison at scale. The Converse API lets you build custom judges in 50 lines of Python. For continuous monitoring, judge a sample (5-10%) of production outputs with a cheaper model — you do not need to evaluate everything in real-time to catch drift. But none of it works until a human has looked at the data and said what good looks like.

The code for the working example in this post is available on GitHub [4].

Sources

[1] Hamel Husain, “Verifying a LLM judge involves setting up a classic ML style test against human labels,” LinkedIn, May 2026. https://www.linkedin.com/posts/hamelhusain_verifying-a-llm-judge-involves-setting-up-share-7456024665988431873-UFkZ

[2] Hamel Husain and Shreya Shankar, “AI Evals For Engineers & PMs,” Maven. https://maven.com/parlance-labs/evals

[3] Hamel Husain, “Using LLM-as-a-Judge For Evaluation: A Complete Guide,” hamel.dev, October 2024. https://hamel.dev/blog/posts/llm-judge/

[4] Stefan Christoph, “LLM-as-a-Judge with Amazon Bedrock,” GitHub. https://github.com/stefanfreitag/llm-as-judge-bedrock

[5] “Amazon Bedrock Model Evaluation LLM-as-a-judge is now generally available,” AWS, March 2025. https://aws.amazon.com/about-aws/whats-new/2025/03/amazon-bedrock-model-evaluation-llm-as-a-judge/

[6] Sreyoshi Bhaduri et al., “AI judging AI: Scaling unstructured text analysis with Amazon Nova,” AWS Machine Learning Blog, August 2025. https://aws.amazon.com/blogs/machine-learning/ai-judging-ai-scaling-unstructured-text-analysis-with-amazon-nova/

About the Author

Stefan Christoph is a Principal Solutions Architect at AWS, focused on agentic AI, media & entertainment, and helping builders move from demo to production. He writes about AI architecture, developer productivity, and the future of software.

This is a personal blog. Opinions expressed here are my own and do not represent the views or positions of my employer.


Cross-posted to LinkedIn

❤️ Created with the support of AI (Kiro)