<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>schristoph.online</title><link>https://schristoph.online/tags/llmevals/</link><description>Personal homepage and blog of Stefan Christoph</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><copyright>Stefan Christoph. All rights reserved.</copyright><lastBuildDate>Fri, 22 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://schristoph.online/tags/llmevals/index.xml" rel="self" type="application/rss+xml"/><item><title>Your AI Judge Needs a Judge</title><link>https://schristoph.online/blog/your-ai-judge-needs-a-judge/?utm=rss-feed</link><pubDate>Fri, 22 May 2026 00:00:00 +0000</pubDate><guid>https://schristoph.online/blog/your-ai-judge-needs-a-judge/</guid><description>&lt;blockquote>
&lt;p>&lt;strong>TL;DR:&lt;/strong> Most teams ship LLM judges without testing them against human labels. The result: judges that are confidently, consistently wrong on 30%+ of cases. Hamel Husain&amp;rsquo;s &amp;ldquo;critique shadowing&amp;rdquo; methodology (pass/fail judgments plus written critiques from a domain expert) builds trustworthy judges. Amazon Bedrock Model Evaluation handles model comparison at scale. You need both, plus periodic human review to catch drift.&lt;/p>&lt;/blockquote>
&lt;p>You would not ship a feature without testing it. You would not deploy a model without benchmarking it. But most teams ship their LLM judges without verifying them at all.&lt;/p></description></item></channel></rss>