LLM-as-judge: useful, biased, calibratable

Make a model grade another model — without lying to yourself.

LLM-as-judge is one of the most cost-effective ways to score open-ended generations at scale. It is also one of the most common sources of false confidence in AI QA. Treat your judge like an unreliable contractor: measure it, calibrate it, constrain it.
