bonsai · Cultivate AI you can trust

Intermediate

4 questions · ~6 min

Evals, judges, and statistical sanity

Designing scoring that detects regressions you'd care about.

Q1

You have 200 eval cases and draw a single sample (n = 1) for each. You report a mean score of 0.78 and want to detect a 2-percentage-point regression. What's the issue?
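
To get a feel for the numbers in Q1, here is a rough sketch of the sampling noise involved. It assumes each case is scored pass/fail (so the mean is a proportion), that cases are independent, and that "2 points" means 0.02 on the 0 to 1 scale; none of this is stated in the question. The function names and the 80% power / 5% significance defaults are illustrative choices, and the sample-size formula is the standard unpaired normal approximation, not a prescribed method.

```python
import math


def standard_error(mean: float, n_cases: int) -> float:
    """Standard error of a pass rate, assuming independent binary scores."""
    return math.sqrt(mean * (1.0 - mean) / n_cases)


def cases_needed(mean: float, delta: float,
                 z_alpha: float = 1.96, z_beta: float = 0.84) -> int:
    """Approximate cases per run needed to detect a drop of `delta`
    when comparing two independent (unpaired) runs.

    z_alpha = 1.96 corresponds to a two-sided 5% test,
    z_beta = 0.84 to roughly 80% power (both are assumptions).
    """
    variance_of_difference = 2.0 * mean * (1.0 - mean)
    return math.ceil(variance_of_difference * (z_alpha + z_beta) ** 2 / delta ** 2)


se = standard_error(0.78, 200)      # roughly 0.029
half_width = 1.96 * se              # 95% interval half-width, roughly 0.057
required = cases_needed(0.78, 0.02) # several thousand cases

print(f"standard error: {se:.4f}")
print(f"95% half-width: {half_width:.4f}")
print(f"cases needed for a 0.02 drop (unpaired): {required}")
```

Under these assumptions the 95% interval around 0.78 is nearly three times wider than the 0.02 effect being looked for. A paired comparison on the same cases, or several samples per case, would typically need fewer cases than the unpaired estimate suggests.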

Q2

Which judge prompt design is MOST robust against verbosity bias?

Q3

Your judge's agreement with human raters is Cohen's κ = 0.42. What does this mean, and what should you do?
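
As background for Q3, the sketch below computes Cohen's κ by hand for two raters, using the standard definition κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is the agreement expected by chance from each rater's label frequencies. The label lists are made-up illustrative data, not results from any real judge, and the function name is hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: fraction of items given the same label.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if p_expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # so report perfect agreement by convention.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical pass (1) / fail (0) labels for ten items.
human = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
judge = [1, 1, 1, 0, 0, 0, 1, 1, 1, 0]

kappa = cohens_kappa(human, judge)
print(f"kappa = {kappa:.2f}")
```

In this toy data the raters agree on 7 of 10 items, yet chance alone would produce 50% agreement, so κ comes out near 0.4. A value like 0.42 sits at the low end of the band that the commonly cited Landis and Koch scale labels "moderate", which is why raw agreement percentages can look healthier than κ does.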

Q4

Which is the strongest argument for using rubrics over golden outputs in open-ended generation?