Intermediate
4 questions · ~6 min
Evals, judges, and statistical sanity
Designing scoring that detects regressions you'd care about.
Q1
You have 200 eval cases and run n=1 sample each. You report mean = 0.78 and want to detect a 2-point regression. What's the issue?
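The arithmetic behind this question can be sketched as follows. This is a rough check, not a full power analysis. It assumes each case is scored pass/fail (so the mean is a binomial proportion) and that the two runs being compared are independent; neither assumption is stated in the question, and the names `binomial_standard_error` and `cases_needed` are illustrative.

```python
import math

def binomial_standard_error(mean: float, n_cases: int) -> float:
    """Standard error of a pass rate, assuming independent 0/1 scores."""
    return math.sqrt(mean * (1.0 - mean) / n_cases)

n_cases = 200        # from the question
mean_score = 0.78    # from the question
regression = 0.02    # the 2-point drop we want to detect

# Uncertainty of a single run's mean.
se = binomial_standard_error(mean_score, n_cases)
half_width = 1.96 * se  # approximate 95% margin of error

# Uncertainty of the difference between two independent runs.
se_diff = math.sqrt(2) * se
margin_diff = 1.96 * se_diff

# Cases needed for the 95% margin on the difference to shrink to the
# regression size (assumption: unpaired comparison, same pass rate).
cases_needed = math.ceil(
    2 * mean_score * (1 - mean_score) * (1.96 / regression) ** 2
)

print(f"standard error of the mean: {se:.4f}")
print(f"95% margin on one run:      {half_width:.4f}")
print(f"95% margin on a difference: {margin_diff:.4f}")
print(f"cases needed (unpaired):    {cases_needed}")
```

Under these assumptions the margin on a single run is several times larger than the 2-point effect. Scoring the same cases before and after (a paired comparison) or drawing more samples per case would typically tighten the estimate.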
Q2
Which judge prompt design is MOST robust against verbosity bias?
Q3
Your judge agreement with humans is Cohen's κ = 0.42. What does this mean and what should you do?
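For reference, Cohen's κ compares observed agreement with the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). The sketch below computes it from two label lists. The function name `cohen_kappa` and the toy labels are illustrative assumptions, not data from the quiz.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items where both raters match.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each rater's marginal label rates,
    # summed over all labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    p_expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(counts_a) | set(counts_b)
    )

    if p_expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_observed - p_expected) / (1.0 - p_expected)

# Toy example (hypothetical labels): 1 = pass, 0 = fail.
judge = [1, 1, 0, 0]
human = [1, 0, 0, 0]
kappa = cohen_kappa(judge, human)
print(f"kappa = {kappa:.2f}")
```

In the toy example the raters agree on 75% of items, but half of that agreement is expected by chance, so κ comes out well below the raw agreement rate.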
Q4