Foundations
~12 min
labeling
inter-rater-agreement
gold-sets
Human labeling and calibration
Your eval set is only as honest as the humans who labeled it.
Step 1 of 14
Every automated eval eventually traces back to a human judgment. Gold labels train your judges, calibrate your rubrics, and arbitrate every dispute about whether the model is improving. Carelessly written labels silently contaminate every downstream signal for months. Labeling is not a clerical step; it is the foundation the rest of the program stands on.