The QA-for-AI curriculum
A working engineer's path through the field. Foundations to frontier — read in any order, but if you're new, start with What is QA for AI?
Foundations
3 lessons

What is QA for AI?
Why testing non-deterministic systems demands a new playbook.
Designing evals that actually catch regressions
From vibes to a dataset that pays rent.
Human labeling and calibration
Your eval set is only as honest as the humans who labeled it.
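One calibration check this lesson builds toward can be sketched in a few lines: Cohen's kappa, the agreement between two labelers corrected for chance. The labels below are invented for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Inter-annotator agreement, corrected for chance agreement."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    labels = set(labels_a) | set(labels_b)
    expected = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

a = ["pass", "pass", "fail", "pass", "fail", "fail"]
b = ["pass", "fail", "fail", "pass", "fail", "fail"]
print(round(cohens_kappa(a, b), 2))  # 0.67
```

Raw percent agreement here is 5/6, but kappa discounts the matches two coin-flipping annotators would produce anyway.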
Intermediate
5 lessons

LLM-as-judge: useful, biased, calibratable
Make a model grade another model — without lying to yourself.
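One concrete move from this lesson, sketched under assumptions: score the judge against human gold labels, tracking raw agreement and a leniency bias separately. The function name and data are hypothetical.

```python
def calibrate_judge(judge_verdicts, human_verdicts):
    """Compare an LLM judge's pass/fail calls (1/0) against human gold labels."""
    n = len(human_verdicts)
    agreement = sum(j == h for j, h in zip(judge_verdicts, human_verdicts)) / n
    # Positive bias: the judge passes more than humans do (lenient); negative: harsh.
    bias = (sum(judge_verdicts) - sum(human_verdicts)) / n
    return {"agreement": agreement, "leniency_bias": bias}

judge = [1, 1, 1, 0, 1, 1]  # 1 = pass
human = [1, 0, 1, 0, 1, 0]
print(calibrate_judge(judge, human))
```

Agreement alone hides the direction of the error; a judge that is 67% accurate because it waves everything through needs a different fix than one that is 67% accurate because it nitpicks.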
Evaluating structured outputs
Parse rate is not correctness — they're two different evals.
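The distinction can be made concrete with a minimal sketch (`eval_structured` is a hypothetical helper, not a library API):

```python
import json

def eval_structured(output: str, expected: dict):
    """Score parseability and semantic correctness as separate signals."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return {"parses": False, "correct": False}
    # Valid JSON, but does it contain the right values?
    return {"parses": True, "correct": parsed == expected}

print(eval_structured('{"city": "Oslo"}', {"city": "Oslo"}))  # parses and correct
print(eval_structured('{"city": "Olso"}', {"city": "Oslo"}))  # parses, wrong value
print(eval_structured('{"city": Oslo}', {"city": "Oslo"}))    # not even JSON
```

Reporting the two numbers separately tells you whether to fix the output format or the output content.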
Evaluating RAG: retrieval and generation are different problems
If you grade end-to-end you'll never know what's broken.
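A minimal sketch of the split, with hypothetical helpers; the generation grader here is a crude substring check, a stand-in for whatever judge you actually use:

```python
def retrieval_recall(retrieved_ids, gold_ids, k=5):
    """Did the retriever surface the documents needed to answer?"""
    return len(set(retrieved_ids[:k]) & set(gold_ids)) / len(gold_ids)

def generation_faithful(answer, gold_answer):
    """Placeholder generation grade; swap in your judge of choice."""
    return gold_answer.lower() in answer.lower()

# Grading the two stages separately tells you *which* one failed.
retrieved = ["doc7", "doc2", "doc9"]
print(retrieval_recall(retrieved, gold_ids=["doc2", "doc4"]))  # 0.5
print(generation_faithful("The answer is 42.", "42"))          # True
```

An end-to-end fail with recall at 1.0 is a generation bug; the same fail with recall at 0.0 means the model never saw the evidence.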
CI for prompts, models, and tools
Treat prompts like code — but accept that the build is probabilistic.
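One way to make a probabilistic build gate concrete, as a hedged sketch: run each case several times and gate on aggregate pass rate rather than a single all-or-nothing run. All names are hypothetical, and the model below is a random stand-in.

```python
import random

def run_eval_case(case, model, trials=5):
    """A flaky check passed once proves little; sample and take the rate."""
    passes = sum(model(case["prompt"]) == case["expected"] for _ in range(trials))
    return passes / trials

def ci_gate(cases, model, threshold=0.9):
    """Fail the build only when the aggregate pass rate drops below threshold."""
    rate = sum(run_eval_case(c, model) for c in cases) / len(cases)
    return rate >= threshold, rate

# Stand-in for a model call: right ~95% of the time on this toy case.
random.seed(0)
model = lambda prompt: "4" if random.random() < 0.95 else "5"
ok, rate = ci_gate([{"prompt": "2+2?", "expected": "4"}] * 20, model)
print(ok, rate)
```

The threshold and trial count are knobs you tune against flake cost; the point is that the gate is a rate, not a boolean per case.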
Cost and latency as quality signals
A perfect answer the user never waited for is a failed answer.
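The idea can be sketched as a latency-aware scoring rule; the linear-decay policy below is an arbitrary illustration, not a recommendation.

```python
def score_with_latency(correct: bool, latency_s: float, budget_s: float = 2.0):
    """An answer that misses the latency budget loses credit even if right."""
    if not correct:
        return 0.0
    # Full credit inside budget, linear decay to zero at 2x budget (arbitrary policy).
    if latency_s <= budget_s:
        return 1.0
    return max(0.0, 1.0 - (latency_s - budget_s) / budget_s)

print(score_with_latency(True, 1.2))  # 1.0
print(score_with_latency(True, 3.0))  # 0.5
print(score_with_latency(True, 5.0))  # 0.0
```

Folding latency (or cost per token) into the score keeps "make it slower but slightly better" tradeoffs visible in the same dashboard as accuracy.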
Advanced
3 lessons

Agent evals: trajectories, not outcomes
When the system uses tools, only grading the final answer is malpractice.
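A trajectory grade might look like this minimal sketch (the step schema and helper names are invented):

```python
def contains_in_order(called, required):
    """Is `required` a subsequence of the tools actually called?"""
    it = iter(called)
    return all(tool in it for tool in required)

def grade_trajectory(trajectory, expected_tools, final_ok):
    """Score the path the agent took, not just where it landed."""
    called = [step["tool"] for step in trajectory]
    return {
        "final_answer_ok": final_ok,
        "required_tools_in_order": contains_in_order(called, expected_tools),
        "extra_calls": len(called) - len(expected_tools),
    }

traj = [{"tool": "search"}, {"tool": "search"}, {"tool": "calculator"}]
print(grade_trajectory(traj, expected_tools=["search", "calculator"], final_ok=True))
```

A right answer reached through redundant or forbidden tool calls is a latent bug (and a cost problem); grading the trajectory surfaces it before the outcome metric ever moves.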
Red-teaming and adversarial testing
If you don't break it, your users will.
Drift, observability, and the production loop
Pre-launch evals are necessary; only production telemetry makes them sufficient.