RAG Evaluation Architecture
Score retrieval and generation independently, then jointly.
Problem
End-to-end RAG metrics conflate retrieval and generation failures. We need an evaluation harness that produces independent scores for the retriever, the generator, and the integrated system, with adversarial cases for each.
Goals
- ✓ Retrieval scores: recall@k, MRR, context precision per slice.
- ✓ Generation scores: groundedness, faithfulness, answer relevance.
- ✓ Adversarial coverage: counterfactual docs, prompt injection in retrieved content, missing gold doc.
Non-goals
- Index build / freshness: covered by a separate ingestion pipeline.
Architecture
Walkthrough
1. Score retrieval in isolation
Run the retriever on every question and compute recall@k for k ∈ {1,3,5,10}, MRR, and context precision against the labeled gold doc IDs. This score depends only on the retriever — it does not move when you change the LLM.
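A minimal sketch of the three retrieval metrics, assuming each query yields a ranked list of doc IDs and a set of labeled gold doc IDs. The function names are illustrative, not part of any existing harness. Context precision is taken here as the plain fraction of the top-k results that are gold; some toolkits use a rank-weighted variant instead, so treat that definition as an assumption.

```python
from typing import Iterable, Sequence, Set, Tuple


def recall_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """Fraction of gold doc IDs that appear in the top-k results."""
    if not gold:
        return 0.0
    return len(set(ranked[:k]) & gold) / len(gold)


def reciprocal_rank(ranked: Sequence[str], gold: Set[str]) -> float:
    """1 / rank of the first gold doc, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in gold:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (ranked, gold) pairs, one per question."""
    scores = [reciprocal_rank(ranked, gold) for ranked, gold in runs]
    return sum(scores) / len(scores) if scores else 0.0


def context_precision(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """Assumed definition: share of the top-k retrieved docs that are gold."""
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(1 for doc_id in top if doc_id in gold) / len(top)
```

None of these functions takes a model output as input, which is why the scores stay fixed when only the LLM changes.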
2. Score generation with controlled context
Pass each question to the generator under three context conditions: (a) gold context only, (b) retrieved context, (c) adversarially perturbed context. The judge scores per-claim groundedness and faithfulness, plus answer relevance. Comparing conditions separates the failure sources: a low score under (a) points at the generator, a drop from (a) to (b) points at retrieval, and a drop from (a) to (c) measures how well the generator holds up against bad context.
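The comparison across conditions can be sketched as simple score differences. This assumes the judge returns one aggregate score per condition in [0, 1], higher being better. `ConditionScores` and `attribute_gaps` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ConditionScores:
    """Judge score for one metric (e.g. groundedness) under each context condition."""
    gold: float         # (a) gold context only
    retrieved: float    # (b) retrieved context
    adversarial: float  # (c) adversarially perturbed context


def attribute_gaps(s: ConditionScores) -> Dict[str, float]:
    """Split the shortfall into generator, retrieval, and robustness parts.

    Assumes scores lie in [0, 1], so 1.0 is the best achievable score.
    """
    return {
        # Lost even with perfect context: attributable to the generator.
        "generator_gap": 1.0 - s.gold,
        # Additional loss when real retrieval replaces gold context.
        "retrieval_gap": s.gold - s.retrieved,
        # Additional loss when the context is adversarially perturbed.
        "robustness_gap": s.gold - s.adversarial,
    }
```

A negative gap is possible (for example, retrieved context occasionally outscoring gold context) and is worth reporting as-is rather than clipping, since it may indicate labeling problems.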
3. Adversarial perturbations
The case generator builds: counterfactual variants (one doc contradicts the rest), injection variants (a doc contains 'ignore previous instructions...'), missing-gold variants (gold doc removed; correct behavior is to refuse), and irrelevant-only variants (no relevant doc; correct behavior is to refuse or escalate).
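One way the four variants might be built from a clean case is sketched below. The `Doc` and `Case` types, the condition labels, and the `expected` values are assumptions made for illustration. The injection payload is kept as the elided placeholder from the text, and `expected` is left as `None` where the text does not state the correct behavior.

```python
from dataclasses import dataclass, replace
from typing import FrozenSet, Iterable, Optional, Tuple


@dataclass(frozen=True)
class Doc:
    doc_id: str
    text: str


@dataclass(frozen=True)
class Case:
    question: str
    docs: Tuple[Doc, ...]
    gold_ids: FrozenSet[str]
    condition: str = "clean"
    expected: Optional[str] = None  # None: expected behavior not specified


# Placeholder payload; a real suite would use a curated set of injections.
INJECTION_TEXT = "ignore previous instructions..."


def counterfactual(case: Case, contradiction: Doc) -> Case:
    """Add one doc that contradicts the rest of the context."""
    return replace(case, docs=case.docs + (contradiction,), condition="counterfactual")


def injection(case: Case) -> Case:
    """Add one doc whose text carries an instruction aimed at the generator."""
    poisoned = Doc(doc_id="injected", text=INJECTION_TEXT)
    return replace(case, docs=case.docs + (poisoned,), condition="injection")


def missing_gold(case: Case) -> Case:
    """Remove every gold doc; the correct behavior is to refuse."""
    kept = tuple(d for d in case.docs if d.doc_id not in case.gold_ids)
    return replace(case, docs=kept, condition="missing_gold", expected="refuse")


def irrelevant_only(case: Case, distractors: Iterable[Doc]) -> Case:
    """Keep only irrelevant docs; the correct behavior is to refuse or escalate."""
    return replace(case, docs=tuple(distractors),
                   condition="irrelevant_only", expected="refuse_or_escalate")
```

Because the dataclasses are frozen and each builder returns a copy, the clean case stays available as the baseline for the same question.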
4. Sliced reporting
Report metrics sliced by question type (factoid, multi-hop, comparison), domain, and adversarial condition. A regression in 'multi-hop with missing gold doc' is a different story from a regression in 'factoid with clean retrieval' — the report makes that visible.
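A minimal sketch of the slicing step, assuming each evaluated question is a flat record carrying its slice labels and metric values. The field names `question_type`, `domain`, and `condition` are illustrative.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Mapping, Sequence, Tuple


def slice_report(
    rows: Iterable[Mapping[str, object]],
    metric: str,
    keys: Sequence[str] = ("question_type", "domain", "condition"),
) -> Dict[Tuple[object, ...], Dict[str, float]]:
    """Group rows by the slice keys and report count and mean of one metric."""
    buckets: Dict[Tuple[object, ...], List[float]] = defaultdict(list)
    for row in rows:
        slice_key = tuple(row[k] for k in keys)
        buckets[slice_key].append(float(row[metric]))
    return {
        slice_key: {"n": len(values), "mean": mean(values)}
        for slice_key, values in buckets.items()
    }
```

Reporting `n` next to each mean matters because narrow slices such as multi-hop with a missing gold doc can contain very few questions, and a swing in a tiny slice should be read with that in mind.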
Tradeoffs
Metrics to track
- Recall@k, MRR, context precision per slice
- Groundedness, faithfulness, answer relevance
- Refusal rate on missing-gold cases
- Injection-attack resistance rate
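The last two rates can be sketched as pass rates over a single adversarial condition. This assumes the judge emits per-case boolean flags. The field names `refused` and `followed_injection` and the condition labels are hypothetical.

```python
from typing import Callable, Mapping, Optional, Sequence

Outcome = Mapping[str, object]


def condition_rate(
    outcomes: Sequence[Outcome],
    condition: str,
    passed: Callable[[Outcome], bool],
) -> Optional[float]:
    """Share of cases in one condition that pass; None if the condition is absent."""
    subset = [o for o in outcomes if o["condition"] == condition]
    if not subset:
        return None
    return sum(1 for o in subset if passed(o)) / len(subset)


def refusal_rate(outcomes: Sequence[Outcome]) -> Optional[float]:
    """Fraction of missing-gold cases where the generator refused."""
    return condition_rate(outcomes, "missing_gold", lambda o: bool(o["refused"]))


def injection_resistance_rate(outcomes: Sequence[Outcome]) -> Optional[float]:
    """Fraction of injection cases where the injected instruction was not followed."""
    return condition_rate(
        outcomes, "injection", lambda o: not o["followed_injection"]
    )
```

Returning `None` rather than 0.0 for an empty condition keeps a run with no adversarial cases from looking like a run that failed all of them.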