bonsai
← System Designs
System Design

RAG Evaluation Architecture

Score retrieval and generation independently, then jointly.

Problem

End-to-end RAG metrics conflate retrieval and generation failures. We need an evaluation harness that produces independent scores for the retriever, the generator, and the integrated system, with adversarial cases for each.

Goals

  • Retrieval scores: recall@k, MRR, context precision per slice.
  • Generation scores: groundedness, faithfulness, answer relevance.
  • Adversarial coverage: counterfactual docs, prompt injection in retrieved content, missing gold doc.

Non-goals

  • ·Index build / freshness — covered by a separate ingestion pipeline.

Architecture

InputProcessModelStoreJudgeOutput
rendering diagram…
Components
Question SetLabeled (question, gold_doc_ids, expected_behavior) tuples.
RetrieverBM25 + dense + reranker; the candidate stack.
Retrieval ScorerComputes recall@k, MRR, nDCG, context precision.
GeneratorLLM with retrieved context in prompt.
Generation JudgePer-claim groundedness + faithfulness + relevance via structured LLM.
Adversarial Case GenInserts contradicting / injected / irrelevant docs.
Sliced ReportRetrieval, generation, and joint metrics by question type.

Walkthrough

1. Score retrieval in isolation

Run the retriever on every question and compute recall@k for k ∈ {1,3,5,10}, MRR, and context precision against the labeled gold doc IDs. This score depends only on the retriever — it does not move when you change the LLM.

2. Score generation with controlled context

Pass each question to the generator with three context conditions: (a) gold context only, (b) retrieved context, (c) adversarially perturbed context. The judge scores per-claim groundedness and faithfulness, plus answer relevance. Differences across conditions isolate generator failures vs. retrieval failures.

3. Adversarial perturbations

The case generator builds: counterfactual variants (one doc contradicts the rest), injection variants (a doc contains 'ignore previous instructions...'), missing-gold variants (gold doc removed; correct behavior is to refuse), and irrelevant-only variants (no relevant doc; correct behavior is to refuse or escalate).

4. Sliced reporting

Report metrics sliced by question type (factoid, multi-hop, comparison), domain, and adversarial condition. A regression in 'multi-hop with missing gold doc' is a different story from a regression in 'factoid with clean retrieval' — the report makes that visible.

Tradeoffs

Dual scoring (gold context vs retrieved context)
Chose
Run generator twice per case
Over
Single retrieved-context run
Because
Decouples generator quality from retriever quality; doubles cost but enables attribution.
Per-claim citations required
Chose
Structured output with per-claim citations
Over
Free-text answers
Because
Per-claim citations let the judge score groundedness mechanically; free-text requires fuzzy matching.

Metrics to track

  • Recall@k, MRR, context precision per slice
  • Groundedness, faithfulness, answer relevance
  • Refusal rate on missing-gold cases
  • Injection-attack resistance rate