System Design

RAG Evaluation Architecture

Score retrieval and generation independently, then jointly.

Problem

End-to-end RAG metrics conflate retrieval and generation failures. We need an evaluation harness that produces independent scores for the retriever, the generator, and the integrated system, with adversarial cases for each.

Goals

✓Retrieval scores: recall@k, MRR, context precision per slice.
✓Generation scores: groundedness, faithfulness, answer relevance.
✓Adversarial coverage: counterfactual docs, prompt injection in retrieved content, missing gold doc.

Non-goals

·Index build / freshness — covered by a separate ingestion pipeline.

Architecture

InputProcessModelStoreJudgeOutput

rendering diagram…

Components

Question SetLabeled (question, gold_doc_ids, expected_behavior) tuples.

RetrieverBM25 + dense + reranker; the candidate stack.

Retrieval ScorerComputes recall@k, MRR, nDCG, context precision.

GeneratorLLM with retrieved context in prompt.

Generation JudgePer-claim groundedness + faithfulness + relevance via structured LLM.

Adversarial Case GenInserts contradicting / injected / irrelevant docs.

Sliced ReportRetrieval, generation, and joint metrics by question type.

Walkthrough

1. Score retrieval in isolation

Run the retriever on every question and compute recall@k for k ∈ {1,3,5,10}, MRR, and context precision against the labeled gold doc IDs. This score depends only on the retriever — it does not move when you change the LLM.

2. Score generation with controlled context

Pass each question to the generator with three context conditions: (a) gold context only, (b) retrieved context, (c) adversarially perturbed context. The judge scores per-claim groundedness and faithfulness, plus answer relevance. Differences across conditions isolate generator failures vs. retrieval failures.

3. Adversarial perturbations

The case generator builds: counterfactual variants (one doc contradicts the rest), injection variants (a doc contains 'ignore previous instructions...'), missing-gold variants (gold doc removed; correct behavior is to refuse), and irrelevant-only variants (no relevant doc; correct behavior is to refuse or escalate).

4. Sliced reporting

Report metrics sliced by question type (factoid, multi-hop, comparison), domain, and adversarial condition. A regression in 'multi-hop with missing gold doc' is a different story from a regression in 'factoid with clean retrieval' — the report makes that visible.

Tradeoffs

Dual scoring (gold context vs retrieved context)

Chose

Run generator twice per case

Over

Single retrieved-context run

Because

Decouples generator quality from retriever quality; doubles cost but enables attribution.

Per-claim citations required

Chose

Structured output with per-claim citations

Over

Free-text answers

Because

Per-claim citations let the judge score groundedness mechanically; free-text requires fuzzy matching.

Metrics to track

Recall@k, MRR, context precision per slice
Groundedness, faithfulness, answer relevance
Refusal rate on missing-gold cases
Injection-attack resistance rate