
RAG Evaluation Best Practices: A Complete Framework
Author: The LayerLens Team
About the author: Jake Meany leads marketing at LayerLens, where he focuses on continuous evaluation infrastructure for AI. He regularly evaluates production RAG systems and has published extensively on judgment engineering and AI evaluation.
TL;DR
RAG evaluation requires a four-layer stack covering retrieval quality, context relevance, generation faithfulness, and answer quality. Without systematic evaluation across all four layers, you'll miss critical failure modes. The best RAG systems treat evaluation as continuous: monitoring drift, collecting user signals, and running A/B tests in production.
Introduction
Retrieval-Augmented Generation (RAG) has become the standard approach for building knowledge-intensive AI systems. But RAG success depends on something that many teams overlook: systematic evaluation across the entire pipeline.
A RAG system can fail at multiple stages. Your retriever might pull irrelevant documents. Your context window might include noisy passages. Your language model might hallucinate despite seeing the right source material. Your answer might be factually correct but fail to address the user's actual question.
This post walks you through a complete evaluation framework that catches failures at each stage, explains how to measure them, and shows you how to integrate evaluation into your production workflow.
The Four-Layer Evaluation Stack
RAG evaluation breaks down into four distinct layers, each with its own metrics and failure modes:
Retrieval Quality: Did the retriever fetch relevant documents?
Context Relevance: Are the retrieved documents actually useful for answering the question?
Generation Faithfulness: Is the generated answer grounded in the source material?
Answer Quality: Does the answer satisfy the user's intent?
Most teams focus heavily on layer four (answer quality) while neglecting layers one through three. This is a mistake. If your retriever is pulling junk, no downstream layer can fix it. The framework forces you to think systematically about where your system actually breaks.
Layer 1: Retrieval Metrics
Retrieval quality measures whether your retriever is fetching relevant documents. These are your foundational metrics.
Precision@K answers: "Of the top K documents retrieved, how many are actually relevant?" If you retrieve 10 documents and 7 are useful, your Precision@10 is 0.7. This metric matters because users rarely scroll past the first few results.
Recall@K answers: "Of all relevant documents in the corpus, what fraction did we retrieve in the top K?" If there are 20 relevant documents total and you retrieve 15 of them in your top 100, Recall@100 is 0.75. This metric tells you if your retriever is missing important information entirely.
Mean Reciprocal Rank (MRR) measures how high the first relevant result appears. If the first relevant document is at position three, MRR contribution is 1/3. This metric captures user experience: a system where every relevant document is in the top five is better than one where relevant documents are scattered throughout.
NDCG (Normalized Discounted Cumulative Gain) combines ranking quality with relevance scoring. Unlike binary relevance metrics, NDCG acknowledges that some documents are more relevant than others. A document highly relevant to the query contributes more than a marginally relevant one.
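The four metrics above are each a few lines of code. Here is a minimal sketch, assuming binary relevance labels for Precision, Recall, and MRR, graded relevance scores for NDCG, and that `retrieved` lists document IDs in rank order:

```python
import math

def precision_at_k(retrieved, relevant, k):
    # Fraction of the top-k retrieved documents that are relevant.
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    # Fraction of all relevant documents found within the top k.
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    # 1 / rank of the first relevant document; 0 if none was retrieved.
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1 / rank
    return 0.0

def ndcg_at_k(retrieved, gains, k):
    # gains maps doc id -> graded relevance (0 = irrelevant).
    dcg = sum(gains.get(doc, 0) / math.log2(rank + 1)
              for rank, doc in enumerate(retrieved[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0
```

For a query where the retriever returns `["d3", "d1", "d7", "d2"]` and the relevant set is `{"d1", "d2"}`, Precision@2 is 0.5 and the reciprocal rank is 0.5, since the first relevant document sits at position two. MRR is then the mean of these reciprocal ranks across all queries in the evaluation set.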
To measure retrieval metrics, you need ground truth: for each query, what documents are actually relevant? For customer-facing systems, this means building a labeled evaluation set. Start with 100-200 query-document pairs where domain experts mark relevance. Then run your retriever and compute metrics.
Red flag: If your Precision@10 is below 0.6, your retriever is pulling too much noise and downstream systems can't compensate. If Recall@100 is below 0.7, you're missing critical information entirely.
Layer 2: Context Relevance
A retriever can fetch nominally relevant documents that still fail to support the question being asked. Context relevance measures whether the retrieved documents actually contain information useful for answering.
This layer exists because full-document retrieval often pulls documents with low information density. You might retrieve a 10,000-token document that contains only 200 tokens useful for the question. Your LLM has to sift through noise.
Measure context relevance through passage-level annotation: For each retrieved document, have evaluators mark which passages actually support the query. Then compute what fraction of the context window contains supporting information.
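Once evaluators have marked supporting passages, the computation is simple. A minimal sketch, assuming each retrieved document carries a total token count and the number of tokens marked as supporting (the field names are illustrative):

```python
def context_support_fraction(documents):
    """Fraction of the context window, by token count, that supports the query.

    Each document is a dict with 'total_tokens' and 'supporting_tokens'
    (the tokens evaluators marked as supporting the query).
    """
    total = sum(d["total_tokens"] for d in documents)
    supporting = sum(d["supporting_tokens"] for d in documents)
    return supporting / total if total else 0.0

# The example from the text: a 10,000-token document with 200 useful tokens.
docs = [{"total_tokens": 10_000, "supporting_tokens": 200}]
print(context_support_fraction(docs))  # → 0.02
```

A score of 0.02 means 98% of the context window is noise the model must sift through, which is exactly the low-information-density failure this layer is designed to catch.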
Alternatively, use NLI (Natural Language Inference) scoring: Feed the retrieved context plus the query to a natural language inference model (like a fine-tuned RoBERTa or a judge LLM). Does the context entail an answer to the query? This gives you a continuous score without manual annotation.
Red flag: If less than 40% of your context window contains supporting information, you should either refine your retriever to use smaller chunks or implement re-ranking to surface the most relevant passages first.
Layer 3: Generation Faithfulness
Faithfulness measures whether the generated answer is grounded in the source material. An answer can sound plausible but hallucinate facts not present in the context.
Claim-level verification breaks the generated answer into factual claims, then checks whether each claim is supported by the context. For example, if the answer says "LayerLens was founded in 2023," extract that claim and verify it appears in the context.
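The structure of claim-level verification can be sketched without an LLM, using sentence splitting for extraction and token overlap for support checking. A production system would use an LLM or NLI model for both steps; this sketch only illustrates the decompose-then-verify shape:

```python
def split_into_claims(answer):
    # Crude claim extraction: one claim per sentence.
    # Real systems use an LLM to decompose compound sentences.
    return [s.strip() for s in answer.split(".") if s.strip()]

def claim_supported(claim, context, min_overlap=0.8):
    # Crude support check: token overlap with the context.
    # Real systems use NLI entailment or an LLM judge here.
    claim_tokens = set(claim.lower().split())
    context_tokens = set(context.lower().split())
    return len(claim_tokens & context_tokens) / len(claim_tokens) >= min_overlap

def faithfulness_score(answer, context):
    # Fraction of extracted claims supported by the retrieved context.
    claims = split_into_claims(answer)
    if not claims:
        return 0.0
    return sum(claim_supported(c, context) for c in claims) / len(claims)

context = "LayerLens was founded in 2023 and builds evaluation tools."
answer = "LayerLens was founded in 2023. LayerLens has 500 employees."
print(faithfulness_score(answer, context))  # → 0.5
```

The first claim is fully supported, the second ("500 employees", an invented example) appears nowhere in the context, so faithfulness lands at 0.5, well below the 0.85 production threshold.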
Attribution scoring requires the model to cite which source passage supports each claim. A faithful system can always point to supporting evidence. If the model makes a claim without a citation, that's a signal of potential hallucination.
NLI-based faithfulness uses an inference model: Given the retrieved context as premises and the generated answer as hypothesis, does the context entail the answer? This catches subtle hallucinations where the answer is close to, but not actually supported by, the sources.
Red flag: If your faithfulness score is below 0.85, users will encounter hallucinations at meaningful frequency. This is a critical quality threshold for production systems.
Layer 4: Answer Quality
Answer quality measures whether the system actually answered the user's question correctly and entirely. This is what matters most to end users.
Relevance rubrics ask evaluators to score answers on a scale (e.g., 1-5) for how well they address the query. This is subjective but captures user-facing quality.
Completeness scoring checks whether the answer covers all major points needed to satisfy the question. If someone asks "What are the pricing tiers?" and the answer mentions only one tier, completeness is low.
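Completeness can be approximated by checking coverage of expert-labeled key points. A minimal sketch, assuming each benchmark query carries such a list (substring matching is a stand-in; a production version would match semantically with embeddings or an LLM judge):

```python
def completeness(answer, key_points):
    """Fraction of expected key points mentioned in the answer.

    key_points is an expert-labeled list of phrases the answer must cover.
    """
    if not key_points:
        return 0.0
    answer_lower = answer.lower()
    covered = sum(1 for point in key_points if point.lower() in answer_lower)
    return covered / len(key_points)

# "What are the pricing tiers?" with three (hypothetical) expected tiers:
tiers = ["free tier", "pro tier", "enterprise tier"]
answer = "We offer a free tier with limited usage."
print(completeness(answer, tiers))  # only 1 of 3 expected points covered
```

This matches the example in the text: an answer that mentions only one of several pricing tiers scores low on completeness even if everything it says is correct.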
User satisfaction signals (in production) include thumbs-up/thumbs-down ratings, session length, and whether users ask follow-up questions. These reveal what actually matters to your audience.
Best practice: Collect answer quality judgments from domain experts using a structured rubric. Aim for 300-500 labeled examples so you have a statistically meaningful benchmark.
Integrating End-to-End Evaluation
The four layers are interdependent. You can't improve answer quality without understanding where failures originate. End-to-end evaluation means instrumenting your system to capture signals at each layer simultaneously.
When a user reports an unsatisfactory answer, systematically ask: Which layer failed?
Did the retriever fetch relevant documents? (Layer 1)
Did the context contain supporting information? (Layer 2)
Did the model hallucinate despite good context? (Layer 3)
Or did the model answer correctly but miss the user's intent? (Layer 4)
This diagnostic approach prevents shooting in the dark. You'll know whether to invest in a better retriever, improve your chunking strategy, fine-tune your model, or refine your prompt.
Common RAG Failure Modes
Most RAG failures cluster into a few patterns. Knowing them helps you design evaluation that catches them early.
Lost-in-the-middle problem: When documents are too long, models attend poorly to information in the middle. Relevant facts get buried between noise. Solution: Use smaller chunks and re-ranking.
Retriever-generator mismatch: Your retriever optimizes for one signal but your generator needs different information. For example, a BM25 retriever might pull documents with high keyword overlap that contain wrong answers. Solution: Train your retriever on labels that your generator actually needs.
Context confusion: When you retrieve many documents, the model struggles to integrate information across them and may weight the wrong sources. Solution: Limit context window to top-3 or top-5 documents, or use an explicit multi-hop reasoner.
Temporal drift: Your evaluation set was built three months ago, but your corpus has changed. New documents have been added, old information has been updated. Solution: Re-evaluate regularly, especially after corpus updates.
Production Monitoring: From Evaluation to Operations
Evaluation doesn't end at launch. Production RAG systems drift constantly. User behavior changes. Your corpus evolves. Your LLM provider releases new models. Smart teams instrument production to catch degradation early.
Drift detection: Track your four-layer metrics over time. If Precision@10 drops from 0.8 to 0.6 month-over-month, something broke. This might be a corpus change, a retriever update, or a shift in query distribution.
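A drift check of this kind reduces to comparing the latest value of each metric against a stored baseline. A minimal sketch, with the tolerance value and metric history shapes as assumptions:

```python
def detect_drift(history, baseline, tolerance=0.1):
    """Flag metrics whose latest value dropped more than `tolerance`
    below baseline. `history` maps metric name -> list of monthly values.
    """
    alerts = []
    for metric, values in history.items():
        latest = values[-1]
        if baseline[metric] - latest > tolerance:
            alerts.append((metric, baseline[metric], latest))
    return alerts

baseline = {"precision@10": 0.8, "faithfulness": 0.9}
history = {
    "precision@10": [0.81, 0.78, 0.60],  # the month-over-month drop from the text
    "faithfulness": [0.90, 0.91, 0.89],
}
print(detect_drift(history, baseline))  # flags precision@10, not faithfulness
```

In practice you would run this on a schedule and route alerts to whoever owns the retriever, since a layer-1 drop is a leading indicator for layer-4 degradation.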
User feedback signals: Collect thumbs-up/thumbs-down ratings on every answer. Track hallucination reports. Monitor follow-up questions. These are early warning signs of quality degradation.
Hallucination rate tracking: Run periodic audits where you sample 100 recent answers and manually score faithfulness. If hallucination rate exceeds your tolerance threshold, trigger an investigation.
A/B testing: Don't change your system and hope for the best. A/B test retriever changes, prompt changes, and model upgrades. Compare against your baseline using your four-layer evaluation framework.
Building Continuous Evaluation with Stratix
Manual evaluation doesn't scale. For production RAG systems, you need continuous evaluation. This is where Stratix comes in.
Stratix is LayerLens' continuous evaluation infrastructure. Instead of running evaluation once and moving on, Stratix lets you define benchmarks, create judges, and run ongoing evaluations as your system and data change.
Here's a practical example using the Stratix Python SDK:
Initialize the Stratix client, create a RAG evaluation benchmark with query-ground truth pairs, build a faithfulness judge backed by a strong model, run evaluations, and poll for results. The SDK handles versioning, history tracking, and comparison across runs automatically.
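The actual SDK calls aren't reproduced here, so the sketch below uses a stand-in stub client. Every class and method name is hypothetical; the stub only illustrates the workflow the paragraph describes (define a benchmark, attach a judge, run an evaluation, poll for results):

```python
# Hypothetical stand-in for the Stratix client; the real SDK's names differ.
class FakeStratixClient:
    def __init__(self):
        self._runs = {}

    def create_benchmark(self, name, examples):
        # examples: list of {"query": ..., "ground_truth": ...} pairs.
        return {"name": name, "examples": examples}

    def create_judge(self, name, rubric, model):
        return {"name": name, "rubric": rubric, "model": model}

    def run_evaluation(self, benchmark, judge):
        run_id = f"run-{len(self._runs) + 1}"
        # A real client would dispatch async evaluation jobs here.
        self._runs[run_id] = {"status": "completed", "faithfulness": 0.91}
        return run_id

    def get_results(self, run_id):
        return self._runs[run_id]

client = FakeStratixClient()
bench = client.create_benchmark("rag-eval", [
    {"query": "What are the pricing tiers?",
     "ground_truth": "Free, Pro, Enterprise"},
])
judge = client.create_judge(
    "faithfulness-judge",
    rubric="Is every claim in the answer supported by the context? Rate 1-5.",
    model="strong-judge-model",
)
run_id = client.run_evaluation(bench, judge)
print(client.get_results(run_id)["status"])  # → completed
```

The point of the shape, regardless of exact method names, is that benchmarks and judges are first-class stored objects, which is what makes run-over-run comparison possible.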
The Stratix approach handles the bookkeeping: It stores your benchmark, versions your judges, tracks evaluation history, and lets you compare runs over time. You can ask questions like "Has my faithfulness score improved since we updated the retriever last week?"
Building Production-Grade Judges
Stratix uses LLM-based judges for most four-layer evaluations. A good judge is more than a prompt. It needs clear rubrics, calibration on your domain, and regular optimization.
Rubric design: Define what you're measuring precisely. Don't ask a judge "Is this answer good?" Instead ask "Does this answer address all the key points mentioned in the query? Rate 1-5." Specificity drives consistency.
Judge calibration: Build a small labeled set (50-100 examples) where you've manually scored quality. Use Stratix to optimize the judge against your labels. The optimization process tunes the judge's weights and decision boundaries.
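Calibration quality is ultimately a question of agreement with your human labels. A minimal sketch of one common check, agreement within one point on a 1-5 rubric scale (the sample scores are illustrative):

```python
def judge_agreement(judge_scores, human_scores, tolerance=1):
    """Fraction of examples where the judge lands within `tolerance`
    of the human label on a 1-5 rubric scale.
    """
    assert len(judge_scores) == len(human_scores)
    close = sum(1 for j, h in zip(judge_scores, human_scores)
                if abs(j - h) <= tolerance)
    return close / len(judge_scores)

# In practice these come from your 50-100 example calibration set:
human = [5, 4, 2, 3, 5, 1, 4, 2]
judge = [5, 3, 2, 4, 4, 2, 4, 1]
print(judge_agreement(judge, human))  # → 1.0 (every score within ±1)
```

If agreement is low after optimization, the usual culprits are an ambiguous rubric or labels that the human annotators themselves disagreed on; fix those before tuning the judge further.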
Judge versioning: Keep the full history of judge versions. If you optimize a judge and scores drop, you can revert. If regulatory review asks "how did you measure quality," you have an audit trail.
Practical Implementation Checklist
Getting RAG evaluation right is a multi-week effort, not a weekend project. Here's what production implementation looks like:
Week 1: Build ground truth. Label 150-200 query-document pairs for retrieval quality and 100-150 for answer quality.
Week 2: Establish baseline metrics. Measure all four layers on your current system. Identify obvious failure modes.
Week 3: Design judges. Create detailed rubrics for faithfulness and context relevance. Build a small calibration set.
Week 4: Set up continuous evaluation. Wire Stratix into your evaluation pipeline. Test that it runs without human intervention.
Weeks 5+: Iterate. Use evaluation insights to guide improvements. Re-evaluate after each change. Track long-term trends.
Key Takeaways
RAG evaluation requires four layers: retrieval, context relevance, faithfulness, and answer quality.
Most teams over-index on answer quality and miss retrieval and faithfulness problems.
Without evaluation, you're flying blind. The four-layer framework gives you diagnostics.
Production RAG systems must monitor drift and collect user signals continuously.
Continuous evaluation infrastructure (like Stratix) is essential for scaling beyond manual spot-checks.
LLM-based judges can automate evaluation but require careful rubric design and calibration.
Frequently Asked Questions
What if I don't have labeled ground truth?
Start small. Label 50 examples manually using domain experts. That's enough to establish a baseline. Then use user feedback to iteratively improve your labels. Stratix judges can bootstrap from weak signals (thumbs-up/thumbs-down) to provide continuous feedback without manual labeling on every answer.
How often should I re-evaluate?
At minimum, quarterly. But best practice is continuous. Every time you change your retriever, update your chunks, or upgrade your model, run evaluation. Monthly production monitoring (sampling recent queries) catches drift early.
Should I use my own embedding model or a commercial API?
Commercial embeddings (OpenAI, Cohere) are reliable starting points but may not capture domain-specific nuances. If your RAG task is specialized (legal documents, medical data), fine-tuning an embedding model on your domain can improve retrieval significantly. Stratix lets you evaluate both approaches and measure the trade-off.
How do I balance retrieval speed and quality?
Precision@5 and Precision@10 are your friends. A fast retriever that gets the top 5 results right (even if it returns 100 total) is often better than a slow retriever that ranks documents perfectly. Use A/B testing to find your speed-quality frontier.
Can I use a single judge for all four layers?
Technically yes, but not recommended. Each layer tests different capabilities. A judge that's well-calibrated for faithfulness might struggle with context relevance. Build separate judges per layer, then compose them for end-to-end scoring. Stratix supports this workflow through judge chaining.
What's the relationship between retrieval metrics and answer quality?
Retrieval is necessary but not sufficient. You can have high Recall and still get bad answers if your context is confusing or your model hallucinates. The four-layer model helps you untangle correlation from causation. If Recall is high but answer quality is low, the problem is downstream.
How do I know when my RAG system is ready for production?
When all four layers meet your thresholds: Precision@10 greater than 0.7, context relevance score greater than 0.75, faithfulness greater than 0.85, and answer quality greater than 4.0 on a 5-point scale (from expert evaluation). But also establish drift-detection guardrails so you catch degradation after launch.
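These thresholds can be encoded as a simple launch gate. A sketch, with the threshold values taken from this answer and the metric names as illustrative assumptions:

```python
READINESS_THRESHOLDS = {
    "precision_at_10": 0.70,   # layer 1: retrieval
    "context_relevance": 0.75, # layer 2
    "faithfulness": 0.85,      # layer 3
    "answer_quality": 4.0,     # layer 4, on a 1-5 expert rubric
}

def production_ready(metrics):
    """Return the metrics still at or below threshold (empty list = ready)."""
    return [name for name, floor in READINESS_THRESHOLDS.items()
            if metrics.get(name, 0.0) <= floor]

current = {"precision_at_10": 0.74, "context_relevance": 0.80,
           "faithfulness": 0.82, "answer_quality": 4.2}
print(production_ready(current))  # → ['faithfulness']
```

A gate like this makes the go/no-go decision explicit and, paired with the drift checks above launch criteria become ongoing operating criteria.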
Should I fine-tune my LLM for RAG or use a base model with good retrieval?
Start with a strong base model (Claude Sonnet 4.6, GPT-4o) and excellent retrieval. Most RAG problems are retrieval problems. If you've optimized retrieval and context relevance but answer quality is still low, then consider fine-tuning. Use Stratix to run A/B tests comparing base vs. fine-tuned models.
How do I handle multi-hop questions that require synthesizing information across documents?
Standard retrieval often fails on multi-hop because single-stage retrieval doesn't see the full question context. Solutions: (1) Use an agentic approach where the model iteratively retrieves based on intermediate reasoning. (2) Create synthetic multi-hop examples in your benchmark and evaluate explicitly. (3) Use Stratix judges that score multi-hop reasoning quality separately from single-hop faithfulness.
What metrics matter most for production RAG?
Faithfulness (layer 3) and answer quality (layer 4) matter most for user experience. But retrieval (layer 1) and context relevance (layer 2) are leading indicators. If layer 1 degrades, layer 4 will follow weeks later. Monitor all four, but prioritize diagnostics on layers 1-2 when something breaks.
Methodology Note
This framework draws from academic RAG evaluation literature (particularly recent work on multi-layer evaluation) combined with production experience from building Stratix at LayerLens. The four-layer model has been validated on dozens of production RAG systems. Specific metrics (Precision@K, NDCG) are standard in information retrieval; faithfulness measurement via NLI is an active research area where we've seen strong practical results.
Next Steps
If you're building a RAG system, start here: (1) Define your four-layer metrics. (2) Build ground truth for at least retrieval and answer quality. (3) Establish baselines on your current system. (4) Set up continuous evaluation with Stratix. (5) Use evaluation insights to prioritize improvements.
For deeper technical dives, see our other articles: Retrieval Strategies for RAG, Chunking Strategies That Actually Work, Fine-tuning vs. Retrieval: When to Do Each, and Detecting and Mitigating LLM Hallucinations in RAG.