
AI Evaluation Glossary: 25 Terms Every ML Team Needs in 2026
Author: The LayerLens Team
Published by the LayerLens team. LayerLens is continuous evaluation infrastructure for AI. Stratix is the evaluation engine: 200+ models, agentic benchmarks, judge optimization, and audit-ready comparisons across vendors.
TL;DR
The vocabulary of LLM and agent evaluation moved fast in 2025 and 2026. New terms (GEPA, deliberation panel, agentic judge) are now load-bearing in production stacks.
Old terms (benchmark, leaderboard, eval set) carry baggage that misleads teams that do not know which definition the speaker has in mind.
This glossary defines 25 terms a working ML team needs: Continuous Evaluation, Eval, Benchmark, Judge, LLM-as-Judge, Agent-as-Judge, Agentic Judge, Deliberation Panel, GEPA, Trace, Output-Level Eval, Step-Level Eval, Evaluation Space, Scorer, Rubric, Golden Set, Cohen's kappa, Silent Failure, Lucky Path, Position Bias, Length Bias, Cascade Eval, Replay, Pre-deployment Eval, Version Anchoring.
Each entry links to a deeper guide where one exists.
The vocabulary around LLM and agent evaluation moved fast in 2025 and 2026. New terms are now load-bearing in production stacks. Old terms carry baggage that can mislead a team that does not know which definition the speaker has in mind.
This glossary defines the 25 terms a working ML team needs in 2026. Each entry is short by design. Where a term has a deeper guide on the LayerLens blog, the link is in the entry.
1. Continuous evaluation
The practice of running automated evals on every model version, prompt change, judge update, and live trace, instead of a single benchmark sweep at release. The unit of measurement is the system in motion. Full guide.
2. Eval (evaluation)
A scored measurement of a system's behavior on a defined input or task. An eval has three parts: an input, a system response, and a grade. The grade can come from a human, a rule, or another model.
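One way to picture the three parts in code. This is an illustrative sketch, not a Stratix schema; the field names are ours.

```python
from dataclasses import dataclass

@dataclass
class Eval:
    """One scored measurement: an input, the system's response, and a grade."""
    input: str             # the prompt or task given to the system
    response: str          # what the system returned
    grade: float           # the score
    grader: str = "human"  # who or what produced the grade: a human, a rule, or a model

example = Eval(
    input="Summarize the refund policy in one sentence.",
    response="Refunds are issued within 14 days of purchase.",
    grade=1.0,
    grader="rule:exact_match",
)
```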
3. Benchmark
A frozen, public dataset with reference answers used to compare models. Useful for cross-model comparisons. Insufficient for production quality, because production prompts, tools, and traffic are not in the benchmark.
4. Judge
A model (or system of models) that grades another model's output against a rubric. Most production teams start with a single LLM-as-Judge and graduate up the 4-Generation Evaluation Ladder over time.
5. LLM-as-Judge (Gen 1)
A single LLM scoring an output against a rubric in one pass. Accuracy ceiling against human raters: roughly 70 percent. Cheap, fast, the right starting point.
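A minimal Gen 1 judge looks like this. The rubric and score format are illustrative, and `call_llm` is a stand-in for whichever model API you use.

```python
JUDGE_PROMPT = """You are grading a customer-support answer.
Rubric:
- 2: factually correct and directly answers the question
- 1: partially correct or incomplete
- 0: incorrect or off-topic
Question: {question}
Answer: {answer}
Reply with a single digit: 0, 1, or 2."""

def llm_as_judge(question: str, answer: str, call_llm) -> int:
    """Gen 1 judge: one LLM, one rubric, one pass, no tools."""
    reply = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    return int(reply.strip()[0])  # parse the single-digit score
```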
6. Agent-as-Judge (Gen 2)
A judge that is itself an agent: it can call tools, query references, and ground its score in retrieved evidence. Accuracy ceiling: roughly 85 percent.
7. Agentic Judge (Gen 3)
A multi-step judge that plans, executes sub-evaluations, and aggregates the results. Accuracy ceiling: roughly 90 percent agreement with humans.
8. Deliberation Panel (Gen 4)
Multiple judges (often heterogeneous models) debate and converge on a verdict. Accuracy ceiling: 96 to 98 percent. Currently a Stratix-only capability.
9. GEPA (Genetic Evolutionary Prompt Adaptation)
A prompt-optimization algorithm for LLM judges. Treats the judge prompt as a genome and evolves it over generations against a labeled fitness set. Typically lifts judge agreement with humans by 8 to 20 points. Full guide.
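A highly simplified sketch of that loop, assuming hypothetical `mutate_prompt` and `judge_agreement` helpers in place of the real operators. GEPA's actual mutation and selection steps are more involved; this only shows the shape of the algorithm.

```python
import random

def optimize_judge_prompt(seed_prompt, labeled_set, mutate_prompt, judge_agreement,
                          generations=10, population_size=8):
    """Evolve a judge prompt against a labeled fitness set.

    mutate_prompt(prompt) -> a rewritten candidate prompt
    judge_agreement(prompt, labeled_set) -> agreement with human labels in [0, 1]
    """
    population = [seed_prompt]
    for _ in range(generations):
        # Mutate survivors to refill the population.
        while len(population) < population_size:
            parent = random.choice(population)
            population.append(mutate_prompt(parent))
        # Fitness = agreement with human labels on the labeled set.
        ranked = sorted(population, key=lambda p: judge_agreement(p, labeled_set),
                        reverse=True)
        population = ranked[: population_size // 2]  # keep the top half
    return population[0]
```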
10. Trace
The full record of a single agent or LLM run: prompt, plan, tool calls, tool outputs, intermediate reasoning, and final response. A production system emits one trace per request.
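One way to picture a trace as data. The field names and values below are illustrative, not the Stratix trace schema.

```python
trace = {
    "trace_id": "req-0001",
    "prompt": "Find the cheapest refundable flight to Lisbon next Friday.",
    "plan": ["search_flights", "filter_refundable", "compare_prices"],
    "steps": [
        {"tool": "search_flights", "args": {"dest": "LIS"}, "output": "12 results"},
        {"tool": "filter_refundable", "args": {}, "output": "5 results"},
    ],
    "reasoning": ["Narrow to refundable fares before comparing prices."],
    "final_response": "The cheapest refundable fare is $214 on flight TP123.",
}
```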
11. Output-level evaluation
Scoring only the final response in a trace. Adequate for single-prompt systems. Misses lucky-path and reroll failures in agents.
12. Step-level evaluation
Scoring every tool call, retrieval, and intermediate decision inside a trace. Catches failures that output-level scoring misses. Full guide.
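In code, step-level evaluation is a loop over the trace, not a single call on the final response. A minimal sketch, assuming the trace structure above and a `score_step` callable that could be a rule check or an LLM judge:

```python
def evaluate_trace_steps(trace: dict, score_step) -> dict:
    """Score every tool call in a trace, not just the final response.

    score_step(step) -> float in [0, 1].
    """
    step_scores = [score_step(step) for step in trace["steps"]]
    return {
        "step_scores": step_scores,
        "all_steps_pass": bool(step_scores) and all(s >= 0.5 for s in step_scores),
    }
```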
13. Evaluation space
A scoped collection of traces, judges, scorers, and reference data tied to a single product surface or use case. Stratix uses evaluation spaces to keep enterprise data partitioned and judges versioned per surface.
14. Scorer
A deterministic grader: regex match, JSON-schema validator, exact match, BLEU, ROUGE, etc. Scorers are cheap and reliable but cannot grade open-ended quality. Used alongside judges, not instead of them.
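A few scorers of this kind, written as plain functions:

```python
import json
import re

def exact_match(output: str, reference: str) -> float:
    return float(output.strip() == reference.strip())

def regex_match(output: str, pattern: str) -> float:
    return float(re.search(pattern, output) is not None)

def valid_json(output: str) -> float:
    try:
        json.loads(output)
        return 1.0
    except json.JSONDecodeError:
        return 0.0
```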
15. Rubric
The scoring criteria a judge uses to grade a response. A weak rubric is a leading cause of low judge agreement with humans. A strong rubric is specific, scoped, and tied to evidence the judge can read.
16. Golden set
A fixed, hand-labeled dataset used as a regression-test target. Pre-deployment evals run against the golden set on every prompt or model change. Quality drift on the golden set is the first signal of a regression.
17. Cohen's kappa
A statistical measure of agreement between two raters that corrects for chance. Used to score how well a judge agrees with a human rater. 0.7 or higher is solid. A value near zero means the judge agrees with humans no better than chance, and 0.4 is only moderate agreement.
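The formula is kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. A small worked computation for binary pass/fail labels, showing why chance correction matters:

```python
from collections import Counter

def cohens_kappa(judge_labels, human_labels):
    """Cohen's kappa for two raters over the same items."""
    n = len(judge_labels)
    p_o = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_freq = Counter(judge_labels)
    human_freq = Counter(human_labels)
    # Chance agreement: probability both raters pick the same label independently.
    p_e = sum((judge_freq[c] / n) * (human_freq[c] / n)
              for c in set(judge_labels) | set(human_labels))
    return (p_o - p_e) / (1 - p_e)

# 80% raw agreement, but most items are "pass", so much of it is chance.
judge = ["pass"] * 8 + ["fail"] * 2
human = ["pass"] * 7 + ["fail", "pass", "fail"]
print(cohens_kappa(judge, human))  # ~0.375 despite 80% raw agreement
```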
18. Silent failure
A production failure that does not raise an error or surface in latency dashboards. The system returns an answer, the answer is wrong, no log line flags it. Continuous evaluation is the primary defense against silent failures.
19. Lucky path
A trace that produces the right final answer through wrong tool calls. Output-level scoring marks it correct. Step-level scoring catches it. Lucky paths are the textbook case for why agents need step-level evaluation.
20. Position bias
The tendency of an LLM judge to over-prefer responses that appear earlier in a comparison prompt. A common source of judge calibration error. Tuned judge prompts (often produced by GEPA) correct for it.
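The standard diagnostic is an order-swap check: run each pairwise comparison twice with the responses swapped and see how often the same slot wins. A minimal sketch, assuming a hypothetical `pairwise_judge(a, b)` that returns which response it prefers:

```python
def position_bias_rate(pairs, pairwise_judge):
    """Run each comparison twice with the order swapped.

    pairwise_judge(a, b) -> "first" or "second".
    If the same slot wins in both orderings, the preference followed
    position, not content.
    """
    position_driven = 0
    for a, b in pairs:
        if pairwise_judge(a, b) == pairwise_judge(b, a):
            position_driven += 1
    return position_driven / len(pairs)
```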
21. Length bias
The tendency of an LLM judge to score longer responses higher regardless of correctness. Another common calibration gap closed by judge optimization.
22. Cascade eval
A pipeline that runs cheap evals first (scorers, regex, format checks) and only escalates to expensive evals (LLM-as-Judge, agentic judge) on failures or sampled traces. Standard pattern for high-volume production systems.
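A cascade in miniature: deterministic scorers run on everything, the expensive judge only on failures or a sampled slice. The threshold logic and sample rate below are illustrative.

```python
import random

def cascade_eval(trace, cheap_scorers, llm_judge, sample_rate=0.05):
    """Run deterministic scorers first; escalate to the LLM judge only when needed."""
    scores = {name: fn(trace["final_response"]) for name, fn in cheap_scorers.items()}
    failed_cheap = any(score == 0.0 for score in scores.values())

    # Escalate on a cheap-check failure, or on a small random sample for coverage.
    if failed_cheap or random.random() < sample_rate:
        scores["llm_judge"] = llm_judge(trace)
    return scores
```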
23. Replay
Re-running a stored production trace against a new model, prompt, or judge to see how the system would have behaved. Replay is the cheapest way to test a model swap without exposing real users to it.
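A replay loop in miniature: stored production inputs re-run against the candidate system, with old and new outputs scored by the same judge. `run_system` and `judge` are stand-ins for your own stack.

```python
def replay(stored_traces, run_system, judge):
    """Re-run stored production inputs against a new model or prompt and re-score."""
    results = []
    for trace in stored_traces:
        new_response = run_system(trace["prompt"])  # candidate system, same input
        results.append({
            "trace_id": trace["trace_id"],
            "old_score": judge(trace["prompt"], trace["final_response"]),
            "new_score": judge(trace["prompt"], new_response),
        })
    regressions = [r for r in results if r["new_score"] < r["old_score"]]
    return results, regressions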
24. Pre-deployment eval
An automated eval suite that runs on every PR before merge. Blocks the change if a score drops below a threshold. The cheapest place in the pipeline to catch a regression.
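The gate itself is small: run the suite, compare against thresholds, exit nonzero so CI blocks the merge. A sketch assuming a `run_suite` callable that returns per-metric scores; the metric names and thresholds are illustrative.

```python
import sys

THRESHOLDS = {"accuracy": 0.90, "groundedness": 0.85}  # illustrative thresholds

def predeploy_gate(run_suite) -> int:
    """Return a nonzero exit code if any metric drops below its threshold."""
    scores = run_suite()  # e.g. {"accuracy": 0.93, "groundedness": 0.81}
    failures = {m: s for m, s in scores.items() if s < THRESHOLDS.get(m, 0.0)}
    for metric, score in failures.items():
        print(f"FAIL {metric}: {score:.2f} < {THRESHOLDS[metric]:.2f}")
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(predeploy_gate(lambda: {"accuracy": 0.93, "groundedness": 0.81}))
```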
25. Version anchoring
The discipline of attaching a model hash, prompt hash, judge hash, and evaluation-space ID to every score. Without version anchoring, a quality improvement on Tuesday cannot be reproduced on Friday. With it, every regression can be traced back to a specific upstream change.
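A minimal sketch of what an anchored score record can look like. The field names and example values are ours, not a Stratix format.

```python
import hashlib
import json
from datetime import datetime, timezone

def _hash(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()[:12]

def anchored_score(score: float, model_id: str, prompt: str, judge_prompt: str,
                   eval_space_id: str) -> dict:
    """Attach model, prompt, judge, and evaluation-space identifiers to a score."""
    return {
        "score": score,
        "model_hash": _hash(model_id),
        "prompt_hash": _hash(prompt),
        "judge_hash": _hash(judge_prompt),
        "eval_space_id": eval_space_id,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }

record = anchored_score(0.92, "model-2026-01-checkpoint", "You are a support agent...",
                        "Grade the answer 0-2...", "space_support_emails")
print(json.dumps(record, indent=2))
```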
Further reading
What is continuous evaluation? A working definition for production AI teams
Judge optimization with GEPA: tuning evaluation prompts at scale
Step-level evaluation versus output-level evaluation for agent traces
Open the Stratix Learning Hub
Every term in this glossary has a runnable example, a worked judge prompt, or a labeled dataset in the Stratix Learning Hub. Open the Hub here.
Key Takeaways
If a vendor says they support evaluation, ask which generation of judge architecture they ship: LLM-as-Judge (Gen 1), Agent-as-Judge (Gen 2), Agentic Judge (Gen 3), or Deliberation Panel (Gen 4). Each has a different accuracy ceiling.
Cohen's kappa above 0.7 is solid agreement between a judge and a human rater. A value near zero is a coin flip dressed up as a number, and 0.4 is only moderate.
Silent failures, lucky paths, and rerolls are the three failure modes output-level scoring cannot see. Step-level scoring is the standard fix.
Version anchoring (model hash, prompt hash, judge hash) is the discipline that makes any continuous evaluation work reproducible.
Frequently Asked Questions
What is a Deliberation Panel?
A Gen 4 evaluation architecture in which multiple judges (often heterogeneous models) debate and converge on a verdict. Accuracy ceiling: 96 to 98 percent agreement with humans. Currently a Stratix-only capability.
What is the difference between a scorer and a judge?
A scorer is a deterministic grader: regex match, JSON-schema validator, exact match, BLEU, ROUGE. A judge is a model that grades open-ended quality against a rubric. Scorers are cheap and reliable. Judges handle the open-ended cases scorers cannot. Most production stacks use both.
What is a golden set?
A fixed, hand-labeled dataset used as a regression-test target. Pre-deployment evals run against the golden set on every prompt or model change. Quality drift on the golden set is the first signal of a regression.
What is version anchoring and why does it matter?
Version anchoring is the discipline of attaching a model hash, prompt hash, judge hash, and evaluation-space ID to every score. Without it, a quality improvement on Tuesday cannot be reproduced on Friday. With it, every regression can be traced back to a specific upstream change.
Methodology
This glossary aggregates terms in active use in 2026 across LayerLens, Stratix evaluation spaces, and the broader LLM evaluation literature. Definitions are tightened to match how working ML teams use the terms, not the marketing usage that sometimes drifts from operational meaning.
Open the Stratix Learning Hub to run the templates and worked examples referenced in this post against your own data.