
AI Evaluation Glossary: 25 Terms Every ML Team Needs in 2026
Author: The LayerLens Team
Published by the LayerLens team. LayerLens is continuous evaluation infrastructure for AI. Stratix is the evaluation engine: 200+ models, agentic benchmarks, judge optimization, and audit-ready comparisons across vendors.
TL;DR
The vocabulary of LLM and agent evaluation moved fast in 2025 and 2026. New terms (GEPA, deliberation panel, agentic judge) are now load-bearing in production stacks.
Old terms (benchmark, leaderboard, eval set) carry baggage that misleads teams that do not know which definition the speaker has in mind.
This glossary defines 25 terms a working ML team needs: Continuous Evaluation, Eval, Benchmark, Judge, LLM-as-Judge, Agent-as-Judge, Agentic Judge, Deliberation Panel, GEPA, Trace, Output-Level Eval, Step-Level Eval, Evaluation Space, Scorer, Rubric, Golden Set, Cohen's kappa, Silent Failure, Lucky Path, Position Bias, Length Bias, Cascade Eval, Replay, Pre-deployment Eval, Version Anchoring.
Each entry links to a deeper guide where one exists.
The vocabulary around LLM and agent evaluation moved fast in 2025 and 2026. New terms are now load-bearing in production stacks. Old terms carry baggage that can mislead a team that does not know which definition the speaker has in mind.
This glossary defines the 25 terms a working ML team needs in 2026. Each entry is short by design. Where a term has a deeper guide on the LayerLens blog, the link is in the entry.
1. Continuous evaluation
The practice of running automated evals on every model version, prompt change, judge update, and live trace, instead of a single benchmark sweep at release. The unit of measurement is the system in motion. Full guide.
2. Eval (evaluation)
A scored measurement of a system's behavior on a defined input or task. An eval has three parts: an input, a system response, and a grade. The grade can come from a human, a rule, or another model.
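One way to picture the three parts in code. This is an illustrative sketch, not a Stratix schema; the field names are ours.

```python
from dataclasses import dataclass

@dataclass
class Eval:
    """One scored measurement: an input, the system's response, and a grade."""
    input: str             # the prompt or task given to the system
    response: str          # what the system returned
    grade: float           # the score
    grader: str = "human"  # who or what produced the grade: a human, a rule, or a model

example = Eval(
    input="Summarize the refund policy in one sentence.",
    response="Refunds are issued within 14 days of purchase.",
    grade=1.0,
    grader="rule:exact_match",
)
```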
3. Benchmark
A frozen, public dataset with reference answers used to compare models. Useful for cross-model comparisons. Insufficient for production quality, because production prompts, tools, and traffic are not in the benchmark.
4. Judge
A model (or system of models) that grades another model's output against a rubric. Most production teams start with a single LLM-as-Judge and graduate up the 4-Generation Evaluation Ladder over time.
5. LLM-as-Judge (Gen 1)
A single LLM scoring an output against a rubric in one pass. Accuracy ceiling against human raters: roughly 70 percent. Cheap, fast, the right starting point.
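A minimal Gen 1 judge looks like this. The rubric and score format are illustrative, and `call_llm` is a stand-in for whichever model API you use.

```python
JUDGE_PROMPT = """You are grading a customer-support answer.
Rubric:
- 2: factually correct and directly answers the question
- 1: partially correct or incomplete
- 0: incorrect or off-topic
Question: {question}
Answer: {answer}
Reply with a single digit: 0, 1, or 2."""

def llm_as_judge(question: str, answer: str, call_llm) -> int:
    """Gen 1 judge: one LLM, one rubric, one pass, no tools."""
    reply = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    return int(reply.strip()[0])  # parse the single-digit score
```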
6. Agent-as-Judge (Gen 2)
A judge that is itself an agent: it can call tools, query references, and ground its score in retrieved evidence. Accuracy ceiling: roughly 85 percent.
7. Agentic Judge (Gen 3)
A multi-step judge that plans, executes sub-evaluations, and aggregates the results. Accuracy ceiling: roughly 90 percent agreement with humans.
8. Deliberation Panel (Gen 4)
Multiple judges (often heterogeneous models) debate and converge on a verdict. Accuracy ceiling: 96 to 98 percent. Currently a Stratix-only capability.
9. GEPA (Genetic Evolutionary Prompt Adaptation)
A prompt-optimization algorithm for LLM judges. Treats the judge prompt as a genome and evolves it over generations against a labeled fitness set. Typically lifts judge agreement with humans by 8 to 20 points. Full guide.
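A highly simplified sketch of that loop, assuming hypothetical `mutate_prompt` and `judge_agreement` helpers in place of the real operators. GEPA's actual mutation and selection steps are more involved; this only shows the shape of the algorithm.

```python
import random

def optimize_judge_prompt(seed_prompt, labeled_set, mutate_prompt, judge_agreement,
                          generations=10, population_size=8):
    """Evolve a judge prompt against a labeled fitness set.

    mutate_prompt(prompt) -> a rewritten candidate prompt
    judge_agreement(prompt, labeled_set) -> agreement with human labels in [0, 1]
    """
    population = [seed_prompt]
    for _ in range(generations):
        # Mutate survivors to refill the population.
        while len(population) < population_size:
            parent = random.choice(population)
            population.append(mutate_prompt(parent))
        # Fitness = agreement with human labels on the labeled set.
        ranked = sorted(population, key=lambda p: judge_agreement(p, labeled_set),
                        reverse=True)
        population = ranked[: population_size // 2]  # keep the top half
    return population[0]
```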
10. Trace
The full record of a single agent or LLM run: prompt, plan, tool calls, tool outputs, intermediate reasoning, and final response. A production system emits one trace per request.
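One way to picture a trace as data. The field names and values below are illustrative, not the Stratix trace schema.

```python
trace = {
    "trace_id": "req-0001",
    "prompt": "Find the cheapest refundable flight to Lisbon next Friday.",
    "plan": ["search_flights", "filter_refundable", "compare_prices"],
    "steps": [
        {"tool": "search_flights", "args": {"dest": "LIS"}, "output": "12 results"},
        {"tool": "filter_refundable", "args": {}, "output": "5 results"},
    ],
    "reasoning": ["Narrow to refundable fares before comparing prices."],
    "final_response": "The cheapest refundable fare is $214 on flight TP123.",
}
```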
11. Output-level evaluation
Scoring only the final response in a trace. Adequate for single-prompt systems. Misses lucky-path and reroll failures in agents.
12. Step-level evaluation
Scoring every tool call, retrieval, and intermediate decision inside a trace. Catches failures that output-level scoring misses. Full guide.
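In code, step-level evaluation is a loop over the trace, not a single call on the final response. A minimal sketch, assuming the trace structure above and a `score_step` callable that could be a rule check or an LLM judge:

```python
def evaluate_trace_steps(trace: dict, score_step) -> dict:
    """Score every tool call in a trace, not just the final response.

    score_step(step) -> float in [0, 1].
    """
    step_scores = [score_step(step) for step in trace["steps"]]
    return {
        "step_scores": step_scores,
        "all_steps_pass": bool(step_scores) and all(s >= 0.5 for s in step_scores),
    }
```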
13. Evaluation space
A scoped collection of traces, judges, scorers, and reference data tied to a single product surface or use case. Stratix uses evaluation spaces to keep enterprise data partitioned and judges versioned per surface.
14. Scorer
A deterministic grader: regex match, JSON-schema validator, exact match, BLEU, ROUGE, etc. Scorers are cheap and reliable but cannot grade open-ended quality. Used alongside judges, not instead of them.
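A few scorers of this kind, written as plain functions:

```python
import json
import re

def exact_match(output: str, reference: str) -> float:
    return float(output.strip() == reference.strip())

def regex_match(output: str, pattern: str) -> float:
    return float(re.search(pattern, output) is not None)

def valid_json(output: str) -> float:
    try:
        json.loads(output)
        return 1.0
    except json.JSONDecodeError:
        return 0.0
```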
15. Rubric
The scoring criteria a judge uses to grade a response. A weak rubric is a leading cause of low judge agreement with humans. A strong rubric is specific, scoped, and tied to evidence the judge can read.
16. Golden set
A fixed, hand-labeled dataset used as a regression-test target. Pre-deployment evals run against the golden set on every prompt or model change. Quality drift on the golden set is the first signal of a regression.
17. Cohen's kappa
A statistical measure of agreement between two raters that corrects for chance. Used to score how well a judge agrees with a human rater. 0.7 or higher is solid. A value near zero means the judge agrees with humans no better than chance, and 0.4 is only moderate agreement.
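The formula is kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. A small worked computation for binary pass/fail labels, showing why chance correction matters:

```python
from collections import Counter

def cohens_kappa(judge_labels, human_labels):
    """Cohen's kappa for two raters over the same items."""
    n = len(judge_labels)
    p_o = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_freq = Counter(judge_labels)
    human_freq = Counter(human_labels)
    # Chance agreement: probability both raters pick the same label independently.
    p_e = sum((judge_freq[c] / n) * (human_freq[c] / n)
              for c in set(judge_labels) | set(human_labels))
    return (p_o - p_e) / (1 - p_e)

# 80% raw agreement, but most items are "pass", so much of it is chance.
judge = ["pass"] * 8 + ["fail"] * 2
human = ["pass"] * 7 + ["fail", "pass", "fail"]
print(cohens_kappa(judge, human))  # ~0.375 despite 80% raw agreement
```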
18. Silent failure
A production failure that does not raise an error or surface in latency dashboards. The system returns an answer, the answer is wrong, no log line flags it. Continuous evaluation is the primary defense against silent failures.
19. Lucky path
A trace that produces the right final answer through wrong tool calls. Output-level scoring marks it correct. Step-level scoring catches it. Lucky paths are the textbook case for why agents need step-level evaluation.
20. Position bias
The tendency of an LLM judge to over-prefer responses that appear earlier in a comparison prompt. A common source of judge calibration error. Tuned judge prompts (often produced by GEPA) correct for it.
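The standard diagnostic is an order-swap check: run each pairwise comparison twice with the responses swapped and see how often the same slot wins. A minimal sketch, assuming a hypothetical `pairwise_judge(a, b)` that returns which response it prefers:

```python
def position_bias_rate(pairs, pairwise_judge):
    """Run each comparison twice with the order swapped.

    pairwise_judge(a, b) -> "first" or "second".
    If the same slot wins in both orderings, the preference followed
    position, not content.
    """
    position_driven = 0
    for a, b in pairs:
        if pairwise_judge(a, b) == pairwise_judge(b, a):
            position_driven += 1
    return position_driven / len(pairs)
```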
21. Length bias
The tendency of an LLM judge to score longer responses higher regardless of correctness. Another common calibration gap closed by judge optimization.
22. Cascade eval
A pipeline that runs cheap evals first (scorers, regex, format checks) and only escalates to expensive evals (LLM-as-Judge, agentic judge) on failures or sampled traces. Standard pattern for high-volume production systems.
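A cascade in miniature: deterministic scorers run on everything, the expensive judge only on failures or a sampled slice. The threshold logic and sample rate below are illustrative.

```python
import random

def cascade_eval(trace, cheap_scorers, llm_judge, sample_rate=0.05):
    """Run deterministic scorers first; escalate to the LLM judge only when needed."""
    scores = {name: fn(trace["final_response"]) for name, fn in cheap_scorers.items()}
    failed_cheap = any(score == 0.0 for score in scores.values())

    # Escalate on a cheap-check failure, or on a small random sample for coverage.
    if failed_cheap or random.random() < sample_rate:
        scores["llm_judge"] = llm_judge(trace)
    return scores
```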
23. Replay
Re-running a stored production trace against a new model, prompt, or judge to see how the system would have behaved. Replay is the cheapest way to test a model swap without exposing real users to it.
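A replay loop in miniature: stored production inputs re-run against the candidate system, with old and new outputs scored by the same judge. `run_system` and `judge` are stand-ins for your own stack.

```python
def replay(stored_traces, run_system, judge):
    """Re-run stored production inputs against a new model or prompt and re-score."""
    results = []
    for trace in stored_traces:
        new_response = run_system(trace["prompt"])  # candidate system, same input
        results.append({
            "trace_id": trace["trace_id"],
            "old_score": judge(trace["prompt"], trace["final_response"]),
            "new_score": judge(trace["prompt"], new_response),
        })
    regressions = [r for r in results if r["new_score"] < r["old_score"]]
    return results, regressions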
24. Pre-deployment eval
An automated eval suite that runs on every PR before merge. Blocks the change if a score drops below a threshold. The cheapest place in the pipeline to catch a regression.
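The gate itself is small: run the suite, compare against thresholds, exit nonzero so CI blocks the merge. A sketch assuming a `run_suite` callable that returns per-metric scores; the metric names and thresholds are illustrative.

```python
import sys

THRESHOLDS = {"accuracy": 0.90, "groundedness": 0.85}  # illustrative thresholds

def predeploy_gate(run_suite) -> int:
    """Return a nonzero exit code if any metric drops below its threshold."""
    scores = run_suite()  # e.g. {"accuracy": 0.93, "groundedness": 0.81}
    failures = {m: s for m, s in scores.items() if s < THRESHOLDS.get(m, 0.0)}
    for metric, score in failures.items():
        print(f"FAIL {metric}: {score:.2f} < {THRESHOLDS[metric]:.2f}")
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(predeploy_gate(lambda: {"accuracy": 0.93, "groundedness": 0.81}))
```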
25. Version anchoring
The discipline of attaching a model hash, prompt hash, judge hash, and evaluation-space ID to every score. Without version anchoring, a quality improvement on Tuesday cannot be reproduced on Friday. With it, every regression can be traced back to a specific upstream change.
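A minimal sketch of what an anchored score record can look like. The field names and example values are ours, not a Stratix format.

```python
import hashlib
import json
from datetime import datetime, timezone

def _hash(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()[:12]

def anchored_score(score: float, model_id: str, prompt: str, judge_prompt: str,
                   eval_space_id: str) -> dict:
    """Attach model, prompt, judge, and evaluation-space identifiers to a score."""
    return {
        "score": score,
        "model_hash": _hash(model_id),
        "prompt_hash": _hash(prompt),
        "judge_hash": _hash(judge_prompt),
        "eval_space_id": eval_space_id,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }

record = anchored_score(0.92, "model-2026-01-checkpoint", "You are a support agent...",
                        "Grade the answer 0-2...", "space_support_emails")
print(json.dumps(record, indent=2))
```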
Further reading
What is continuous evaluation? A working definition for production AI teams
Judge optimization with GEPA: tuning evaluation prompts at scale
Step-level evaluation versus output-level evaluation for agent traces
Open the Stratix Learning Hub
Every term in this glossary has a runnable example, a worked judge prompt, or a labeled dataset in the Stratix Learning Hub. Open the Hub here.
Key Takeaways
If a vendor says they support evaluation, ask which generation of judge architecture they ship: LLM-as-Judge (Gen 1), Agent-as-Judge (Gen 2), Agentic Judge (Gen 3), or Deliberation Panel (Gen 4). Each has a different accuracy ceiling.
Cohen's kappa above 0.7 is solid agreement between a judge and a human rater. A value near zero is a coin flip dressed up as a number, and 0.4 is only moderate.
Silent failures, lucky paths, and rerolls are the three failure modes output-level scoring cannot see. Step-level scoring is the standard fix.
Version anchoring (model hash, prompt hash, judge hash) is the discipline that makes any continuous evaluation work reproducible.
Frequently Asked Questions
What is a Deliberation Panel?
A Gen 4 evaluation architecture in which multiple judges (often heterogeneous models) debate and converge on a verdict. Accuracy ceiling: 96 to 98 percent agreement with humans. Currently a Stratix-only capability.
What is the difference between a scorer and a judge?
A scorer is a deterministic grader: regex match, JSON-schema validator, exact match, BLEU, ROUGE. A judge is a model that grades open-ended quality against a rubric. Scorers are cheap and reliable. Judges handle the open-ended cases scorers cannot. Most production stacks use both.
What is a golden set?
A fixed, hand-labeled dataset used as a regression-test target. Pre-deployment evals run against the golden set on every prompt or model change. Quality drift on the golden set is the first signal of a regression.
What is version anchoring and why does it matter?
Version anchoring is the discipline of attaching a model hash, prompt hash, judge hash, and evaluation-space ID to every score. Without it, a quality improvement on Tuesday cannot be reproduced on Friday. With it, every regression can be traced back to a specific upstream change.
Methodology
This glossary aggregates terms in active use in 2026 across LayerLens, Stratix evaluation spaces, and the broader LLM evaluation literature. Definitions are tightened to match how working ML teams use the terms, not the marketing usage that sometimes drifts from operational meaning.
Open the Stratix Learning Hub to run the templates and worked examples referenced in this post against your own data.