Judge Optimization with GEPA: How to Tune LLM Evaluation Prompts at Scale

Author: The LayerLens Team

Published by the LayerLens team. LayerLens is continuous evaluation infrastructure for AI. Stratix is the evaluation engine: 200+ models, agentic benchmarks, judge optimization, and audit-ready comparisons across vendors.

TL;DR

  • Judge optimization tunes an LLM judge's evaluation prompt so its scores agree more closely with human labels.

  • GEPA (Genetic Evolutionary Prompt Adaptation) treats the judge prompt as a genome and evolves it over generations against a labeled fitness set.

  • A typical GEPA pass lifts judge-human agreement by 8 to 20 points without changing the judge model.

  • The high-impact triggers are launch, model swap, monthly drift cadence, and label-set expansion.

  • Cohen's kappa is the right starting metric for ordinal judge scoring. 0.7 is solid. 0.4 means much of the apparent agreement is chance dressed up as a number.

What is judge optimization?

Judge optimization is the process of automatically improving an LLM judge's evaluation prompt so that its scores agree more closely with human labels. The judge stays the same model. The prompt the judge runs gets tuned, often dramatically.

A typical Gen 1 LLM-as-Judge launches with an off-the-shelf rubric prompt and lands somewhere around 70 percent agreement with a human grader. After judge optimization, the same model on the same task can land at 85 to 92 percent agreement on the same labeled set. The model did not get smarter. The instructions did.

This is the cheapest accuracy lift available in any continuous evaluation stack. Most teams skip it because there is no obvious place to put a prompt-tuning loop in a production pipeline. The Stratix Learning Hub provides one.

What is GEPA?

GEPA stands for Genetic Evolutionary Prompt Adaptation. It is a prompt-optimization algorithm that treats the judge prompt as a genome and evolves it over generations against a labeled fitness set.

The full loop:

1. Start with a seed judge prompt and a labeled set of input-output-grade triples (a few hundred is usually enough).

2. Generate a population of candidate prompts by mutating the seed (paraphrases, added rubric clauses, reordered criteria, removed clauses).

3. Score every candidate by running it as a judge against the labeled set and measuring agreement with the human labels.

4. Keep the top performers, breed them by combining their best clauses, and mutate again.

5. Stop when agreement stops improving across a generation, or when the budget is exhausted.

The output is a prompt that scores higher on the held-out set than the seed and is usually shorter and more specific.
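
A minimal, self-contained sketch of that loop in Python. The fitness function here is a toy stand-in: a real run executes each candidate prompt as a judge over the labeled set and measures agreement with the human labels. The seed clauses, mutation pool, and `IDEAL` target are all illustrative, not from any shipped GEPA implementation.

```python
import random

random.seed(0)

SEED = [
    "Score the response from 1 to 5 on factual correctness.",
    "Judge only the final response.",
]
CLAUSE_POOL = [
    "Cite the sentence that justifies the score.",
    "A 5 requires zero unsupported claims.",
    "Do not reward length or verbosity.",
    "Ignore tone; score correctness only.",
    "Penalize answers that contradict the provided context.",
]
IDEAL = set(SEED[:1]) | set(CLAUSE_POOL[:3])  # hidden target for the toy fitness

def fitness(clauses):
    """Toy agreement score: overlap with IDEAL minus a length penalty.
    Replace with judge-vs-human agreement on the labeled set."""
    return len(set(clauses) & IDEAL) - 0.1 * len(clauses)

def mutate(clauses):
    """Stand-in for paraphrase mutations: add, drop, or reorder a clause."""
    clauses = list(clauses)
    op = random.choice(["add", "drop", "reorder"])
    if op == "add":
        clauses.append(random.choice(CLAUSE_POOL))
    elif op == "drop" and len(clauses) > 1:
        clauses.pop(random.randrange(len(clauses)))
    else:
        random.shuffle(clauses)
    return clauses

def crossover(a, b):
    """Breed two parents by splicing their clause lists, deduplicated."""
    return list(dict.fromkeys(a[: len(a) // 2 + 1] + b[len(b) // 2 :]))

def gepa(generations=8, population=30, survivors=6):
    pop = [mutate(SEED) for _ in range(population)]
    best, best_fit = list(SEED), fitness(SEED)
    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        if fitness(ranked[0]) <= best_fit:
            break  # step 5: agreement stopped improving this generation
        best, best_fit = ranked[0], fitness(ranked[0])
        elite = ranked[:survivors]
        pop = [mutate(crossover(*random.sample(elite, 2)))
               for _ in range(population)]
    return best, best_fit

tuned, fit = gepa()
print(f"fitness {fit:.2f}")
print("\n".join(tuned))
```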

Why does an LLM judge prompt need optimization at all?

A judge prompt is doing three things at once: defining the rubric, defining the evidence the judge can use, and defining the scoring scale. Each of those is a place where a hand-written prompt can leak ambiguity.

Common ways a default judge prompt fails:

Rubric ambiguity. "Rate the response from 1 to 5 on quality." Quality is unscoped. The judge averages across helpfulness, correctness, tone, and length, all weighted differently each time the judge runs. Variance is high. Agreement with humans is low.

Evidence underspecification. A judge that can re-read the system prompt and the tool calls scores differently than a judge that only sees the final response. A default prompt rarely says which.

Scale collapse. A 1-to-5 scale where the judge issues 4s and 5s 88 percent of the time loses resolution. The scale is real but the judge is not using it.

Position bias and length bias. LLM judges systematically prefer longer answers and answers listed earlier in a comparison. Default prompts do not correct for either.

A tuned prompt closes each of these gaps with explicit clauses. Hand-tuning works for one judge. GEPA scales.
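
For a concrete picture of what "explicit clauses" means, here is the kind of clause-level difference a pass tends to produce. Both prompts are illustrative, not the output of an actual run.

```text
# Seed prompt
Rate the response from 1 to 5 on quality.

# Tuned prompt (illustrative)
Score factual correctness only, from 1 to 5. Judge the final response;
ignore the system prompt and tool calls. A 5 requires zero claims
unsupported by the provided context; each unsupported claim costs one
point. Do not reward length, formatting, or tone. State the sentence
that determined the score before emitting the number.
```

Each clause maps to one of the failure modes above: a scoped rubric, an explicit evidence boundary, a deduction rule that forces the judge to use the full scale, and a direct instruction against length bias.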

When should a team run GEPA?

The high-impact triggers are predictable.

At launch. A new judge ships with a seed prompt that is, by definition, untuned. The first GEPA pass against a small labeled set will move the score floor by 8 to 20 points.

After a model swap. When the underlying judge model changes (Sonnet to Opus, Gemini to GPT-5.5, open to closed source), the prompt that worked on the old model is rarely optimal on the new one. The instruction-following profile is different.

On a 30-day clock. Judge drift is real. A prompt tuned in March against a March-labeled set degrades by June as production traffic shifts. A monthly cadence is cheap insurance.

After a label-set expansion. Every time human labels are added (new categories, edge cases, regulatory criteria), the labeled set has shifted. The prompt should re-tune against it.

How does GEPA score candidate prompts?

The fitness function is the choke point. GEPA measures agreement with human labels per example, then aggregates. The choice of aggregation matters.

Cohen's kappa is the right starting metric for ordinal scoring. It corrects for agreement that would happen by chance. A kappa of 0.7 is solid agreement. 0.4 means much of the apparent agreement is chance dressed up as a number.
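
Formally, kappa = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is the agreement expected by chance. Computing it is one call with scikit-learn; quadratic weights are the usual choice for an ordinal 1-to-5 scale, since they penalize a 1-versus-5 disagreement more than a 4-versus-5 one. The label arrays below are made up.

```python
from sklearn.metrics import cohen_kappa_score

human = [5, 4, 2, 5, 3, 1, 4, 5]  # human grades on the labeled set
judge = [5, 5, 2, 4, 3, 2, 4, 5]  # the candidate prompt's grades

# Quadratic weighting treats the 1-to-5 scale as ordinal: near-misses
# count less against the judge than far-misses.
kappa = cohen_kappa_score(human, judge, weights="quadratic")
print(f"kappa = {kappa:.2f}")  # 0.7+ is solid; near 0 is chance
```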

Per-class precision and recall matter when the labeled distribution is imbalanced (most production traffic is fine, rare cases are the failures). A judge that hits 90 percent overall agreement and zero percent recall on the failure class is useless in production.

Tail-case agreement. GEPA can be configured to weight rare failure cases higher in the fitness function. This is usually what teams want. The cost of a missed hallucination is asymmetric.
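
One way to encode that asymmetry is a fitness function that blends kappa with recall on the failure class. The blend weight is a knob, not a GEPA constant, and this sketch assumes grades of 1 or 2 mark failures.

```python
from sklearn.metrics import cohen_kappa_score, recall_score

def fitness(human, judge, failure_grades=(1, 2), blend=0.5):
    """Blend overall agreement with recall on the rare failure class."""
    kappa = cohen_kappa_score(human, judge, weights="quadratic")
    human_fail = [g in failure_grades for g in human]
    judge_fail = [g in failure_grades for g in judge]
    fail_recall = recall_score(human_fail, judge_fail)  # failures caught
    return (1 - blend) * kappa + blend * fail_recall
```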

The Stratix Learning Hub includes worked GEPA runs on three labeled sets (RAG correctness, agent tool use, summarization faithfulness) so teams can see the loop end to end before wiring it on their own data.

What does a GEPA run look like in practice?

A typical GEPA run does 5 to 10 generations of 20 to 50 candidate prompts each. That is 100 to 500 candidate prompts, each scored by running the judge over a labeled set of 200 to 500 examples. With judge calls issued in parallel, a full pass takes minutes, not hours, on a Gen 1 LLM-as-Judge.
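
The arithmetic behind that budget, spelled out at the upper end of those ranges. Implementations often prune weak candidates on a subsample before full scoring, which cuts the call count well below this worst case.

```python
generations, candidates_per_gen = 10, 50   # upper end of a typical run
labeled_examples = 500

candidate_evals = generations * candidates_per_gen    # 500 candidate prompts
judge_calls = candidate_evals * labeled_examples      # 250,000 judge calls

# The budget knobs are population size and label count; subsample pruning
# divides judge_calls before the full-set pass.
print(f"{candidate_evals} candidates, {judge_calls:,} judge calls worst case")
```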

The recurring effort (monthly re-tune) is small compared to the alternative, which is shipping with an untuned judge and discovering the calibration gap from a customer escalation.

How does GEPA fit into a wider continuous evaluation stack?

GEPA is one component. The full stack runs in this order:

1. Define the task and label a few hundred examples.

2. Pick a judge model and write a seed prompt.

3. Run GEPA against the labeled set to tune the prompt.

4. Wire the tuned judge against production traces with step-level scoring.

5. Anchor every score to a model version, prompt hash, and judge version.

6. Re-run GEPA on a 30-day cadence and on every model swap.

The output is a judge that holds its calibration over time, fails loudly when the underlying system shifts, and produces scores a team can ship a release on.
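
Step 5 is the part teams most often skip. A minimal shape for an anchored score record, assuming nothing about the Stratix schema; the field names are illustrative.

```python
import hashlib
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class JudgeScore:
    trace_id: str
    step: int            # step-level scoring: which step of the trace
    score: int
    model_version: str   # the system under evaluation
    judge_version: str   # the judge model
    prompt_hash: str     # hash of the tuned judge prompt
    scored_at: float

def anchor(trace_id, step, score, model_version, judge_version, prompt):
    """Attach every score to the exact prompt and versions that produced it."""
    return JudgeScore(
        trace_id=trace_id,
        step=step,
        score=score,
        model_version=model_version,
        judge_version=judge_version,
        prompt_hash=hashlib.sha256(prompt.encode()).hexdigest()[:12],
        scored_at=time.time(),
    )

record = anchor("trace-42", 3, 4, "app-v1.8", "judge-model-v2", "tuned prompt text")
print(json.dumps(asdict(record), indent=2))
```

With the prompt hash in every record, a score change after a re-tune is attributable to the prompt rather than silently blended into the trend line.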

Further reading

Run GEPA in the Stratix Learning Hub

The Hub has three runnable GEPA templates with labeled data, seed prompts, and reference scores so a team can see the optimization loop produce a real lift before wiring it on production data. Open the Hub here.

Key Takeaways

  • Judge prompts ship with rubric ambiguity, evidence underspecification, scale collapse, and position or length bias. GEPA closes each gap with explicit clauses learned from data.

  • Run GEPA at launch, after every judge model swap, on a monthly cadence, and after any label-set expansion.

  • Weight rare failure cases higher in the GEPA fitness function. The cost of a missed hallucination is asymmetric.

  • GEPA is one component of a continuous evaluation stack. Calibrated judges plus step-level scoring plus version anchoring is the working triple.

Frequently Asked Questions

What does GEPA stand for?

GEPA stands for Genetic Evolutionary Prompt Adaptation. It is a prompt-optimization algorithm that treats the judge prompt as a genome and evolves it over generations against a labeled fitness set.

How much labeled data does GEPA need?

A few hundred labeled input-output-grade triples is usually enough for the first pass. Additional labels on rare failure cases improve the algorithm's coverage on the long tail.

How often should a team re-run GEPA?

A 30-day cadence is the most common pattern. Also re-run after any judge-model swap, after a labeled-set expansion, and any time the underlying production traffic distribution shifts. Judge drift is real and a monthly pass is cheap insurance.

Does GEPA change the judge model?

No. GEPA only changes the prompt the judge runs. The model stays the same. The lift comes entirely from better instructions.

Methodology

Lift figures (8 to 20 points of Cohen's kappa improvement) reflect aggregated results from documented GEPA passes across the Stratix Learning Hub's three reference labeled sets (RAG correctness, agent tool use, summarization faithfulness). Individual results vary with seed-prompt quality, label-set size, and task difficulty.

Open the Stratix Learning Hub to run the templates and worked examples referenced in this post against your own data.