Part II: Discovery and Design
Chapter 9

Requirements for Probabilistic Products

9.3 LLM-as-Judge for Requirements

Objective: Learn to use LLM-as-Judge techniques for evaluating AI outputs, understanding both the power and limitations of this approach.

"Using AI to evaluate AI is like using a scale to calibrate a scale. It works, but you need reference weights."

LLM-as-Judge uses AI models to evaluate AI outputs. This technique enables scalable evaluation but requires careful design to ensure reliability.

Why LLM-as-Judge

Traditional evaluation methods each have limitations. Human evaluation captures nuanced quality but is expensive and slow. Automated metrics are fast and cheap but may miss nuance. Rule-based evaluation handles only easily specified criteria and cannot cover complex or subjective aspects of quality.

LLM-as-Judge offers a middle path: flexible, scalable evaluation using AI that can understand nuanced quality.

The LLM-as-Judge Approach

The Circular Problem

You're using AI to evaluate AI. It's like asking the chef to taste their own cooking and rate it. The chef might genuinely believe it's delicious. The diners are still chewing.

LLM-as-Judge Architecture

1. Define Evaluation Criteria

What aspects of output quality matter? E.g., relevance, accuracy, coherence, helpfulness.

2. Create Reference Outputs

Examples of good and bad outputs for calibration.

3. Design Judge Prompt

Instructions for how the judge model should evaluate.

4. Run Evaluation

Apply judge to outputs being evaluated.

5. Calibrate and Validate

Compare judge to human evaluation and refine.
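
These five steps can be expressed as a small pipeline. Everything below is illustrative: `call_judge_model` is a hypothetical stand-in for whatever LLM API the team actually uses, and the criteria and references are placeholders.

```python
import json

# Step 1: define evaluation criteria.
CRITERIA = ["relevance", "accuracy", "coherence", "helpfulness"]

# Step 2: reference outputs for calibration.
REFERENCE_GOOD = "Discussed Q3 roadmap. Action: Sarah to finalize budget by Friday."
REFERENCE_BAD = "Meeting happened. Topics discussed."

def build_judge_prompt(output: str) -> str:
    """Step 3: assemble instructions, references, and output format."""
    return (
        "Score the output 1-5 on: " + ", ".join(CRITERIA) + ".\n"
        f"GOOD reference: {REFERENCE_GOOD}\n"
        f"BAD reference: {REFERENCE_BAD}\n"
        "Explain your reasoning first, then return JSON "
        '{"<criterion>": <score>, ...}.\n'
        f"OUTPUT TO EVALUATE:\n{output}"
    )

def call_judge_model(prompt: str) -> str:
    """Hypothetical model call; replace with a real LLM client."""
    return json.dumps({criterion: 3 for criterion in CRITERIA})

def run_evaluation(outputs: list[str]) -> list[dict]:
    """Step 4: apply the judge to each output. Step 5, calibration
    against human scores, happens downstream on these results."""
    return [json.loads(call_judge_model(build_judge_prompt(o)))
            for o in outputs]
```

The stub judge returns a constant score so the pipeline runs end to end; in practice only `call_judge_model` changes.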

Designing the Judge Prompt

Effective Judge Prompt Elements

Effective judge prompts include several key elements. Clear criteria specify what the judge should look for when evaluating outputs. Scoring rubric defines how to score each criterion, providing a consistent scale for assessment. Reference examples show what good and bad outputs look like, helping the judge calibrate its evaluations. Chain-of-thought instructions ask for reasoning before scoring, which improves evaluation quality by forcing the judge to articulate its thinking. Output format specifies the exact structure of the evaluation output, ensuring consistency and parseability.

Judge Prompt Example
You are evaluating AI-generated meeting summaries.

CRITERIA:
1. Actionability: Are action items clearly identified?
2. Accuracy: Does the summary reflect what was said?
3. Completeness: Are key points covered?

SCORING: Rate each criterion 1-5

REFERENCE GOOD:
"Discussed Q3 roadmap. Action: Sarah to finalize 
budget by Friday. Decision: Launch date set for Oct 15."

REFERENCE BAD:
"Meeting happened. Topics discussed."

OUTPUT FORMAT:
{
  "actionability": {"score": 1-5, "reasoning": "..."},
  "accuracy": {"score": 1-5, "reasoning": "..."},
  "completeness": {"score": 1-5, "reasoning": "..."}
}
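
Because the judge returns structured JSON, the harness consuming it should validate the response before trusting any score. A minimal validator for the format above (the criterion names come from the example; the rest is a sketch):

```python
import json

EXPECTED_CRITERIA = ("actionability", "accuracy", "completeness")

def parse_judge_output(raw: str) -> dict:
    """Parse a judge response and reject malformed or out-of-range scores."""
    data = json.loads(raw)
    for criterion in EXPECTED_CRITERIA:
        entry = data.get(criterion)
        if not isinstance(entry, dict):
            raise ValueError(f"missing criterion: {criterion}")
        score = entry.get("score")
        if not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"{criterion}: score must be an integer 1-5")
        if not entry.get("reasoning"):
            raise ValueError(f"{criterion}: reasoning is required")
    return data
```

Rejecting malformed responses (rather than silently coercing them) also gives a useful signal: a rising parse-failure rate often means the judge prompt has drifted.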

Calibration and Validation

LLM-as-Judge must be calibrated against human judgment through several methods. Correlation analysis examines how well the judge agrees with humans on the same outputs. Calibration curves verify that higher scores actually correspond to higher-quality outputs. Bias detection checks whether the judge systematically favors certain output styles, which would indicate a reliability problem.

LLM-as-Judge Limitations

LLM-as-Judge has several known limitations that teams should monitor. Length bias occurs when the judge may favor longer outputs even when length does not correlate with quality. Position bias occurs when the judge may favor first or last options in comparisons. Self-preference occurs when the judge may favor outputs similar to its own writing style. Calibration drift means judge quality may vary over time as model behavior changes.

Eval-First in Practice

Before deploying LLM-as-Judge for production evaluation, define how you will measure whether the judge itself is reliable. A micro-eval for an LLM judge tracks: correlation with human evaluation (should be above 0.8), consistency rate (the same input gets the same score), and calibration accuracy (does a higher score predict better outcomes?). One team's eval-first insight: their LLM judge initially had a 0.72 correlation with human evaluators, but after they added reference examples and chain-of-thought reasoning, correlation improved to 0.89. Without measuring this, they would have shipped an unreliable evaluation system.
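
Consistency rate can be measured directly: score the same input several times and count how often all trials agree. A sketch, where `judge` is any callable returning a score (the stub judges in the test harness are hypothetical):

```python
def consistency_rate(judge, inputs, trials: int = 3) -> float:
    """Fraction of inputs for which repeated judging yields
    identical scores across all trials."""
    consistent = 0
    for item in inputs:
        scores = [judge(item) for _ in range(trials)]
        if len(set(scores)) == 1:
            consistent += 1
    return consistent / len(inputs)
```

In production the judge would be called at nonzero temperature, so a consistency rate below 1.0 is expected; the micro-eval's job is to flag when it degrades.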

Best Practices

Teams should follow several best practices when using LLM-as-Judge. Use LLM-as-Judge as a complement to human evaluation rather than a complete replacement, maintaining human oversight for critical decisions. Calibrate regularly by validating judge outputs against human judgment frequently to ensure reliability. Ensemble judges by using multiple judges for critical evaluations to reduce individual bias. Report confidence by including uncertainty estimates in judge outputs so downstream decisions can account for reliability.
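
The ensemble and confidence-reporting practices combine naturally: take the median score across several judges and report the spread as a rough confidence signal. A sketch, where each judge is a callable returning a numeric score:

```python
from statistics import median, pstdev

def ensemble_score(judges, output) -> dict:
    """Median score across several judges, with spread reported
    as a rough confidence signal for downstream consumers."""
    scores = [judge(output) for judge in judges]
    return {
        "score": median(scores),
        "spread": pstdev(scores),  # low spread means the judges agree
        "individual": scores,
    }
```

A downstream consumer might, for example, route high-spread evaluations to a human reviewer rather than trusting the median, which keeps humans in the loop for exactly the cases where the judges disagree.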

What's Next?

Next, we explore Acceptance Criteria for AI, understanding how to define and verify acceptance criteria for AI features.