Part II: Discovery and Design
Chapter 9

Requirements for Probabilistic Products

9.2 Eval-First PRDs

Objective: Learn to write product requirements documents that prioritize evaluation criteria before feature implementation.

"If you cannot measure it, you cannot ship it. For AI features, measurement must be designed in, not bolted on."

AI Product Development Principles
The PRD Time Paradox

You spent 3 weeks writing the perfect eval-first PRD with detailed metrics. The engineer built a different feature because "the PRD didn't capture what the user actually needed." The metrics passed. The users hated it. Ship it anyway?

Eval-first PRDs (Product Requirements Documents) invert the traditional PRD structure. Instead of starting with features and hoping evaluation follows, eval-first PRDs start with evaluation criteria and let features emerge from the requirements.

The Eval-First PRD Structure

Eval-First PRD Sections

1. User Problem and Success Definition

What user problem are we solving? What does success look like to users?

2. Evaluation Criteria (First!)

How will we measure success? What metrics, thresholds, and benchmarks define success?

3. Evaluation Methodology

How will we evaluate against our criteria? What datasets, human evaluation, and automated metrics?

4. Feature Requirements

What features emerge from the evaluation criteria? How do they map to measurable outcomes?

5. Rollout and Monitoring

How will we phase rollout? What production monitoring is required?
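The rollout gate in section 5 can be sketched as a simple rule: advance the rollout percentage only while every monitored metric stays at or above its threshold. This is a minimal illustration, not a prescribed implementation; the function name, stage percentages, and metric keys are hypothetical.

```python
def next_rollout_stage(current_pct, metrics, thresholds, stages=(1, 5, 25, 100)):
    """Advance a phased rollout one stage, but only if every monitored
    metric meets its threshold. Stage percentages are illustrative."""
    # Hold at the current stage if any metric is below its floor.
    if any(metrics[name] < floor for name, floor in thresholds.items()):
        return current_pct
    # Otherwise move to the next configured stage.
    for stage in stages:
        if stage > current_pct:
            return stage
    return current_pct  # already fully rolled out

# Example: at 5% rollout, quality above threshold -> advance to 25%.
print(next_rollout_stage(5, {"ndcg@10": 0.86}, {"ndcg@10": 0.85}))
```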

Why Evaluation First?

Starting with evaluation ensures agreement on success: the team aligns on metrics before building. It prevents scope creep, because every feature must tie to a measurable outcome. It enables iteration, because clear metrics allow rapid improvement. And it reduces risk, because teams cannot ship what they cannot measure.

Traditional vs Eval-First PRD

Traditional PRD: A traditional PRD names a feature ("AI-powered search"), gives a description ("improve search results using AI"), and defines success vaguely ("better search results") with no measurable criteria.

Eval-First PRD: An eval-first PRD defines the problem clearly (users cannot find relevant documents), specifies the metric upfront (NDCG@10 > 0.85), establishes a baseline (current search NDCG@10 of 0.72), sets a target (raise NDCG@10 from 0.72 to at least 0.85, roughly an 18% relative improvement), and describes the evaluation (automated benchmark on held-out queries).
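NDCG@10, the metric in this example, can be computed directly from graded relevance judgments of the top results. The following sketch uses the standard formulation (log2 position discount, normalized by the ideal ordering):

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k results."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """NDCG@k: DCG of the ranking, normalized by the ideal (sorted) DCG."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# A perfectly ordered ranking scores 1.0; misordered rankings score less.
print(ndcg_at_k([3, 2, 1]))     # perfect ordering
print(ndcg_at_k([0, 3, 2]))     # relevant results pushed down
```

In practice the PRD's benchmark would average this score over the held-out query set.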

Writing Evaluation Criteria

Good Evaluation Criteria Are:

Good evaluation criteria share five properties:

Measurable: criteria can be quantified objectively, without subjective judgment.

Specific: criteria have precise thresholds, not vague goals open to interpretation.

Actionable: teams know exactly what to optimize to meet the criteria.

Comprehensive: criteria cover failure modes as well as the happy path.

Feasible: criteria are achievable with available resources and constraints.
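Criteria with these properties are concrete enough to encode directly. A minimal sketch, assuming a hypothetical EvalCriterion record populated with the baseline and target from the search example:

```python
from dataclasses import dataclass

@dataclass
class EvalCriterion:
    name: str        # e.g. "NDCG@10"
    baseline: float  # current measured value
    target: float    # threshold that defines success

    def passes(self, measured: float) -> bool:
        """A measurable, specific check: did we meet the threshold?"""
        return measured >= self.target

criterion = EvalCriterion(name="NDCG@10", baseline=0.72, target=0.85)
print(criterion.passes(0.86))  # meets target
print(criterion.passes(0.80))  # above baseline but below target
```

Encoding criteria this way makes the "specific" property enforceable: a PRD field that cannot be expressed as a threshold check is probably still a vague goal.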

Eval-First in Practice

Before writing any eval-first PRD, define how you will measure whether the PRD process itself is working. A micro-eval for PRD quality tracks three things: metric attainment rate (did features that met eval criteria succeed in production?), false positive rate (did features pass eval but fail in production?), and iteration velocity (how quickly can you iterate on eval results?). One team's insight from running this micro-eval: features built from eval-first PRDs had 40% higher production success rates than features built from traditional PRDs, justifying the upfront eval investment.
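The first two micro-eval rates can be computed from a log of shipped features. A sketch assuming a hypothetical record schema with passed_eval and succeeded_in_prod flags (iteration velocity would need timestamps, so it is omitted here):

```python
def prd_micro_eval(features):
    """Compute PRD micro-eval rates over features that passed eval.
    Each record is a dict: {"passed_eval": bool, "succeeded_in_prod": bool}.
    Schema is hypothetical, for illustration only."""
    passed = [f for f in features if f["passed_eval"]]
    if not passed:
        return {"metric_attainment_rate": 0.0, "false_positive_rate": 0.0}
    attained = sum(f["succeeded_in_prod"] for f in passed)
    return {
        # Of features that met eval criteria, how many succeeded in production?
        "metric_attainment_rate": attained / len(passed),
        # Of features that met eval criteria, how many still failed in production?
        "false_positive_rate": (len(passed) - attained) / len(passed),
    }

log = [
    {"passed_eval": True, "succeeded_in_prod": True},
    {"passed_eval": True, "succeeded_in_prod": False},
    {"passed_eval": False, "succeeded_in_prod": False},
]
print(prd_micro_eval(log))
```

A rising false positive rate is the signal to revisit the eval criteria themselves: the benchmark no longer predicts production success.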

Evaluation Methodology

Specify how evaluation will be conducted by defining four components:

Automated metrics: what can be measured automatically, without human intervention.

Human evaluation: what requires human judgment and cannot be captured by automated means.

Test datasets: what data evaluation will use, including how datasets are curated and maintained.

Statistical validity: what confidence levels results must meet to be considered meaningful.
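Statistical validity is straightforward to operationalize: report a confidence interval around each metric rather than a point estimate. A sketch using a percentile bootstrap over per-query scores (the resample count and alpha are illustrative defaults, not values from the text):

```python
import random

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-query scores."""
    rng = random.Random(seed)  # fixed seed for reproducible reports
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Example: per-query NDCG scores; mean is 0.7, CI should be tight around it.
scores = [0.8] * 50 + [0.6] * 50
lo, hi = bootstrap_ci(scores)
print(lo, hi)
```

If the interval's lower bound clears the PRD target, the result is meaningful; if the target falls inside the interval, more evaluation data is needed before shipping.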

PRD Template

Eval-First PRD Template
# Product Name

## User Problem
[What problem are we solving?]

## Success Definition
[What does success look like to users?]

## Evaluation Criteria (Priority Order)
1. Metric: [name]
   - Target: [threshold]
   - Baseline: [current value]
   - Measurement: [how to measure]

2. Metric: [name]
   - Target: [threshold]
   [...]

## Evaluation Methodology
[How will evaluation be conducted?]

## Feature Requirements
[What features support these metrics?]

## Rollout Criteria
[What must pass before launch?]

## Monitoring Plan
[What metrics will we monitor in production?]

What's Next?

Next, we explore LLM-as-Judge for Requirements: how to use AI itself to evaluate AI outputs.