"Write the test first. Watch it fail. Write the code. Watch it pass. This is not just a testing strategy. It is a design philosophy."
A Developer Who Treats Requirements as Testable Hypotheses
Write Evals Before Code
Eval-driven development inverts the traditional workflow. Instead of building first and testing later, you define what success looks like before you start building.
The Eval-First Workflow
The sequence is:

1. Define the eval (what does good look like?)
2. Run the eval against a baseline (what do we have now?)
3. Implement the feature (build toward the eval)
4. Run the eval again (did we get there?)
5. Iterate (keep improving until the eval passes)
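The sequence above can be sketched as a small loop. This is a minimal illustration, not a framework: the test cases, `run_eval`, and the threshold are all hypothetical, and the toy arithmetic system stands in for whatever you are building.

```python
# Minimal sketch of the eval-first loop. The eval is defined before any
# implementation exists; `generate` stands in for the system under test.

TEST_CASES = [  # defined first: what "good" looks like
    {"input": "2+2", "expected": "4"},
    {"input": "3*3", "expected": "9"},
]
THRESHOLD = 1.0  # required pass rate for this eval

def run_eval(generate) -> float:
    """Score a candidate system against the fixed test cases."""
    passed = sum(1 for case in TEST_CASES
                 if generate(case["input"]) == case["expected"])
    return passed / len(TEST_CASES)

# Step 2: baseline -- the current system, before building the feature.
baseline = run_eval(lambda q: "unknown")      # fails the eval

# Steps 3-4: implement toward the eval, then re-run it.
candidate = run_eval(lambda q: str(eval(q)))  # toy "implementation"

# Step 5: iterate until the eval passes.
done = candidate >= THRESHOLD
```

The key property: `TEST_CASES` and `THRESHOLD` exist before either implementation does, so the baseline and candidate scores are directly comparable.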
Benefits of Eval-First
Eval-first development provides several benefits that sharpen focus and accelerate improvement. Clear success criteria emerge from the process, giving you an exact target to build toward rather than an abstract goal. Baseline comparison becomes possible because you measure improvement from a known starting point, making the value of changes concrete and quantifiable. Focused development results because you avoid building features that do not matter for the eval, concentrating effort on what actually moves the needle. Incremental progress is achievable because the eval tells you when you are done, removing the ambiguity that often extends projects indefinitely.
Red-Green-Refactor for AI
Apply the red-green-refactor pattern from test-driven development to AI systems.
Red: Write a Failing Eval
Start by writing an eval that fails against your current system. The eval defines what you want to achieve. This phase forces you to think concretely about what "good" looks like.
Writing the Red Eval
Writing the red eval involves identifying the behavior you want: specifically, what the system should do that it does not currently do. Create test cases that capture this desired behavior, including both positive cases that should work and negative cases that should not happen. Define success criteria by determining what score or pass rate counts as passing. Run the eval against the current system to confirm it fails, and understand why it fails to clarify what needs to be built. Finally, document the gap between current performance and the success criteria; this gap becomes the target that guides development.
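The steps above can be sketched as code. Everything here is illustrative: the cases, the `must_contain`/`forbidden` fields, and the 90% bar are assumptions standing in for your own criteria.

```python
# Sketch of a "red" eval: it fails against the current system by design.
# The system, the cases, and the success criterion are all hypothetical.

RED_CASES = [
    # Positive case: behavior we want but don't have yet.
    {"input": "refund policy?", "must_contain": "30 days", "forbidden": None},
    # Negative case: behavior that must not happen.
    {"input": "refund policy?", "must_contain": None, "forbidden": "I don't know"},
]
SUCCESS_PASS_RATE = 0.9  # the bar for calling this eval "green"

def current_system(query: str) -> str:
    return "I don't know."   # today's behavior

def score(system) -> float:
    passed = 0
    for case in RED_CASES:
        out = system(case["input"])
        ok = True
        if case["must_contain"] and case["must_contain"] not in out:
            ok = False
        if case["forbidden"] and case["forbidden"] in out:
            ok = False
        passed += ok
    return passed / len(RED_CASES)

baseline = score(current_system)     # run against today's system: red
gap = SUCCESS_PASS_RATE - baseline   # documented gap to close
```

Confirming that `baseline` is below the bar, and understanding why, is the point of the red phase: the documented `gap` becomes the development target.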
Green: Make the Eval Pass
Implement the feature that makes the eval pass. This may involve prompt changes, model changes, retrieval improvements, or architectural changes.
Refactor: Improve Without Breaking
Once the eval passes, refactor to improve the implementation without degrading the eval score. The eval is now a regression guard.
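In code, the refactor phase simply means the green eval keeps running. A pytest-style sketch, with a hypothetical `run_eval` and a recorded baseline score:

```python
# Once green, the eval runs as a regression test: refactors must keep
# the score at or above the recorded baseline. Names and cases are
# illustrative.

RECORDED_BASELINE = 0.92   # score when the eval first went green

def run_eval(system) -> float:
    cases = [("ping", "pong"), ("hello", "world")]
    hits = sum(1 for query, want in cases if system(query) == want)
    return hits / len(cases)

def refactored_system(query: str) -> str:
    # Cleaner implementation; observable behavior must be preserved.
    return {"ping": "pong", "hello": "world"}[query]

def test_no_regression():
    # The eval is now a regression guard for any refactor.
    assert run_eval(refactored_system) >= RECORDED_BASELINE
```

If a refactor drops the score below the recorded baseline, the test fails and the change is rejected, exactly like any other regression test.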
Continuous Evaluation in CI/CD
Integrate eval execution into your CI/CD pipeline to catch regressions before they reach production.
Eval Pipeline Design
CI/CD Eval Pipeline
The CI/CD eval pipeline executes on every code change so that quality is maintained throughout development. The first stage runs the fast regression suite: task evals that complete in under five minutes, failing the build if thresholds are breached. If the fast suite passes, the pipeline runs the full eval suite, including the slower system evals, completing in under thirty minutes with results stored for trend analysis. If the full suite reveals regressions, the pipeline blocks deployment, generates a report for the developer, and files an issue in the tracking system. On merge to main, the pipeline runs the comprehensive eval suite including agent evals, updates the performance dashboard, and generates a delta report comparing results against the previous main branch.
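The staged pipeline above can be sketched as a gate function. The suite names, scores, and thresholds are illustrative; in practice each stage would invoke real eval runners and the report/issue integrations.

```python
# Sketch of the staged pipeline: fast gate on every change, full suite
# next, deployment blocked on regression. Thresholds are hypothetical.

FAST_THRESHOLDS = {"task_accuracy": 0.85}
FULL_THRESHOLDS = {"task_accuracy": 0.85, "system_latency": 0.90}

def failing_evals(scores: dict, thresholds: dict) -> list:
    """Return the names of evals that fell below their threshold."""
    return [name for name, score in scores.items()
            if score < thresholds[name]]

def pipeline(fast_scores: dict, full_scores: dict) -> str:
    # Stage 1: fast regression suite (< 5 min); fail the build on breach.
    if failing_evals(fast_scores, FAST_THRESHOLDS):
        return "build_failed"
    # Stage 2: full eval suite (< 30 min); block deploy on regression
    # (plus: generate a report and file a tracking issue).
    if failing_evals(full_scores, FULL_THRESHOLDS):
        return "deploy_blocked"
    return "deploy_allowed"

result = pipeline({"task_accuracy": 0.88},
                  {"task_accuracy": 0.88, "system_latency": 0.95})
```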
Eval Gates
Eval gates are thresholds that must be met before code can progress through the pipeline, providing quality checkpoints at different stages of development. The fast gate runs on every commit with strict thresholds and requires fast execution to avoid slowing down developers. The quality gate runs before merge with tighter thresholds and comprehensive coverage to ensure changes meet higher standards. The release gate runs before deployment with the highest thresholds and full regression coverage to protect production quality.
Eval Results as Product Metrics
Eval results are product metrics. Track them over time to understand quality trends, measure the impact of changes, and make prioritization decisions.
Eval Metric Dashboards
Build dashboards that visualize eval results and trends to make quality visible and actionable. The dashboard should show current performance by answering how the system is performing right now. It should display trend over time to show whether quality is improving or degrading across releases. A component breakdown helps identify which components are failing and need attention. User impact assessment connects eval failures to user experience, clarifying the real-world consequences of quality issues.
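A minimal sketch of the computation behind such a dashboard, assuming a data shape of one score-per-component record per release (the components and the 0.85 attention threshold are hypothetical):

```python
# Sketch of the dashboard's backing computation: current performance,
# trend across releases, and a per-component breakdown.

history = [  # one entry per release: component -> eval score
    {"retrieval": 0.90, "generation": 0.84},
    {"retrieval": 0.91, "generation": 0.80},
]

current = history[-1]
overall = sum(current.values()) / len(current)          # performance now
trend = {c: history[-1][c] - history[-2][c] for c in current}  # per release
failing = [c for c, s in current.items() if s < 0.85]   # needs attention
```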
Alert on Trends, Not Just Thresholds
Set up alerts for degrading trends even if you have not yet breached thresholds. A 2% decline per week is more alarming than a single 5% breach.
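A trend alert can be as simple as fitting a least-squares slope to recent weekly scores. The 1%-per-week alert rate and the 0.80 floor below are assumptions:

```python
# Sketch of trend-based alerting: fit a least-squares slope to recent
# weekly eval scores and alert on sustained decline, even when no hard
# threshold has been breached. Rates and scores are illustrative.

def weekly_slope(scores: list) -> float:
    """Least-squares slope of score vs. week index."""
    n = len(scores)
    mean_x = (n - 1) / 2
    mean_y = sum(scores) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(scores))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

HARD_FLOOR = 0.80    # absolute threshold: not yet breached below
ALERT_SLOPE = -0.01  # alert on a decline of more than 1% per week

scores = [0.92, 0.90, 0.88, 0.86]   # drifting down, still above the floor
alert = weekly_slope(scores) <= ALERT_SLOPE   # trend is alarming
breached = min(scores) < HARD_FLOOR           # no breach yet
```

Here `alert` fires even though `breached` is false: the decline is caught before any threshold is crossed.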
Eval Infrastructure
Scalable eval infrastructure enables effective eval-driven development. Invest in building and maintaining this infrastructure.
Eval Pipelines and Automation
Automate everything you can to reduce manual overhead and ensure consistency. Test data management should be automated, including creation and refresh of test datasets so they remain current without manual intervention. Eval execution should be automated, running suites on the appropriate schedule or triggers. Result aggregation should automatically collect and analyze results, storing them for trend analysis and reporting. Failure reporting should generate automated alerts and create issues in the tracking system when failures occur.
Versioning Eval Datasets
Treat eval datasets as versioned artifacts, not ad hoc collections. When you change an eval dataset, you change what you are measuring.
Dataset Version Control
Version your eval datasets with the same rigor you apply to code. Track what changed, why it changed, and how the change affects comparability with previous results.
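One lightweight way to do this is content addressing: hash the cases so any change produces a new version id, and record what changed and whether results remain comparable in a manifest. The manifest fields here are illustrative:

```python
# Sketch of dataset versioning: a content hash plus a changelog entry so
# results are comparable only within a dataset version.
import hashlib
import json

def dataset_version(cases: list) -> str:
    """Content-addressed version id: same cases -> same id."""
    blob = json.dumps(cases, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]

cases_v1 = [{"input": "refund policy?", "expected": "30 days"}]
cases_v2 = cases_v1 + [{"input": "warranty?", "expected": "1 year"}]

manifest = {
    "version": dataset_version(cases_v2),
    "previous": dataset_version(cases_v1),
    "change": "added warranty case",
    "comparable_with_previous": False,  # the eval now measures more
}
```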
Regression Test Suites
Build comprehensive regression suites that cover the behaviors that matter most to your product and users. Critical behaviors including task completion, safety, and compliance should be tested in every regression run since failures here cause the most harm. Common paths covering high-frequency user interactions should be included because they affect the most users. Known failure modes representing problems you have seen before should be tested to ensure they do not recur. Recent changes should be tested for behaviors affected by new code to catch regressions introduced by specific modifications.
Eval Dashboards and Reporting
Make eval results visible and actionable. Teams improve what they measure.
Stakeholder Reporting
Different stakeholders need different views of eval results tailored to their concerns and decision-making needs. Engineering teams need detailed eval results with failure cases and debugging information they can use to identify and fix problems. Product teams need quality trends, comparison with targets, and user impact assessment to make prioritization decisions and understand how quality affects user satisfaction. Leadership needs high-level quality metrics with risk indicators and trajectory to understand overall product health and make strategic decisions about investment and risk tolerance.
Practical Example: Building an Eval-Driven Workflow
The HealthMetrics team was building AI-assisted diagnosis support and needed to improve their AI's accuracy across fifty diagnostic categories while maintaining rigorous safety standards. Their problem was that without clear eval criteria, engineers optimized for metrics that did not correlate with clinical outcomes, wasting effort on improvements that did not matter. They faced a dilemma: how do you make evaluation central without slowing down rapid iteration?
The team decided to implement an eval-first workflow with an automated eval pipeline. They built a golden standard dataset with five thousand clinician-labeled cases to establish ground truth for evaluation. They created an automated eval pipeline that ran on every code change, with quality gates that blocked deployment if accuracy dropped below thresholds. They created a dashboard showing accuracy by diagnostic category so teams could see exactly where they stood.
Within six months, accuracy improved from seventy-eight percent to ninety-one percent across categories. They achieved zero safety incidents during this period. Deployment cadence actually increased because confidence in changes was higher. The lesson is that eval infrastructure is an investment that accelerates development, not a brake on it. Teams with good evals move faster because they have confidence in their changes.
What's Next
Eval-driven development connects to the observability and reliability disciplines covered in the next chapters. Chapter 22 on observability covers instrumenting AI systems to capture the data that makes evaluation possible, as understanding what is happening inside your AI system is a prerequisite to evaluating whether it is happening correctly. Chapter 23 on reliability addresses building guardrails and recovery mechanisms based on eval insights, since evals reveal failure modes and reliability patterns prevent those failures from reaching users. Chapter 27 on post-launch covers production evaluation and continuous monitoring, recognizing that the eval loop never stops and production monitoring is eval infrastructure running in real time.
The Eval Thread
Evaluation is the thread that connects all parts of AI product development. It informs requirements (what does good look like?), guides implementation (are we there yet?), validates quality (did we succeed?), and enables improvement (what should we fix next?). Without this thread, development becomes guesswork.
Exercise: Eval-Driven Development Audit
Audit your current AI product development process by working through these steps. First, list your current evaluation methods for each major feature to understand what you are already measuring. Second, identify gaps where you are relying on implicit rather than explicit evaluation, meaning cases where quality is judged informally rather than through structured evals. Third, for each gap, design an eval that would make quality visible and measurable. Fourth, estimate the effort to implement each eval to understand the investment required. Fifth, prioritize the gaps and create a plan to close them, starting with high-impact evals that are relatively easy to implement.
Example audit template
| Feature | Current Eval | Gap | Proposed Eval | Effort | Priority |
|---------|--------------|-----|---------------|--------|----------|
| Search | Manual test of queries | No metric tracking | Automated precision/recall | Medium | High |
Key Takeaway
Eval-driven development is not an additional phase. It is the discipline that makes every other phase effective. By making quality measurable before you build, you give yourself the ability to know when you have succeeded.
Bibliography
Foundational Papers
- Comprehensive study of LLM-as-Judge reliability and failure modes, providing a benchmark for judge quality.
- Wang, J., et al. (2023). "Self-Eval: Automatic Evaluation of LLMs as Judges." arXiv:2312.09210. Proposes methods for improving LLM judge reliability through self-evaluation and calibration.
Tools and Frameworks
- OpenAI Evals. (2024). Open-source evaluation framework. Production-grade eval framework supporting diverse eval types and integration with CI/CD.
- LLM Jury. (2024). Structured LLM evaluation framework. Implements structured evaluation protocols with multiple judge models and disagreement detection.
Industry Case Studies
- Anthropic. (2024). "Building Effective Evals for Claude." Detailed guide to building evals for production AI systems, with a focus on practical implementation.