"The moment your AI ships to production, the real eval begins. Every user interaction is data. The question is whether you are collecting it."
A Principal Engineer Who Watches Their Metrics
Why Production Evals Matter
Chapter 21 introduced evaluation as the core development discipline for AI products. That chapter focused on pre-launch evaluation: building evals before you write feature code, using them to guide development, and running them in CI to catch regressions before shipping. But the launch does not end the eval story. It begins a new chapter.
Production evals measure what your AI actually does with real users, real data, and real stakes. Pre-launch evals tell you if your system meets specifications. Production evals tell you if your specifications were right in the first place.
The Eval Lifecycle
Pre-launch evals optimize what you expect. Production evals discover what you did not expect. The most dangerous failures in AI products are not the ones you anticipated and chose to accept. They are the ones you never imagined.
Running Evals on Live Traffic
Live traffic evaluation means periodically sampling production requests and running them through your evaluation pipeline. You capture the inputs your AI received in production, process them through your eval framework, and measure how your system performs on real-world data.
Sampling Strategies
You cannot evaluate every production request. The computational cost would be prohibitive, and you do not need to. Effective sampling strategies give you statistically meaningful signal without evaluating everything.
Random sampling evaluates a random percentage of production requests. It is simple and unbiased but may miss rare edge cases.
Stratified sampling ensures you capture examples from each important segment: new users, power users, specific geographic regions, and high-stakes interactions.
Tail-based sampling prioritizes requests that exhibit concerning patterns such as high latency, unusual input lengths, failed requests, or flagged content.
Configurable sampling means building your system so that sampling rates can be adjusted without a deployment; when you ship a risky change, you can temporarily increase the rate.
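The four strategies above can be combined in a single sampler. The sketch below is illustrative, not from any framework: the `EvalSampler` class, its field names, and the request dictionary keys (`latency_ms`, `failed`, `flagged`, `segment`) are all hypothetical choices for this example.

```python
import random

class EvalSampler:
    def __init__(self, base_rate=0.05, tail_rate=1.0, latency_threshold_ms=2000):
        self.base_rate = base_rate                    # random sampling rate
        self.tail_rate = tail_rate                    # rate for concerning requests
        self.latency_threshold_ms = latency_threshold_ms

    def set_base_rate(self, rate):
        """Configurable sampling: adjust the rate without a deployment."""
        self.base_rate = rate

    def should_sample(self, request):
        # Tail-based: prioritize requests with concerning patterns.
        if (request.get("latency_ms", 0) > self.latency_threshold_ms
                or request.get("failed", False)
                or request.get("flagged", False)):
            return random.random() < self.tail_rate
        # Stratified: boost a rare but important segment.
        if request.get("segment") == "new_user":
            return random.random() < self.base_rate * 2
        # Random: unbiased default for everything else.
        return random.random() < self.base_rate
```

A deployment-free rate change is then just a call to `set_base_rate` from whatever configuration mechanism you already use.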
Eval Pipeline in Production
Your eval pipeline from Chapter 21 does not change when you ship. It runs the same way it did in development, except now it consumes live data. The pipeline captures production requests, including input, output, and metadata, then stores them in an eval buffer with appropriate privacy handling. Periodically, it runs an eval batch over the buffered requests, aggregates results, updates dashboards, and alerts on regressions or significant changes.
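The stages above can be sketched as a small loop. This is a deliberately minimal stand-in, not a real framework: `capture`, `run_eval_batch`, the `grader` callable, and the print-based alert are all hypothetical placeholders for your actual pipeline components.

```python
import statistics

eval_buffer = []

def capture(request_input, output, metadata):
    """Stage 1: capture production requests (assumes PII is already scrubbed)."""
    eval_buffer.append({"input": request_input, "output": output, "meta": metadata})

def run_eval_batch(grader, alert_threshold=0.85):
    """Stages 2-4: grade the buffered batch, aggregate, and alert on regressions."""
    if not eval_buffer:
        return None
    scores = [grader(item["input"], item["output"]) for item in eval_buffer]
    mean_score = statistics.mean(scores)
    if mean_score < alert_threshold:
        print(f"ALERT: eval score {mean_score:.2f} below {alert_threshold}")
    eval_buffer.clear()   # buffered requests are consumed once graded
    return mean_score
```

In a real system the buffer would live in durable storage and the batch would run on a schedule; the shape of the loop stays the same.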
Practical Example: QuickShip Route Quality Monitoring
Who: QuickShip engineering team monitoring delivery route suggestions
Situation: After launch, the team noticed complaints about routes that seemed suboptimal
Problem: Pre-launch evals used historical routes, but production revealed patterns they never anticipated
Dilemma: How to measure route quality without manual review of every suggestion?
Decision: Implemented live traffic sampling with automated route comparison against known-good baselines
How: Captured 5% of production routes, ran them through their offline eval system, tracked deviation from expected routes. Alert threshold: >15% deviation triggers review.
Result: Caught a data pipeline issue causing 8% of routes to ignore real-time traffic. Fixed within 48 hours of detection.
Lesson: Production evals caught what pre-launch evals missed: the interaction between your system and the live world.
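QuickShip's deviation check can be approximated in a few lines. The functions and the `>15%` threshold mirror the example above, but the data shapes (`request_id`, `route`, the baseline lookup) are assumptions for illustration, not QuickShip's actual schema.

```python
def route_deviation_rate(sampled_routes, baseline_lookup):
    """Fraction of sampled routes that deviate from the known-good baseline.
    baseline_lookup maps a request id to the expected route."""
    deviations = sum(
        1 for r in sampled_routes
        if r["route"] != baseline_lookup.get(r["request_id"])
    )
    return deviations / len(sampled_routes)

def needs_review(sampled_routes, baseline_lookup, threshold=0.15):
    # The example's alert rule: >15% deviation triggers review.
    return route_deviation_rate(sampled_routes, baseline_lookup) > threshold
```

The 8% pipeline issue in the example would sit below this alert threshold in aggregate, which is why segmenting the deviation rate (by region, by time of day) matters as much as the top-line number.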
A/B Testing AI Outputs
A/B testing for AI products requires different thinking than A/B testing for traditional software. With deterministic systems, you test which version works. With probabilistic systems, you test distributions of outcomes.
Unique Challenges for AI A/B Tests
Output variance: The same input produces different outputs, requiring larger sample sizes to distinguish real differences from random variation.
Evaluation latency: The effect of an AI output may not be visible for minutes, hours, or days. Route quality shows up in delivery times; recommendation quality shows up in repeat purchases.
Cold-start problems: New users have no baseline behavior and early interactions with AI may determine long-term engagement.
Contextual effects: The same AI output may work well for one user and poorly for another, so simple conversion metrics hide these differences.
Experiment Design for AI
Effective AI experiments require four elements.
Primary metrics define the single outcome that matters most, such as assisted sales conversion for RetailMind or on-time delivery rate for QuickShip.
Guardrail metrics define what must not degrade: even if conversion improves, you cannot accept higher return rates or worse customer satisfaction.
Segmentation analysis breaks results down by user segment, because an experiment that wins for new users may lose for power users.
Sufficient runtime captures full user journeys, since AI recommendations that seem engaging may lead to returns or churn.
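The primary-plus-guardrails decision rule can be sketched as a single check. This is a simplified illustration, not a statistics library: the metric dictionary shape is hypothetical, the comparison ignores significance testing, and it assumes every guardrail is a higher-is-better metric (invert a rate like returns before passing it in).

```python
def evaluate_experiment(control, treatment, guardrail_tolerance=0.01):
    """Treatment wins only if the primary metric improves AND no guardrail
    degrades beyond tolerance. Expected shape (illustrative):
    {"primary": 0.12, "guardrails": {"csat": 4.2}}."""
    if treatment["primary"] <= control["primary"]:
        return False
    for name, control_value in control["guardrails"].items():
        treatment_value = treatment["guardrails"][name]
        # A guardrail drop larger than the tolerance vetoes the win,
        # no matter how much the primary metric improved.
        if control_value - treatment_value > guardrail_tolerance:
            return False
    return True
```

A production version would add confidence intervals and run this per segment, so that a win for new users cannot mask a loss for power users.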
A/B Testing Prompts, Not Just Models
Most AI A/B testing focuses on model selection. But prompt changes often have larger impact than model changes, and they deploy faster. Test prompt variations systematically. The difference between "Can you help me find products?" and "What are you looking for today?" may matter more than switching from one model to another.
Shadow Mode Deployments
Shadow mode runs your new AI in parallel with production, consuming real requests but not returning results to users. The new system observes; the old system responds.
When to Use Shadow Mode
Shadow mode serves multiple purposes including testing a new model version before full rollout, validating a new prompt strategy with real inputs, calibrating confidence thresholds before enabling automated actions, and collecting training data for future model improvements.
Implementation Considerations
Shadow mode requires careful infrastructure. Request routing must copy each request to both systems. The shadow system's latency must not affect user-facing response time. Shadow outputs must be logged but never exposed to users. Clocks must be synchronized so comparisons against production are accurate.
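A minimal sketch of the request path, assuming `call_production` and `call_shadow` are placeholders for your two systems: the shadow call runs off the hot path on a background thread so its latency cannot affect the user, and its output goes only to a log.

```python
import threading
import time

shadow_log = []

def call_production(request):
    return f"prod-response:{request}"    # stand-in for the live system

def call_shadow(request):
    return f"shadow-response:{request}"  # stand-in for the candidate system

def handle_request(request):
    """Serve the production response; the shadow system only observes."""
    started = time.time()  # shared timestamp for accurate comparison later
    def run_shadow():
        output = call_shadow(request)
        shadow_log.append({"request": request, "output": output, "ts": started})
    threading.Thread(target=run_shadow, daemon=True).start()
    return call_production(request)      # only this ever reaches the user
```

In practice the copy usually happens at a queue or proxy rather than in application code, but the invariants are the same: shared input, shared timestamp, isolated output.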
Practical Example: RetailMind Shadow Mode for Shopping Assistant
Who: RetailMind team testing a new shopping assistant prompt
Situation: The team had a prompt that performed well in offline evals but worried about edge cases in production
Problem: How to test with real in-store traffic without risking bad customer experiences?
Decision: Shadow mode deployment for 2 weeks before any user-facing rollout
How: New prompt processed all requests alongside production prompt. Outputs logged but users saw only production responses. Comparison automated: did shadow responses differ meaningfully?
Result: Shadow mode caught 3 edge cases where new prompt suggested out-of-stock items or misunderstood store layout questions. Fixed before any customer saw them.
Lesson: Shadow mode lets you learn from production without production risk.
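RetailMind's automated "did shadow responses differ meaningfully?" check can be approximated with a crude normalizer. Both functions are illustrative: exact match after lowercasing and whitespace collapsing is a deliberately simple stand-in for the semantic comparison a real system would use.

```python
def meaningful_diff(prod_output, shadow_output):
    """Flag a pair whose outputs differ after normalization."""
    normalize = lambda s: " ".join(s.lower().split())
    return normalize(prod_output) != normalize(shadow_output)

def diff_rate(pairs):
    """Fraction of logged (production, shadow) pairs that differ meaningfully.
    A rising rate means the new prompt is behaving differently; each flagged
    pair is a candidate for human review."""
    flagged = [p for p in pairs if meaningful_diff(*p)]
    return len(flagged) / len(pairs)
```

The value of the automation is triage: reviewers look only at flagged pairs, which is how a two-week shadow run surfaces three edge cases instead of thousands of identical transcripts.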
Canary Releases for AI
Canary releases gradually shift traffic from old to new, starting with a small percentage and increasing as confidence grows. For AI products, canaries require monitoring beyond traditional metrics.
Staged Canary for AI
Staged canary releases for AI follow a progression:
1-5% rollout: Monitor basic health with no user-visible impact, checking error rates, latency, and system health.
5-25% rollout: Begin statistical monitoring comparing AI output distributions between old and new.
25-50% rollout: Segment monitoring watches for differential impact across user segments.
50-100% rollout: Full deployment with continued monitoring. The canary never fully ends; you always watch for regressions.
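The staged progression above amounts to a gate at each traffic fraction. A minimal sketch, assuming three boolean check results fed in from your monitoring (the function and its parameters are hypothetical; a real system would also roll back on failure rather than just hold):

```python
STAGES = [0.05, 0.25, 0.50, 1.00]  # traffic fractions from the progression above

def next_stage(current_fraction, health_ok, distributions_match, segments_ok):
    """Advance the canary one stage only if every check required at the
    current stage passes; otherwise hold at the current fraction."""
    checks = {
        0.05: health_ok,                                        # errors, latency
        0.25: health_ok and distributions_match,                # + output stats
        0.50: health_ok and distributions_match and segments_ok # + per-segment
    }
    if not checks.get(current_fraction, True):
        return current_fraction
    idx = STAGES.index(current_fraction)
    return STAGES[min(idx + 1, len(STAGES) - 1)]
```

At 100% the function returns 100%: full deployment is just the final stage of a gate that, per the progression above, never really closes.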
AI-Specific Canary Metrics
AI-specific canary metrics go beyond traditional health checks.
Output distribution shift detects whether the distribution of AI outputs has changed significantly.
Confidence score changes verify whether AI confidence scores still track actual accuracy.
Error category shifts identify whether you are trading one type of error for another.
User feedback rates show whether users are flagging or correcting AI outputs more or less often.
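Output distribution shift is the most mechanical of these to compute. One common approach is the population stability index (PSI) over output categories; the sketch below assumes you can bucket outputs into categories and count them for both the baseline and the canary. The usual rule of thumb reads PSI below 0.1 as stable and above 0.25 as a significant shift.

```python
import math

def population_stability_index(baseline_counts, canary_counts):
    """PSI across output categories, comparing canary to baseline.
    Inputs are {category: count} dicts; zero counts are floored to
    avoid division by zero and log(0)."""
    total_b = sum(baseline_counts.values())
    total_c = sum(canary_counts.values())
    psi = 0.0
    for category in baseline_counts:
        p = max(baseline_counts[category] / total_b, 1e-6)
        q = max(canary_counts.get(category, 0) / total_c, 1e-6)
        psi += (q - p) * math.log(q / p)
    return psi
```

Alerting on PSI during the 5-25% stage is one concrete way to implement the "begin statistical monitoring" step of the staged canary.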
Canary Is Insurance, Not Just Rollout
Think of canary releases as insurance against unknown failure modes. The cost is the infrastructure to run two systems and the attention to monitor them. The benefit is catching problems before they affect everyone. For high-stakes AI applications, canary releases are not optional.