Part II: Discovery and Design
Chapter 9

Requirements for Probabilistic Products

9.4 Acceptance Criteria for AI

Objective: Learn to define and verify acceptance criteria for AI features that embrace probabilistic behavior while maintaining quality standards.

"Acceptance criteria for AI must define not just what success looks like, but how often success is required and how we measure it."


AI acceptance criteria differ from traditional software acceptance criteria. Rather than pass/fail on specific behaviors, AI acceptance criteria define statistical thresholds and quality distributions.

The AI Acceptance Criteria Challenge

Traditional software acceptance: "Given input X, system returns output Y"

AI acceptance criteria: "Given inputs similar to X, system returns outputs meeting quality threshold Z% of the time"

The 85% Problem

"85% accuracy" sounds good until you realize it means 15 out of 100 diagnoses are wrong. In healthcare. Per day. That's a lot of wrong. But it's still "passing" the acceptance criteria.

Types of AI Acceptance Criteria

Acceptance Criteria Categories

1. Performance Metrics

Accuracy, precision, recall, F1, AUROC, BLEU, ROUGE, etc.

2. Quality Thresholds

"90% of outputs must be rated 'good' by human evaluators"

3. Error Rate Limits

"False positive rate must be below 5%"

4. Latency Requirements

"95th percentile response time under 2 seconds"

5. Fairness Criteria

"Performance variance across demographic groups under 5%"

Defining Measurable Thresholds

Threshold Setting Principles

Threshold setting follows several important principles:

1. Establish the baseline first. Measure current performance before setting targets, so that targets are realistic.

2. Consider user tolerance. Understand what error rate users actually accept rather than assuming.

3. Respect business constraints. Determine what thresholds make business sense given resource limitations.

4. Ensure statistical validity. Sample sizes must be large enough to support the thresholds being set.

5. Plan an iteration path. Thresholds should be achievable in reasonable iterations rather than requiring moonshots.
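The statistical-validity principle can be made concrete. Under the standard normal (Wald) approximation, estimating a pass rate p to within a margin of error e at roughly 95% confidence requires about z²·p(1−p)/e² samples. A small sketch, with an illustrative function name:

```python
import math

def min_sample_size(p, margin, z=1.96):
    """Approximate sample size to estimate a proportion p within +/- margin.

    Uses the Wald normal approximation; z = 1.96 corresponds to ~95% confidence.
    """
    return math.ceil(z**2 * p * (1 - p) / margin**2)
```

For example, verifying an 85% threshold to within ±5 percentage points requires roughly 196 labeled examples, which is why thresholds asserted from a few dozen samples are rarely trustworthy.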

Eval-First in Practice

Before setting acceptance criteria thresholds, define how you will measure whether thresholds were set correctly. A micro-eval for acceptance criteria tracks: threshold attainment rate in production, user satisfaction at threshold-met versus threshold-exceeded, and regression detection rate. QuickShip's eval-first insight: their initial acceptance criteria required 85% accuracy on route suggestions. After measuring, they found that routes meeting 75% accuracy had equal user satisfaction because users could easily override. They relaxed thresholds to focus engineering on higher-value features.

Verification Approaches

Automated Testing

For criteria that can be automatically measured, teams should use unit tests for individual model outputs, integration tests for pipelines, and regression tests against known-good outputs to catch regressions quickly.
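A regression test against known-good outputs can be sketched as follows. Here summarize() is a hypothetical stand-in for the real model call, and the golden cases are illustrative; in practice the checks would assert properties of the output (required phrases, extracted action items) rather than exact strings, since exact-match tests are brittle for probabilistic systems:

```python
# Hypothetical golden dataset: each case pairs an input with phrases
# that any acceptable summary must preserve.
GOLDEN_CASES = [
    {"input": "Alice: we ship Friday. Bob: I'll draft the release notes.",
     "must_contain": ["ship Friday", "draft the release notes"]},
]

def summarize(text):
    # Stand-in for the real model call; replace with the actual pipeline.
    return text

def test_known_good_outputs():
    for case in GOLDEN_CASES:
        output = summarize(case["input"])
        for phrase in case["must_contain"]:
            assert phrase in output, f"regression: {phrase!r} missing from summary"
```

Run under pytest, a failure here signals that a model or prompt change regressed a previously verified behavior.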

Human Evaluation

For criteria requiring human judgment, teams should use sample-based review of outputs to assess quality on representative cases, side-by-side comparison of alternatives to determine preference, and user satisfaction surveys to capture user experience.

Statistical Testing

For verifying performance claims statistically, teams should use confidence intervals for metrics to understand the range of possible values, significance testing for improvements to ensure changes are real rather than noise, and A/B testing in production to measure actual user behavior.
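A percentile bootstrap is one simple way to put a confidence interval around a measured pass rate. This sketch assumes per-example 0/1 outcomes and uses only the standard library; the function name and defaults are illustrative:

```python
import random

def bootstrap_ci(outcomes, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a 0/1 success rate."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample with replacement and record each resample's success rate.
    rates = sorted(
        sum(rng.choice(outcomes) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = rates[int(alpha / 2 * n_resamples)]
    hi = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

On 100 samples with 85 successes, the resulting ~95% interval spans roughly 0.78 to 0.92, a reminder that a small eval set leaves wide uncertainty around an "85% accuracy" claim.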

Acceptance Criteria Example
Feature: AI Meeting Summarizer

ACCEPTANCE CRITERIA:

AC-1: Action Item Extraction
  Metric: Recall of action items
  Target: > 85% recall
  Verification: Human evaluation on 100-meeting dataset
  Minimum: 80% recall

AC-2: Summary Coherence  
  Metric: LLM-as-Judge coherence score
  Target: Average score > 4.0/5.0
  Verification: Automated evaluation
  Minimum: 3.5/5.0

AC-3: False Positive Rate
  Metric: % of non-action items marked as actions
  Target: < 10%
  Verification: Human evaluation on dataset
  Minimum: < 15%

AC-4: Latency
  Metric: Time to generate summary
  Target: p95 < 5 seconds
  Verification: Automated performance testing
  Minimum: p95 < 10 seconds

AC-5: Fairness
  Metric: Performance variance by speaker accent
  Target: < 5% variance in recall
  Verification: Stratified evaluation dataset
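One way to keep criteria like these machine-checkable is to encode them as data with explicit target and minimum thresholds. The sketch below mirrors AC-1 through AC-4 from the example above; the dictionary layout and function name are illustrative, and "min" here means the worst tolerable value in each metric's direction:

```python
# Target and minimum thresholds, with the comparison direction per metric.
CRITERIA = {
    "AC-1 action-item recall":   {"min": 0.80, "target": 0.85, "higher_is_better": True},
    "AC-2 coherence score":      {"min": 3.5,  "target": 4.0,  "higher_is_better": True},
    "AC-3 false positive rate":  {"min": 0.15, "target": 0.10, "higher_is_better": False},
    "AC-4 p95 latency (s)":      {"min": 10.0, "target": 5.0,  "higher_is_better": False},
}

def check(measured):
    """Map each measured value to pass (target met), warn (minimum met), or fail."""
    report = {}
    for name, spec in CRITERIA.items():
        value = measured[name]
        if spec["higher_is_better"]:
            status = ("pass" if value >= spec["target"]
                      else "warn" if value >= spec["min"] else "fail")
        else:
            status = ("pass" if value <= spec["target"]
                      else "warn" if value <= spec["min"] else "fail")
        report[name] = status
    return report
```

Encoding criteria this way lets the same table drive dashboards, CI gates, and release reviews from a single source of truth.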

Acceptance vs. Rejection

Define explicit rejection criteria, not just acceptance:

1. Hard blocks: criteria that must pass or the feature cannot ship under any circumstances.

2. Warnings: criteria that trigger review but may allow shipping with appropriate sign-off.

3. Deferred: criteria deemed important but addressed in future iterations rather than blocking the current release.
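The hard-block versus warning distinction can be encoded as a small release gate. Criterion names, the per-criterion result vocabulary ("pass", "warn", "fail"), and the decision labels below are all illustrative:

```python
# Criteria whose failure blocks release unconditionally (illustrative names).
HARD_BLOCKS = {"action-item recall", "false positive rate"}

def ship_decision(results):
    """Decide ship status from {criterion: "pass" | "warn" | "fail"}."""
    blockers = [c for c, s in results.items()
                if s == "fail" and c in HARD_BLOCKS]
    warnings = [c for c, s in results.items()
                if s in ("warn", "fail") and c not in HARD_BLOCKS]
    if blockers:
        return "blocked", blockers          # hard block: cannot ship
    if warnings:
        return "needs sign-off", warnings   # warning: review before shipping
    return "ship", []
```

Making this logic explicit prevents the common failure mode where "warnings" silently accumulate until a release ships with several criteria quietly unmet.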

Common Acceptance Criteria Mistakes

Teams should avoid several common acceptance criteria mistakes:

1. Vague thresholds: "good quality" is not measurable and leaves teams without clear targets.

2. No minimums: operating without floor thresholds that protect against poor quality.

3. Ignoring error types: failing to recognize that not all errors are equal in severity or impact.

4. Single-metric focus: optimizing for one metric at the expense of others, creating unintended consequences.

Acceptance Criteria Document

Each AI feature should have a documented set of acceptance criteria that:

1. Lists all acceptance criteria with their specific metrics and thresholds

2. Specifies how each criterion will be verified and by whom

3. Identifies which criteria are hard blocks versus warnings

4. Documents both baseline and target values

5. Names responsible parties for verification to ensure accountability

What's Next?

This completes Chapter 9. You have learned about the USID.O framework, eval-first PRDs, LLM-as-Judge evaluation, and acceptance criteria for AI. Next, in Chapter 10, we explore AI Cost and ROI, understanding the economics of AI product development.