Objective: Learn the USID.O framework for structuring AI product requirements that embrace probabilistic behavior while maintaining clear acceptance criteria.
"Traditional requirements assume deterministic behavior. AI products require requirements that embrace probability, define acceptable error rates, and specify evaluation criteria alongside functional requirements."
AI Product Requirements Handbook
The PM writes: "AI should work correctly." The engineer asks: "What does correctly mean?" The PM says: "You know, correctly." This is why AI requirements are hard, and why 90% of AI PRDs are still just "make the AI do the thing."
9.1 The USID.O Framework for AI
The USID.O framework (Understand, Specify, Implement, Deliver, Operate) provides a structured approach to AI product requirements that accounts for the probabilistic nature of AI systems.
Why Traditional Requirements Fail for AI
Traditional product requirements rest on four assumptions:
Deterministic output - the same input always produces the same output
Clear acceptance - requirements are either met or not met
Complete specification - all edge cases can be enumerated
Perfect testability - every requirement can be definitively verified
AI systems violate all of these assumptions. They are probabilistic, their outputs vary, edge cases are numerous, and verification is statistical.
U - Understand: Define what success looks like in user terms, not AI terms
S - Specify: Create measurable acceptance criteria including error rates and confidence thresholds
I - Implement: Build evaluation infrastructure before building features
D - Deliver: Define rollout criteria based on evaluation results
O - Operate: Monitor in production and iterate based on real-world performance
The USID.O framework is itself an eval-first methodology. For each AI feature, define how you will measure success at every stage. A micro-eval for USID.O implementation tracks: evaluation dataset quality before implementation, threshold attainment rates at each milestone, and production metric correlation with eval metrics. QuickShip's eval-first insight: they discovered their eval metrics (accuracy, latency) only weakly correlated with user satisfaction scores. They added UX metrics to their evaluation framework and saw NPS improve 15 points because they were optimizing for what users actually cared about.
U: Understand (User Outcomes First)
Start with user outcomes, not AI capabilities. Teams should ask what user problem the AI solves, how success looks to the user, what the cost of AI errors is to users, and what users are willing to tolerate in terms of accuracy tradeoffs.
Wrong start: "Summarize meetings with 90% accuracy"
Right start: "Help users find action items from meetings without reading full transcripts"
User outcome question: "What percentage of action items should be captured?"
Error cost question: "What happens if we miss an action item?"
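The two questions above map directly onto recall and its complement, the miss rate. A minimal sketch (the function name and example action items are illustrative):

```python
def action_item_scores(predicted: set, actual: set):
    """Compare extracted action items against a ground-truth set."""
    true_pos = predicted & actual
    recall = len(true_pos) / len(actual) if actual else 1.0
    precision = len(true_pos) / len(predicted) if predicted else 1.0
    return precision, recall

actual = {"send deck to legal", "book Q3 review", "update pricing page"}
predicted = {"send deck to legal", "book Q3 review", "schedule offsite"}
p, r = action_item_scores(predicted, actual)
# r answers "what percentage of action items were captured";
# 1 - r is the miss rate that the error-cost question prices.
```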
Accuracy percentages obscure distribution. Saying "90% accurate" sounds good but tells you nothing about which 10% fails. A system that misses 10% of easy cases is very different from one that fails on 10% of hard cases. Specify what kinds of errors are acceptable and which are not. A 90% accurate medical diagnosis AI that misses 10% of cancers is not 90% good; it is dangerous.
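One way to surface what a headline accuracy hides is to stratify results by case difficulty. A small sketch (the stratum labels and counts are made up) showing a system that is "90% accurate" overall yet fails half of its hard cases:

```python
from collections import defaultdict

def accuracy_by_stratum(results):
    """results: list of (stratum, correct) pairs, e.g. ('hard', False)."""
    buckets = defaultdict(list)
    for stratum, correct in results:
        buckets[stratum].append(correct)
    return {s: sum(v) / len(v) for s, v in buckets.items()}

# 90/100 correct overall, but the errors all cluster in the hard stratum.
results = [("easy", True)] * 80 + [("hard", True)] * 10 + [("hard", False)] * 10
scores = accuracy_by_stratum(results)
print(scores)  # easy: 1.0, hard: 0.5
```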
HealthMetrics applies the USID.O framework to their clinical data coordination product:
Understand - Hospital administrators need to reduce patient wait times and improve resource utilization without compromising care quality.
Specify - AI suggestions must be explainable to clinical staff, achieve 90% accuracy on bed assignments, and defer to human judgment when confidence is below 75%.
Implement - Build an eval pipeline with clinical safety as the primary metric before implementing any routing logic.
Deliver - Roll out to one department first, measure escalation rates, and expand only when the human override rate is below 5%.
Operate - Monitor the trust trajectory; if clinical staff consistently override the AI, the eval metrics are not matching user needs.
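The Specify-phase deferral rule in the HealthMetrics example reduces to a few lines of routing logic. A hedged sketch: only the 75% confidence floor and the explainability requirement come from the text; the function name and return shape are assumptions.

```python
CONFIDENCE_FLOOR = 0.75  # below this, defer to human judgment (per the spec)

def route_bed_assignment(suggestion: str, confidence: float) -> dict:
    """Route an AI bed-assignment suggestion per the Specify-phase rules."""
    if confidence < CONFIDENCE_FLOOR:
        return {"action": "defer_to_human", "suggestion": suggestion}
    return {
        "action": "suggest",
        "suggestion": suggestion,
        "explanation_required": True,  # must be explainable to clinical staff
    }
```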
Their eval-first approach revealed that administrators cared more about "explainability to patients" than they had initially specified, leading to a redesigned UX that showed patients why a particular bed assignment was recommended.
S: Specify (Measurable Criteria)
Specify requirements that can be measured statistically, across several dimensions:
Accuracy metrics - precision, recall, F1, and AUROC (see the evaluation methods section for details)
Error rate thresholds - acceptable false positive rates, e.g. "no more than 5% false positives"
Confidence calibration - "90% confidence should be correct 90% of the time," i.e. well-calibrated uncertainty
Latency requirements - e.g. "95th percentile response under 2 seconds"
Fairness metrics - performance must not vary significantly across user groups
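The accuracy and calibration dimensions can be computed from confusion-matrix counts and binned confidence scores. A minimal sketch (the counts are illustrative):

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Accuracy metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def calibration_gap(stated_confidence: float, observed_accuracy: float) -> float:
    """Well-calibrated: 90% confidence should be correct ~90% of the time."""
    return abs(stated_confidence - observed_accuracy)

p, r, f1 = precision_recall_f1(tp=85, fp=5, fn=15)  # illustrative counts
gap = calibration_gap(0.90, 0.82)  # 8-point gap: the model is overconfident
```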
I: Implement (Eval First)
Build evaluation infrastructure before implementing features. This includes creating evaluation datasets with ground truth, establishing automated evaluation pipelines (see the architecture patterns section for details), defining thresholds for acceptable performance, and building human evaluation workflows for edge cases that require human judgment.
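An eval-first pipeline can start as a simple gate over a ground-truth dataset. A sketch under stated assumptions: the model is any callable, the dataset is (input, ground truth) pairs, and the 90% floor is illustrative.

```python
def run_eval(model, dataset, accuracy_floor=0.90):
    """Score a model against ground truth and gate on a threshold."""
    correct = sum(1 for x, y in dataset if model(x) == y)
    accuracy = correct / len(dataset)
    return {"accuracy": accuracy, "passed": accuracy >= accuracy_floor}

# Stub "model" and tiny dataset, just to exercise the gate.
answers = {"2+2": "4", "3+3": "6", "5+5": "10"}
dataset = list(answers.items())
report = run_eval(answers.get, dataset)
```

The point of building this before the feature is that the thresholds exist the moment the first real model output does.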
D: Deliver (Rollout Based on Metrics)
Define explicit rollout criteria. Teams should determine what evaluation metrics must pass before launch, what percentage of users get the feature first, what rollback triggers are defined, and what monitoring is in place post-launch to catch any issues.
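Explicit rollout criteria can be encoded as a table of launch gates checked before each expansion. The metric names and thresholds below are assumptions for illustration (the 5% override threshold echoes the HealthMetrics example):

```python
# Each gate: metric name -> (threshold, "min" means value must meet or exceed,
# "max" means value must stay at or below).
LAUNCH_GATES = {
    "eval_accuracy": (0.90, "min"),
    "p95_latency_s": (2.0, "max"),
    "human_override_rate": (0.05, "max"),
}

def ready_to_expand(metrics: dict) -> bool:
    """True only if every launch gate passes."""
    for name, (threshold, kind) in LAUNCH_GATES.items():
        value = metrics[name]
        ok = value >= threshold if kind == "min" else value <= threshold
        if not ok:
            return False
    return True
```

A failing gate doubles as a rollback trigger when the same check runs against post-launch metrics.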
O: Operate (Monitor and Iterate)
AI products require ongoing monitoring. Teams should track performance metrics in production, monitor for distribution shift that indicates model drift, collect user feedback on AI quality, and iterate based on real-world performance rather than relying solely on pre-launch evaluation.
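Distribution shift can be flagged with a standard drift statistic such as the Population Stability Index over binned input features. A minimal sketch with invented bin proportions; the 0.2 cutoff is a common rule of thumb, not a universal constant:

```python
import math

def population_stability_index(expected, actual):
    """PSI between two binned distributions (proportions summing to 1)."""
    return sum((a - e) * math.log(a / e)
               for e, a in zip(expected, actual) if e > 0 and a > 0)

baseline = [0.25, 0.25, 0.25, 0.25]  # input feature bins at launch
today = [0.10, 0.20, 0.30, 0.40]     # same bins in production traffic
psi = population_stability_index(baseline, today)
if psi > 0.2:
    print(f"Drift detected (PSI={psi:.2f}); re-run evals on fresh data")
```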
What's Next?
Next, we explore Eval-First PRDs, understanding how to write product requirements documents that start with evaluation criteria.