"Every AI failure is a gift if you learn from it. The team that conducts thorough AI postmortems builds immune systems against future failures."
A Head of AI Who Has Seen Incidents
Why AI Postmortems Are Different
Traditional software postmortems focus on code defects, infrastructure failures, and human errors with clear causality. AI postmortems must grapple with probabilistic behavior, emergent failures from component interactions, and the challenge that the same input can produce different outputs over time.
An AI failure often has no single root cause in the traditional sense. It emerges from the interaction of model behavior, prompt design, retrieval quality, and user expectations. Effective AI postmortems must map this complex causality without falsely reducing it to a single factor.
The AI Failure Challenge
When a traditional system fails, the code points to what went wrong. When an AI system fails, the code often works exactly as written. The failure is in the behavior that emerges from the interaction of components. This requires a different debugging and postmortem approach.
Error Taxonomy for AI Systems
Classifying AI errors by type enables targeted fixes. A quality taxonomy also supports aggregate analysis: understanding which error types are most common, most impactful, and most cost-effective to fix.
Generation Errors
Generation errors occur in the content the model produces and cause incorrect or problematic outputs.

1. **Hallucination**: a confident false statement, such as claiming a law was passed that does not exist. Typical causes: missing or conflicting context and overconfident generation.
2. **Fact error**: an incorrect factual statement, like stating the wrong date for a historical event. Typical causes: training data staleness or retrieval gaps.
3. **Reasoning error**: a logical flaw in the model's reasoning, such as concluding that C is greater than A when A > B and B > C. Typical causes: model capability limitations or prompt framing issues.
4. **Format error**: output that does not match the required format, such as JSON missing required fields. Typical causes: prompt ambiguity or output validation gaps.
5. **Consistency error**: the model contradicts itself within the same response, saying yes in one paragraph and no in another. Typical cause: long context leading to attention drift.
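Of these categories, format errors are the most mechanically detectable. A minimal validator sketch, assuming a hypothetical response schema with `answer`, `sources`, and `confidence` fields:

```python
import json

# Hypothetical required schema for a structured chatbot response
REQUIRED_FIELDS = {"answer", "sources", "confidence"}

def classify_format_error(raw_output: str):
    """Label a format error in raw model output, or return None if it is well-formed."""
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return "format_error:invalid_json"
    if not isinstance(payload, dict):
        return "format_error:not_an_object"
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        return "format_error:missing_fields:" + ",".join(sorted(missing))
    return None
```

Running this on every response gives you a labeled stream of format errors for the aggregate analysis described above, rather than discovering them one user complaint at a time.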
Retrieval Errors
Retrieval errors occur in the retrieval component of RAG systems and prevent the right context from reaching the model.

1. **Miss**: a relevant document is not retrieved, such as a drug interaction warning that never enters context. Typical causes: embedding gaps, chunk boundary issues, or index coverage problems.
2. **False positive**: an irrelevant document is retrieved, like a contract clause surfaced for an unrelated query. Typical cause: semantic similarity without actual relevance.
3. **Rank degradation**: a relevant document is retrieved but ranked low, so the correct answer is in context but the model ignores it. Typical causes: ranking model weakness or context length bias.
4. **Staleness**: an outdated document is retrieved, like old pricing returned as current. Typical causes: an index that was not updated or metadata filter errors.
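Staleness is checkable at query time if the index records when each document was embedded. A small sketch, assuming each retrieved document carries a hypothetical `indexed_at` timestamp and a 24-hour freshness budget:

```python
from datetime import datetime, timedelta

# Hypothetical freshness budget: documents indexed more than 24h ago are suspect
MAX_INDEX_AGE = timedelta(hours=24)

def split_by_freshness(docs, now):
    """Partition retrieved documents into fresh and stale IDs by index age."""
    fresh, stale = [], []
    for doc in docs:
        bucket = stale if now - doc["indexed_at"] > MAX_INDEX_AGE else fresh
        bucket.append(doc["id"])
    return fresh, stale
```

A non-empty `stale` list can feed a freshness metric, or trigger the retrieval verification layer discussed under error prevention.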
Task Errors
Task errors occur in task decomposition and execution and cause agents to fail at multi-step processes.

1. **Tool misuse**: the wrong tool is selected for the task, such as a calculator instead of a database lookup. Typical cause: tool selection reasoning failure.
2. **Tool error propagation**: a tool failure cascades into the output, such as a weather API timeout producing an unrelated response. Typical cause: missing error handling in the tool chain.
3. **State confusion**: the agent loses track of task progress, for example re-asking for information already provided. Typical cause: context management failure.
4. **Escalation failure**: the agent does not escalate to human review when it should, such as providing medical advice instead of deferring to a professional. Typical causes: weak safety instructions or capability miscalibration.
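Escalation failures in particular can be guarded with an explicit gate rather than relying on the model's judgment. A sketch, assuming a hypothetical restricted-topic list and confidence threshold:

```python
# Hypothetical policy: these topics always require human review
RESTRICTED_TOPICS = {"medical", "legal", "financial_advice"}

def should_escalate(topic: str, confidence: float, threshold: float = 0.8) -> bool:
    """Escalate to a human when the topic is restricted or confidence is low."""
    return topic in RESTRICTED_TOPICS or confidence < threshold
```

The gate is deliberately outside the model: even a perfectly prompted agent can miscalibrate, so the escalation decision should not depend solely on the agent deciding to escalate.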
The Taxonomy Enables Prioritization
Not all errors are equal. Hallucinations in a customer-facing chatbot are more impactful than format errors. A good taxonomy lets you prioritize fixes by error frequency, user impact, and fixability. Use the taxonomy to build your error dashboard before trying to fix everything.
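One way to turn the taxonomy into a prioritization signal is to score each error type by frequency, impact, and fix cost. A sketch with hypothetical weights:

```python
from collections import Counter

def prioritize(error_log, impact_weight, fix_cost):
    """Rank error types by frequency * impact / fix cost (weights are hypothetical)."""
    counts = Counter(entry["type"] for entry in error_log)
    score = {t: counts[t] * impact_weight.get(t, 1.0) / fix_cost.get(t, 1.0)
             for t in counts}
    return sorted(score, key=score.get, reverse=True)
```

With weights like these, three high-impact hallucinations can outrank ten cheap-to-fix format errors, which is exactly the judgment the dashboard should make visible.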
AI Postmortem Process
Data Collection
Before any analysis, collect all relevant data to understand what happened.

1. **Full prompt**: the exact prompt that produced the failure, so you can reproduce and analyze the issue.
2. **Retrieval results**: the documents retrieved for context, to assess whether the right information was available.
3. **Model output**: the complete model response, to see exactly what the system produced.
4. **Trace data**: timing and span information for performance analysis.
5. **User context**: what the user was trying to accomplish, to understand the intent behind the request.
6. **System state**: the model version and configuration at the time of failure, to identify whether changes to the system contributed to the issue.
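These fields can be captured as a single record per failure so that nothing is lost between detection and analysis. A sketch, with hypothetical field names:

```python
from dataclasses import dataclass, field

@dataclass
class FailureSnapshot:
    """Everything needed to reproduce and analyze one AI failure."""
    prompt: str                  # exact prompt that produced the failure
    retrieved_docs: list         # documents that were in context
    model_output: str            # complete model response
    model_version: str           # system state at time of failure
    trace: dict = field(default_factory=dict)  # timing / span data
    user_intent: str = ""        # what the user was trying to accomplish
```

Writing this record at inference time, not during the postmortem, is what makes nondeterministic failures reproducible later.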
Failure Classification
Classify the failure using your error taxonomy, recognizing that more than one category may apply to a single incident.

1. Identify generation errors in the output: hallucinations, fact errors, format errors.
2. Identify retrieval errors in context: misses, false positives, staleness.
3. Identify task errors in the execution flow: tool misuse, escalation failures.
4. Determine which error type is the primary root cause that initiated the failure chain.
5. Record contributing factors even when they are not the root cause; they often point to defense-in-depth improvements.
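A classification record can enforce the primary-versus-contributing distinction from the steps above. A sketch using a hypothetical subset of the taxonomy:

```python
from enum import Enum

class ErrorType(Enum):
    HALLUCINATION = "hallucination"
    RETRIEVAL_MISS = "retrieval_miss"
    STALENESS = "staleness"
    TOOL_MISUSE = "tool_misuse"
    ESCALATION_FAILURE = "escalation_failure"

def classify_incident(primary: ErrorType, contributing: list) -> dict:
    """Record one primary root cause plus any contributing factors."""
    if primary in contributing:
        raise ValueError("primary cause must not also be listed as contributing")
    return {"primary": primary, "contributing": contributing}
```

Keeping the taxonomy as an enum rather than free-text labels is what makes the aggregate dashboards comparable across incidents.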
Root Cause Analysis
For AI failures, root cause analysis must go beyond immediate symptoms:
The Five Whys for AI Failures
Apply the Five Whys framework, adapted for AI systems, by tracing causality through the failure chain:

1. **Why did the user receive a wrong answer?** The model hallucinated.
2. **Why did the model hallucinate?** The context did not contain the correct information.
3. **Why was the correct information not in context?** Retrieval missed the relevant document.
4. **Why did retrieval miss the document?** Chunking split the relevant content across chunks.
5. **Why was chunking not optimized for this content type?** The chunking strategy was set globally, without content-type customization.

The final answer reveals the systemic issue that, if fixed, would have prevented the entire failure chain.
Postmortem Template for AI Failures
Incident Summary
## AI Incident Postmortem
**Incident ID**: INC-2024-0423
**Date**: April 23, 2024
**Severity**: High
**Duration**: ~65 minutes from first user report to service restoration
**Users Affected**: ~2,300 users received incorrect responses
### Summary
The AI customer support chatbot provided incorrect warranty expiration
dates for products purchased in 2023. The stale index data went
undetected for roughly a day, and approximately 2,300 users received
incorrect responses before a user report triggered the investigation.
### Error Classification
- **Primary**: Fact Error - Stale information in retrieval context
- **Secondary**: Retrieval Miss - Index coverage gap for 2023 purchases
Timeline
### Timeline (All times UTC)
10:32 - Product database updated with 2023 purchase records
10:45 - Scheduled re-indexing job started (runs every 24 hours)
11:20 - Re-indexing job failed silently due to timeout
(Subsequent analysis: timeout was 5 minutes, job needed 8)
Next Day 10:45 - New re-indexing job started
11:20 - Job failed again with the same 5-minute timeout
11:30 - User reports: "Chatbot says my warranty already expired,
        but I bought this product in 2023"
11:45 - Incident declared, investigation started
12:15 - Root cause identified: yesterday's indexing job failed
12:30 - Manual re-indexing triggered
12:35 - Service restored
Contributing Factors
### Contributing Factors
1. **Silent Indexing Failure**: The indexing job logged errors to a
file that was not monitored. No alert was configured.
2. **Confidence Without Verification**: The system returned high-
confidence answers even though context was stale. The model
generated plausible-sounding but incorrect dates.
3. **No Freshness Monitoring**: No metric tracked retrieval staleness
or document freshness in the vector index.
4. **Slow Detection**: User report was the first indication of failure.
No automated detection caught the mismatch before users saw it.
Action Items
### Action Items
| Action | Owner | Priority | Status |
|--------|-------|----------|--------|
| Add alerting for indexing job failures | Infrastructure | P0 | Done |
| Implement document freshness monitoring | Observability | P0 | Done |
| Add verification query to check context freshness | RAG Team | P1 | Done |
| Create synthetic monitoring with known answers | QA | P1 | In Progress |
| Implement automatic re-indexing trigger on source update | Infrastructure | P2 | Planned |
Practical Example: QuickShip Incident Postmortem
QuickShip conducted an incident review following a route optimization failure where the AI routing system sent twelve drivers through a closed road, causing three-hour delays for customers. The problem was that the road closure existed in the database but was not reflected in the AI's routing decisions, creating a dilemma about how to assign responsibility when multiple component failures contributed.
Analysis revealed three independent failures that aligned to cause the incident: the road closure database was updated correctly, but the indexing pipeline was delayed by six hours; the route AI queried the stale index with no freshness check, so its routes did not reflect the closure; and the AI had no confidence threshold to flag uncertain routes for human review before acting.
The fixes addressed each failure point: alerting on index freshness to catch pipeline delays, automatic human review for routes computed from stale data, and uncertainty thresholds that escalate critical routing decisions when the system lacks confidence. The lesson is that this incident had no single root cause; it emerged from independent failures that happened to align. Robust AI systems need defense-in-depth, not single-point fixes.
Error Prevention Strategies
Defense in Depth
No single guardrail prevents all failures, so build layered defenses that provide protection at multiple levels.

1. **Input validation** catches malformed requests before they enter the system, preventing bad inputs from causing downstream failures.
2. **Retrieval verification** confirms that context is present and fresh before generation relies on it.
3. **Output validation** checks responses against expected formats and ranges to catch errors before they reach users.
4. **Uncertainty thresholds** trigger escalation when the system's confidence is low, ensuring that ambiguous situations receive human attention.
5. **Human review gates** require approval for high-stakes actions, a final layer of protection for decisions that could cause significant harm.
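The layers above can be expressed as an ordered chain of checks where the first failing layer blocks the request. A sketch with hypothetical layer predicates:

```python
def run_guardrails(request, layers):
    """Run layered checks in order; return the name of the first layer
    that blocks the request, or None if every layer passes."""
    for name, check in layers:
        if not check(request):
            return name
    return None

# Hypothetical layers, ordered outermost-in
layers = [
    ("input_validation", lambda r: bool(r.get("query"))),
    ("retrieval_verification", lambda r: len(r.get("context", [])) > 0),
    ("uncertainty_threshold", lambda r: r.get("confidence", 0.0) >= 0.7),
]
```

Returning the blocking layer's name, rather than just a boolean, feeds directly into the error taxonomy: each block is a pre-labeled near-miss.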
Error Budgets for AI Quality
Define acceptable error rates for each error type. This lets you balance reliability investment against feature velocity. When error rates exceed budget, reliability work takes priority over new features.
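Once budgets are agreed, the check itself is a small amount of code. A sketch with hypothetical per-type budgets expressed as error rates:

```python
# Hypothetical budgets: maximum acceptable error rate per error type;
# unknown types default to a zero budget (any occurrence is a breach)
ERROR_BUDGETS = {
    "hallucination": 0.001,
    "format_error": 0.01,
    "retrieval_miss": 0.005,
}

def budget_breaches(observed_rates):
    """Return error types whose observed rate exceeds the agreed budget."""
    return sorted(t for t, rate in observed_rates.items()
                  if rate > ERROR_BUDGETS.get(t, 0.0))
```

A non-empty result is the signal that reliability work should take priority over new features until the rates are back under budget.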
Research Frontier
New research explores "failure mode evolution": tracking how AI failure patterns change as models are updated. By detecting shifts in the distribution of errors across the taxonomy, teams can proactively identify where new failure modes are emerging before they impact users.