"Not all evaluations are created equal. A task eval tells you if the dog caught the car. A system eval tells you if the car still works afterward."
A Systems Engineer With Parking Lot Trauma
A Taxonomy of Evaluation Types
Evaluation types serve different purposes and measure different things. Understanding this taxonomy helps you build comprehensive evaluation coverage for your AI product.
The Five Core Eval Types
Every AI product needs all five types of evaluation. They answer different questions and catch different failure modes. Skipping any one of them leaves a gap in your quality assurance.
Task Evals: Does It Complete the Task Correctly?
Task evaluations measure whether the AI system accomplishes a specific task correctly. The question is simple: given this input, does the output meet the success criteria?
Characteristics of Task Evals
Task evals are characterized by the presence of ground truth, meaning there exists a known correct answer or set of acceptable answers against which outputs can be measured. The test inputs in task evals are structured and can be specified precisely, allowing for consistent and repeatable testing. Additionally, the outputs are measurable, enabling success to be determined automatically without subjective judgment.
Task Eval Examples
Task evals appear in many forms across different applications. Classification accuracy evals measure whether a model correctly classifies emails as spam or not spam. Extraction precision evals assess whether a system can extract structured information such as the date, time, and location from a calendar invitation. Translation quality evals compare machine translations against human reference translations to measure closeness. Code generation evals verify whether generated code passes predefined unit tests.
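The classification example above can be sketched as a minimal harness: structured inputs paired with ground-truth labels, and success computed automatically. The `classify_email` function here is a hypothetical stand-in for the model under test, not a real system.

```python
def classify_email(text: str) -> str:
    # Toy stand-in: a real eval would call the model being tested.
    return "spam" if "free money" in text.lower() else "not_spam"

def run_task_eval(cases):
    """Return accuracy over (input, expected_label) pairs."""
    correct = sum(1 for text, expected in cases
                  if classify_email(text) == expected)
    return correct / len(cases)

cases = [
    ("Claim your FREE MONEY now!!!", "spam"),
    ("Agenda for Tuesday's standup", "not_spam"),
]
accuracy = run_task_eval(cases)
```

Note that everything characteristic of a task eval is present: the test inputs are enumerated, the ground truth is explicit, and the score requires no human judgment.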
Task Eval Design Principle
Task evals work when you can enumerate the test cases and define success criteria for each. When ground truth is unclear or when the task is subjective, you need other eval types.
System Evals: Does the Whole System Work Together?
System evaluations measure whether the components of your AI system work together correctly. A system might have task-level evals passing for each component while failing at the system level.
Characteristics of System Evals
System evals focus on component interaction, where multiple components must coordinate correctly to produce the desired outcome. They measure end-to-end behavior, treating the whole pipeline as the unit of evaluation rather than individual parts in isolation. System evals also capture emergent properties, which are behaviors that arise from the interaction of components and cannot be predicted by examining any single component alone.
System Eval Examples
System evals manifest in several common scenarios. RAG recall evals measure whether a retrieval-plus-generation pipeline answers questions correctly, which requires both retrieving the relevant documents and generating accurate responses from them. Agent task completion evals assess whether an agent can use tools correctly to accomplish multi-step goals that require coordinating multiple operations. Conversational coherence evals evaluate whether a dialogue system maintains context across multiple turns, ensuring natural and consistent conversation flow.
Common System Eval Failure
Components that pass task evals in isolation can fail system evals when combined. The retrieval system might return relevant documents individually, but fail to return the right combination when the query requires synthesizing across multiple documents.
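That multi-document failure mode can be captured with an end-to-end check: the eval passes only if the retriever surfaces every document the answer needs in a single pass. The corpus and the toy keyword `retrieve` function below are illustrative assumptions standing in for a real retriever.

```python
import re

DOCS = {
    "d1": "Our refund window is 30 days from delivery.",
    "d2": "Delivery takes 5 business days within the EU.",
    "d3": "Gift cards are non-refundable.",
}

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, k=2):
    # Rank documents by token overlap with the query (toy scoring;
    # a real system would use embeddings or a search index).
    overlap = lambda d: len(tokens(query) & tokens(DOCS[d]))
    return sorted(DOCS, key=overlap, reverse=True)[:k]

def system_eval(query, required_docs):
    # Pass only if every document needed to synthesize the
    # answer comes back together in one retrieval.
    return required_docs <= set(retrieve(query, k=len(required_docs)))
```

Each document here might score well on an isolated relevance eval; the system eval is what detects whether the right *combination* comes back for a synthesis query.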
Agent Evals: Does the Agent Behave Correctly Over Time?
Agent evaluations measure whether an AI agent behaves correctly over extended interactions. This includes task completion, but also resource management, error recovery, and appropriate tool use.
Characteristics of Agent Evals
Agent evals capture long-horizon behavior, recognizing that an agent's actions have consequences that accumulate over time rather than producing immediate isolated results. They evaluate stateful interactions, where the agent maintains and updates state across multiple steps in a process. Agent evals also examine tool use patterns, assessing whether the agent selects and sequences tools correctly to accomplish its goals.
Agent Eval Metrics
Agent evals track several key metrics. Task completion rate measures what fraction of tasks the agent completes successfully. Step efficiency evaluates whether the agent takes a reasonable number of steps to accomplish goals, avoiding unnecessary complexity. Error recovery assesses whether the agent can recognize when something has gone wrong and take appropriate corrective action. Resource usage measures how efficiently the agent uses API calls, tokens, and time to accomplish its objectives.
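The four metrics above can be computed from episode logs. The log schema here (dicts with `completed`, `steps`, `errors`, `recovered`, and `tokens` fields) is an assumption for illustration, not a standard format.

```python
def agent_metrics(episodes, step_budget=10):
    """Aggregate agent-eval metrics over a list of episode logs."""
    n = len(episodes)
    done = [e for e in episodes if e["completed"]]
    errors = sum(e["errors"] for e in episodes)
    return {
        # Fraction of tasks completed successfully.
        "completion_rate": len(done) / n,
        # Fraction of completed tasks finished within the step budget.
        "step_efficiency": sum(e["steps"] <= step_budget for e in done) / max(len(done), 1),
        # Fraction of encountered errors the agent recovered from.
        "error_recovery": sum(e["recovered"] for e in episodes) / max(errors, 1),
        # Average token cost per completed task.
        "tokens_per_completion": sum(e["tokens"] for e in done) / max(len(done), 1),
    }

episodes = [
    {"completed": True,  "steps": 6,  "errors": 1, "recovered": 1, "tokens": 1200},
    {"completed": True,  "steps": 14, "errors": 0, "recovered": 0, "tokens": 3100},
    {"completed": False, "steps": 20, "errors": 2, "recovered": 0, "tokens": 2500},
]
m = agent_metrics(episodes)
```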
Integration Evals: Does It Fit the Ecosystem?
Integration evaluations verify that your AI system works correctly with external systems, APIs, data sources, and user interfaces. The question is whether the AI system fits into the broader ecosystem.
Integration Eval Dimensions
Integration evals examine multiple dimensions of compatibility with the broader ecosystem. API compatibility measures whether the system handles API changes gracefully without breaking existing functionality. Data pipeline integrity verifies that the system produces data in the expected format for downstream consumers. UI/UX interaction evaluates whether AI outputs render correctly in the interface and provide a good user experience. Latency requirements ensure that the system responds within acceptable time bounds to maintain a smooth user experience.
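Two of these dimensions, data pipeline integrity and latency requirements, can be checked together in one sketch. The expected schema, the two-second budget, and the `fake_system` stub are illustrative assumptions.

```python
import time

EXPECTED_FIELDS = {"answer": str, "sources": list, "confidence": float}
LATENCY_BUDGET_S = 2.0

def check_integration(call_system) -> list:
    """Return a list of failure strings; empty means the check passed."""
    failures = []
    start = time.monotonic()
    output = call_system("What is the refund policy?")
    elapsed = time.monotonic() - start
    # Latency requirement: respond within the time budget.
    if elapsed > LATENCY_BUDGET_S:
        failures.append(f"latency {elapsed:.2f}s exceeds {LATENCY_BUDGET_S}s budget")
    # Data pipeline integrity: output matches what downstream consumers expect.
    for field, typ in EXPECTED_FIELDS.items():
        if not isinstance(output.get(field), typ):
            failures.append(f"field '{field}' missing or not {typ.__name__}")
    return failures

# Stub standing in for the deployed system under test.
def fake_system(query):
    return {"answer": "30 days", "sources": ["d1"], "confidence": 0.9}
```

Returning a list of failures rather than a single boolean keeps the check actionable: when it fails, it says exactly which contract was violated.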
Integration Eval Timing
Run integration evals before deployment, not after. Catching a UI rendering issue or an API compatibility problem in testing is far less expensive than catching it in production.
Regression Evals: Did We Break Something?
Regression evaluations detect when changes to the system cause previously working behavior to degrade. They are the safety net that enables rapid iteration.
Regression Eval Design
Effective regression evals are designed to be fast, deterministic, comprehensive, and actionable. They must run quickly since they execute on every change, providing immediate feedback. They are deterministic, meaning the same change always produces the same result, enabling reliable detection of regressions. They are comprehensive, covering the behaviors that matter most to users and the business. When they fail, they are actionable, making it clear what broke and why so developers can address the issue promptly.
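Those four properties can be sketched as a pinned golden-output suite: fixed inputs with exact expected outputs (deterministic), cheap comparisons (fast), and failure messages that say which case broke and how (actionable). The `summarize_ticket` routing function and the golden cases are hypothetical stand-ins for a real system entry point.

```python
GOLDEN = {
    "Order #123 arrived broken, want replacement": "replacement_request",
    "How do I reset my password?": "account_help",
}

def summarize_ticket(text: str) -> str:
    # Stand-in routing logic for the real system under test.
    return "replacement_request" if "replacement" in text else "account_help"

def run_regression(cases) -> list:
    # Report exactly which case broke and how; empty list means no regression.
    return [f"{text!r}: expected {want}, got {got}"
            for text, want in cases.items()
            if (got := summarize_ticket(text)) != want]
```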
Setting Regression Thresholds
Regression thresholds define how much degradation is acceptable. Set them conservatively for critical behaviors and more leniently for edge cases.
Regression Threshold Strategy
For each behavior in your regression suite, begin by defining the baseline through measuring the behavior on your current production system. Then set the threshold, which is the maximum acceptable regression percentage. The specific thresholds depend on the criticality of the behavior: critical behaviors typically allow zero to five percent regression, important behaviors allow five to ten percent, and minor behaviors can tolerate ten to twenty percent. Set alerts to warn at fifty percent of the threshold and fail at the full threshold. Finally, monitor trends over time, recognizing that even small regressions accumulating over time can become significant.
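The threshold strategy above can be sketched directly: compare the current metric against the baseline, warn at half the allowed regression, and fail at the full threshold. The tier percentages mirror the ranges in the text (the stricter end of each range is used here for illustration).

```python
# Maximum acceptable relative regression per behavior tier.
THRESHOLDS = {"critical": 0.05, "important": 0.10, "minor": 0.20}

def check_regression(baseline: float, current: float, tier: str) -> str:
    """Classify a metric change as ok, warn, or fail against its tier."""
    allowed = THRESHOLDS[tier]
    drop = (baseline - current) / baseline  # relative regression
    if drop >= allowed:
        return "fail"          # regression exceeds the full threshold
    if drop >= allowed / 2:
        return "warn"          # alert at 50% of the threshold
    return "ok"
```

Logging the `drop` value on every run, not just on failures, is what makes the trend monitoring in the last step possible: slow accumulation shows up long before any single change trips the threshold.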
Choosing the Right Eval Type
Use this decision framework to determine which eval type to use based on the question you need to answer. If you are asking whether a component produces correct output, use a task eval. If you need to know whether components work together correctly, use a system eval. When evaluating whether an agent behaves well over extended interactions, use an agent eval. If you need to determine whether the system fits into the broader ecosystem including APIs and user interfaces, use an integration eval. Finally, if you want to know whether your changes broke existing behavior, use a regression eval.
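The decision framework above reduces to a lookup from the question you are asking to the eval type that answers it; the question phrasings here are paraphrases for illustration.

```python
# Map the question you need answered to the eval type that answers it.
EVAL_FOR_QUESTION = {
    "does this component produce correct output": "task eval",
    "do components work together correctly": "system eval",
    "does the agent behave well over extended interactions": "agent eval",
    "does the system fit the broader ecosystem": "integration eval",
    "did my change break existing behavior": "regression eval",
}
```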
Layering Your Evals
The most robust evaluation strategy layers all five eval types. Task evals catch component failures. System evals catch component-interaction failures. Agent evals catch long-horizon problems. Integration evals catch ecosystem issues. Regression evals catch anything a change breaks.