Ch 16: Models, Routing, and Capability Allocation

Processing a customer refund requires different capabilities than diagnosing a rare disease. One needs speed and cost efficiency; the other needs precision and reliability. Yet most AI systems route both requests through the same model, paying premium costs for the simple task while potentially underserving the complex one. Model routing, positioned in the Measure-Architect loop, is one of the highest-leverage optimizations in AI engineering: intelligent routing can cut costs by 60% while actually improving output quality.

The Tripartite Loop in Model Selection, Routing, and Fine-Tuning

Model selection and routing decisions activate all three disciplines: AI PM defines performance requirements, cost constraints, and quality thresholds that guide model choice; Vibe-Coding tests different models and routing strategies against real workloads to find the best performance per dollar; AI Engineering implements the selection logic, fallback chains, and monitoring that ensure the right model serves each request.
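To make the idea of a fallback chain concrete, here is a minimal sketch in Python. Everything named in it (`ModelEndpoint`, `ModelError`, `call_with_fallback`, and the two stub clients) is a hypothetical placeholder rather than any provider's API. It assumes a failed call raises an exception and that models are tried in a fixed order of preference.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


class ModelError(Exception):
    """Raised by a (hypothetical) model client when a call fails."""


@dataclass(frozen=True)
class ModelEndpoint:
    name: str                   # illustrative label, not a real model ID
    call: Callable[[str], str]  # takes a prompt, returns text or raises ModelError


def call_with_fallback(
    prompt: str, chain: Sequence[ModelEndpoint]
) -> tuple[str, str, list[str]]:
    """Try each endpoint in order; return (model_name, output, failed_names)."""
    failed: list[str] = []
    for endpoint in chain:
        try:
            return endpoint.name, endpoint.call(prompt), failed
        except ModelError:
            # Record the failure so monitoring can see it, then try the next model.
            failed.append(endpoint.name)
    raise ModelError(f"all models failed: {failed}")


# Stub clients standing in for real provider calls (assumptions for illustration).
def flaky_small(prompt: str) -> str:
    raise ModelError("timeout")


def reliable_large(prompt: str) -> str:
    return f"answer to: {prompt}"


chain = [
    ModelEndpoint("small-stub", flaky_small),
    ModelEndpoint("large-stub", reliable_large),
]
used, output, failed = call_with_fallback("refund order 123", chain)
print(used, failed)
```

Returning the list of failed models alongside the answer is one simple way to feed the monitoring side of the loop: a rising fallback rate for a given model is a signal worth alerting on.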

Figure: Chapter 16 opener illustration. Model selection and routing determine which AI capabilities serve which requests.

Vibe-Coding in Model Behavior Testing

Vibe-coding lets you test model behavior across providers and versions before you build routing infrastructure. You can quickly compare how different models handle your specific task types, edge cases, and failure modes. Running the same prompts against several model variants reveals which models genuinely excel at your use cases and which merely seem adequate in theory, so routing decisions rest on data rather than guesswork.

PM Decision Points in Model Routing

Model routing raises several PM decisions: Which tasks require premium model quality, and which can trade quality for cost efficiency? How do routing failures affect user experience? What latency guarantees are acceptable for each task type? PMs should define quality thresholds for each task criticality level and set budgets that align with user value. The 60% cost reduction from intelligent routing materializes only if product requirements clearly specify where quality can be sacrificed for speed and cost.
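One way to picture how PM-defined thresholds become routing rules is the sketch below. The tier names, threshold values, model labels, costs, and quality scores are all invented for illustration. In practice they would come from product requirements and from measured evaluation results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    name: str                # illustrative label, not a real model ID
    cost_per_request: float  # assumed dollars per request
    quality: float           # assumed eval score in [0, 1]
    p95_latency_ms: int      # assumed 95th-percentile latency


@dataclass(frozen=True)
class TierRequirement:
    min_quality: float
    max_latency_ms: int


# Hypothetical requirements a PM might set per task criticality level.
REQUIREMENTS = {
    "low": TierRequirement(min_quality=0.80, max_latency_ms=1000),
    "high": TierRequirement(min_quality=0.95, max_latency_ms=5000),
}

# Hypothetical catalog of candidate models with made-up measurements.
CATALOG = [
    ModelProfile("small-stub", 0.001, 0.85, 400),
    ModelProfile("medium-stub", 0.005, 0.91, 900),
    ModelProfile("large-stub", 0.030, 0.97, 3000),
]


def pick_model(tier: str) -> ModelProfile:
    """Return the cheapest model that meets the tier's quality and latency bounds."""
    req = REQUIREMENTS[tier]
    eligible = [
        m for m in CATALOG
        if m.quality >= req.min_quality and m.p95_latency_ms <= req.max_latency_ms
    ]
    if not eligible:
        raise LookupError(f"no model satisfies the {tier!r} tier requirements")
    return min(eligible, key=lambda m: m.cost_per_request)


print(pick_model("low").name)
print(pick_model("high").name)
```

Keeping the requirements in a separate table from the selection logic means a PM can tighten or relax a threshold without anyone touching the router itself.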

Objective: Master model selection, routing strategies, and capability-based allocation.

Chapter Overview

This chapter covers the engineering decisions that determine how models are selected, routed, and allocated to tasks. Model selection involves understanding open versus closed models, size versus capability trade-offs, and task-model matching. Model routers direct requests to appropriate models based on task requirements, cost constraints, and quality targets. Ensembles and specialization combine multiple models for better results than any single model. Structured outputs and tool compatibility enable reliable integration with external systems. The chapter concludes with latency, cost, and quality trade-offs that guide optimization priorities.

Four Questions This Chapter Answers

What are we trying to learn? How to match model capabilities to task requirements while optimizing for latency, cost, and quality trade-offs specific to our product.
What is the fastest prototype that could teach it? A routing experiment sending the same requests to different models and comparing results, cost, and latency.
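What would count as success or failure? Success is a routing strategy that consistently sends requests to the cheapest model that can handle them adequately. Failure is a strategy that pays premium costs for simple requests or underserves complex ones.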
What engineering consequence follows from the result? Model routing is a high-leverage optimization; intelligent routing can dramatically reduce costs while maintaining quality.
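The routing experiment described in the second question can be prototyped in a few lines. The sketch below uses stub functions in place of real model calls, a made-up per-request cost for each, and exact-match grading on a toy arithmetic task. All of these are assumptions chosen so the example runs on its own.

```python
import time
from statistics import mean
from typing import Callable

# Toy evaluation set: (prompt, expected answer). Purely illustrative.
CASES = [("2+2", "4"), ("10*10", "100"), ("17*23", "391")]


def small_stub(prompt: str) -> str:
    # Pretend the small model only gets the easy prompts right.
    return {"2+2": "4", "10*10": "100"}.get(prompt, "unknown")


def large_stub(prompt: str) -> str:
    # Pretend the large model gets every prompt right.
    return {"2+2": "4", "10*10": "100", "17*23": "391"}[prompt]


# Each entry: (callable, assumed cost per request in dollars).
MODELS: dict[str, tuple[Callable[[str], str], float]] = {
    "small-stub": (small_stub, 0.001),
    "large-stub": (large_stub, 0.030),
}


def run_experiment(models, cases):
    """Send every case to every model; report accuracy, cost, and latency."""
    report = {}
    for name, (call, cost) in models.items():
        correct = 0
        latencies = []
        for prompt, expected in cases:
            start = time.perf_counter()
            output = call(prompt)
            latencies.append(time.perf_counter() - start)
            correct += output == expected  # exact-match grading
        report[name] = {
            "accuracy": correct / len(cases),
            "total_cost": cost * len(cases),
            "mean_latency_s": mean(latencies),
        }
    return report


report = run_experiment(MODELS, CASES)
for name, stats in report.items():
    print(name, stats)
```

With real models, the stubs would be replaced by provider calls and the exact-match grader by whatever quality measure the product uses. The shape of the report (quality, cost, and latency per model on identical inputs) is the part that carries over.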

Learning Objectives

Select appropriate models for specific tasks and constraints
Design intelligent routing strategies for multi-model systems
Build ensembles and specialized models for improved quality
Implement structured outputs and tool calling reliably
Optimize for latency, cost, and quality trade-offs

Sections in This Chapter

16.1 Model Selection Fundamentals
16.2 Model Routers
16.3 Ensembles and Specialization
16.4 Structured Outputs and Tool Compatibility
16.5 Latency, Cost, and Quality Trade-offs