"A router is the traffic cop of your AI system. Done well, it directs each request to exactly the right model. Done poorly, it either overwhelms your budget with unnecessary capability or degrades quality through underprovisioning."
A Platform Engineer Who Watches Costs Like a Hawk
Introduction
A model router decides which model handles each incoming request, and it forms the core of any capability-based allocation system. A well-designed router analyzes each request and routes it to the most appropriate model based on task requirements, cost constraints, and quality targets.
This section covers routing strategies from simple rule-based approaches to sophisticated learned routers, cost-aware routing, quality-latency trade-offs, and implementation patterns.
Routing Strategies
Routing strategies range from trivial to complex. The right strategy depends on how varied your request distribution is and how much quality variation you can tolerate.
Rule-Based Routing
The simplest routing approach uses deterministic rules based on request attributes. Common attributes include:
Request length: Short prompts (under 500 tokens) can often be handled by smaller models; longer prompts may require larger context windows or more capable models.
Task type: Classify requests by intent (extraction, summarization, generation, reasoning) and route based on task-model mappings you have empirically validated.
User tier: Free users get smaller models; paid users get larger models. This business logic is straightforward to implement but must be communicated clearly to users so expectations match the tier.
Content flags: Requests containing code, math, or specific domains may be routed to models known to excel in those areas.
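The attributes above can be combined into a small deterministic router. This is a minimal sketch: the model names, the 500-token threshold, and the rule ordering are illustrative assumptions, not a prescribed configuration.

```python
def route(prompt: str, task_type: str, user_tier: str) -> str:
    """Deterministic rule-based routing, checked in priority order."""
    tokens = len(prompt.split())  # crude token estimate

    # Content flags: code or heavy reasoning goes to a model
    # empirically validated for those tasks.
    if "```" in prompt or task_type == "reasoning":
        return "large-reasoning-model"

    # User tier: free users stay on the small model.
    if user_tier == "free":
        return "small-model"

    # Request length: short prompts go to the cheaper model.
    if tokens < 500:
        return "small-model"

    return "large-model"
```

Rule order matters: content flags are checked before length so that a short but code-heavy request still reaches the capable model.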
Classifier-Based Routing
For more nuanced routing, train a classifier to predict which model will perform best for each request. The classifier can use request features, embedding similarity to training examples, or learned representations.
Advantages: Handles edge cases better than rules, can learn non-obvious patterns, improves with more data.
Drawbacks: Requires training data (request-model-quality triplets), adds latency for classification, may behave unpredictably on out-of-distribution requests.
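One lightweight way to build such a classifier is a nearest-centroid router over simple request features, trained from the "best model" labels in your request-model-quality triplets. The sketch below uses two toy features (length and code-likeness) purely for illustration; a production router would use embeddings or richer features.

```python
from collections import defaultdict


def featurize(prompt: str) -> tuple[float, float]:
    # Toy features: scaled length, and a code-likeness flag.
    return (len(prompt.split()) / 100.0,
            float("```" in prompt or "def " in prompt))


class CentroidRouter:
    """Routes to the model whose historical best-fit requests are
    nearest in feature space."""

    def fit(self, examples):
        # examples: iterable of (prompt, best_model) pairs
        sums = defaultdict(lambda: [0.0, 0.0, 0])
        for prompt, best_model in examples:
            x, y = featurize(prompt)
            s = sums[best_model]
            s[0] += x; s[1] += y; s[2] += 1
        self.centroids = {m: (sx / n, sy / n)
                          for m, (sx, sy, n) in sums.items()}
        return self

    def route(self, prompt: str) -> str:
        x, y = featurize(prompt)
        return min(self.centroids,
                   key=lambda m: (x - self.centroids[m][0]) ** 2
                               + (y - self.centroids[m][1]) ** 2)
```

Note the out-of-distribution caveat from above: a request far from every centroid still gets routed somewhere, so pair this with monitoring.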
LLM-Based Routing
Use an LLM itself to classify requests and recommend routing. A small, fast model can analyze the request and output routing metadata. This approach is flexible and handles diverse request distributions well.
Cost-Aware Routing
Cost-aware routing explicitly considers the cost of each model when making routing decisions. The goal is to minimize cost while maintaining quality above a threshold.
Cost-Performance Ratios
Calculate the cost-performance ratio for each model on each task. This ratio tells you how much quality you get per dollar spent.
A model with higher absolute quality but worse cost-performance may not be the right choice unless quality differences are meaningful for your use case.
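The ratio is straightforward to compute once you have per-model quality scores. The prices and scores below are invented placeholders to show the calculation, not real model pricing.

```python
# Quality per dollar for each model on a given task (illustrative numbers).
models = {
    "small-model": {"quality": 0.82, "cost_per_1k_tokens": 0.0005},
    "mid-model":   {"quality": 0.90, "cost_per_1k_tokens": 0.003},
    "large-model": {"quality": 0.95, "cost_per_1k_tokens": 0.03},
}


def quality_per_dollar(name: str) -> float:
    spec = models[name]
    return spec["quality"] / spec["cost_per_1k_tokens"]


# Rank models by cost-performance, best first.
ranked = sorted(models, key=quality_per_dollar, reverse=True)
```

With these numbers the large model wins on absolute quality yet ranks last on quality per dollar, which is exactly the tension the text describes.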
Minimum cost routing selects the cheapest model that meets a hard quality floor. This strategy works when you have a clear quality threshold that must not be crossed, and you are willing to accept whatever quality the minimum-cost model delivers above that floor. The trade-off is that this approach may sacrifice quality in pursuit of savings, so it is only appropriate when quality differences between models are acceptable for your use case.
Maximum quality routing always selects the most capable model regardless of cost. This strategy makes sense when quality is the only thing that matters and budget is not a constraint. The trade-off is that it ignores cost differences entirely, potentially spending significantly more than necessary when a cheaper model would have delivered adequate results.
Quality-adjusted cost routing optimizes for the best quality per dollar spent. This requires measuring quality systematically to compute cost-performance ratios, then routing to the model that maximizes quality while controlling cost. The trade-off is that it requires establishing robust quality measurement before it can work effectively.
Budget-constrained routing operates within a fixed AI budget, selecting models that deliver the best quality possible within that spending limit. This approach is common for products with predictable AI costs built into their unit economics. The trade-off is that the fixed budget may not cover all quality needs, requiring careful priority setting or acceptance of quality compromises.
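The four strategies above can be expressed as one selection function over a model table: minimum cost and maximum quality are simple arg-min/arg-max choices, quality-adjusted cost maximizes quality per dollar, and the budget constraint filters candidates first. The model table and numbers are hypothetical.

```python
MODELS = {
    # name: (quality score, cost per request) -- illustrative numbers
    "small": (0.80, 0.001),
    "mid":   (0.90, 0.010),
    "large": (0.97, 0.100),
}


def pick(strategy: str, quality_floor: float = 0.0,
         budget: float = float("inf")) -> str:
    q = lambda m: MODELS[m][0]
    c = lambda m: MODELS[m][1]
    # Budget-constrained routing: filter to models within the limit.
    candidates = [m for m in MODELS
                  if q(m) >= quality_floor and c(m) <= budget]
    if not candidates:
        raise ValueError("no model satisfies the constraints")
    if strategy == "min_cost":
        return min(candidates, key=c)
    if strategy == "max_quality":
        return max(candidates, key=q)
    if strategy == "quality_per_dollar":
        return max(candidates, key=lambda m: q(m) / c(m))
    raise ValueError(f"unknown strategy: {strategy}")
```

For example, `pick("min_cost", quality_floor=0.85)` selects the cheapest model above the floor, while adding `budget=0.05` to `max_quality` turns it into budget-constrained routing.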
Bidirectional Routing
For requests where quality is uncertain, use bidirectional routing: try a smaller model first, evaluate the output, and escalate to a larger model only if quality is insufficient. This approach works well for tasks where you can cheaply evaluate outputs (e.g., structural validation, format checking).
The Escalation Pattern
Bidirectional routing implements an escalation pattern: start with the cheapest model, check whether the output meets the quality bar, and escalate if it does not. This pattern is especially effective for extraction, classification, and transformation tasks where you can cheaply validate outputs. For generation tasks, escalation is harder because quality assessment requires human judgment.
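The escalation loop itself is small. This sketch keeps the model call and the validator as injected functions so it stays provider-agnostic; `call_model` and `validate` are placeholders you supply.

```python
def escalate(request, tiers, call_model, validate):
    """Try models cheapest-first; return the first output that passes
    the cheap validation check. If nothing passes, return the output
    of the last (most capable) tier."""
    output = None
    for model in tiers:  # tiers ordered cheapest -> most expensive
        output = call_model(model, request)
        if validate(output):
            return model, output
    return tiers[-1], output
```

The cost profile follows directly: requests the small model handles cost one cheap call, and only failures pay for both calls.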
Quality-Latency Trade-offs
Different routing strategies affect both quality and latency. Understanding these trade-offs helps you design routers that meet your specific requirements.
Fast-Path Architecture
Design your router to have a fast path for common cases. Most request distributions are heavy-tailed: a small number of task types dominate. Identify these common cases and hard-code fast routing for them.
For the long tail of requests, use more sophisticated (but slower) classification. This approach keeps average latency low while handling diversity.
Cascaded Routing
Cascaded routing tries models in sequence from cheapest to most expensive, stopping when an output meets the quality threshold. This approach minimizes cost but adds latency when escalation is needed.
The key to cascaded routing is fast quality assessment. Build cheap evaluation functions that can quickly determine if an output is sufficient. Without fast assessment, cascaded routing introduces unacceptable latency.
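A cheap evaluation function for structured tasks can be as simple as a parse-and-check: the output must be valid JSON and contain the expected fields. The required keys here are a hypothetical example for an extraction task.

```python
import json


def looks_valid(output: str, required_keys=("name", "date")) -> bool:
    """Cheap structural check: parses as JSON and contains the
    required fields. Runs in microseconds, so the cascade's only
    real overhead is the extra model call when escalation happens."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return all(key in data for key in required_keys)
```

Checks like this are what make cascading viable for extraction and transformation, and what makes it hard for open-ended generation, where no fast structural test exists.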
Parallel Routing
For latency-critical applications, route to multiple models simultaneously and return the first acceptable response. This approach minimizes latency but wastes compute on rejected outputs.
Parallel routing works well when the cost of waiting outweighs the cost of wasted compute, and when you can afford to sometimes pay for multiple models.
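A minimal parallel router can fan requests out with a thread pool and take the first acceptable response. As in the escalation sketch, `call_model` and `acceptable` are injected placeholders; note that abandoned calls are typically still billed.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def race(request, models, call_model, acceptable):
    """Fan out to all models simultaneously; return the first
    acceptable (model, output) pair, or (None, None) if none pass."""
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = {pool.submit(call_model, m, request): m for m in models}
        for fut in as_completed(futures):
            output = fut.result()
            if acceptable(output):
                return futures[fut], output
    return None, None
```

This is the compute-for-latency trade described above: you pay for every in-flight call but wait only for the fastest acceptable one.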
Implementation Patterns
Implementing a router involves decisions about where routing happens, how routing state is managed, and how routing rules are updated.
Routing in API Gateway
The simplest implementation is routing at the API gateway layer. Incoming requests are enriched with routing metadata before being forwarded to the appropriate model endpoint.
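At the gateway layer, "enrichment" usually means attaching a routing decision and its rationale as metadata before forwarding. The header names and endpoint URLs in this sketch are hypothetical conventions, not a standard.

```python
# Hypothetical internal model endpoints.
ENDPOINTS = {
    "small-model": "http://models.internal/small",
    "large-model": "http://models.internal/large",
}


def enrich(request: dict) -> dict:
    """Attach routing metadata at the gateway before forwarding."""
    model = ("small-model"
             if len(request["prompt"].split()) < 500
             else "large-model")
    request["headers"] = {
        "X-Route-Model": model,       # which model was chosen
        "X-Route-Reason": "length-rule",  # why, for later auditing
    }
    request["upstream"] = ENDPOINTS[model]
    return request
```

Recording the reason alongside the decision pays off later: it is exactly the audit trail you need when monitoring routing quality.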
Learning to Route
For more sophisticated routing, collect data on request-model-quality triplets and train a learned model. The training objective is to predict which model will achieve the best quality-cost ratio for each request.
Start with a rule-based router, collect data, then gradually shift to learned routing as your data grows. This approach reduces risk during the learning phase.
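Collecting the triplets can start as simply as appending JSON lines from the rule-based router's live traffic; this log later becomes the training set for the learned router. The record schema here is an assumption.

```python
import json
import time


def log_routing_outcome(path, request_id, features, model, quality, cost):
    """Append one request-model-quality triplet as a JSON line."""
    record = {
        "ts": time.time(),
        "request_id": request_id,
        "features": features,   # whatever the router saw
        "model": model,         # which model handled the request
        "quality": quality,     # measured quality score
        "cost": cost,           # dollars spent on this request
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

Because the rule-based router is already serving traffic, this data collection is free, which is what makes the gradual shift to learned routing low-risk.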
Who: DataForge, a legal tech startup building a document analysis platform
Situation: DataForge processes thousands of legal documents daily, ranging from simple contracts to complex regulatory filings
Problem: Routing all documents to GPT-4 for analysis cost $50,000/month, unsustainable for the business model
Solution: The team implemented a three-tier router:
Tier 1 (Fast path): Simple contracts with standard structure are routed to a fine-tuned Llama model (cost: $0.001/page).
Tier 2 (Medium path): Complex contracts with non-standard clauses are routed to Claude 3 Haiku (cost: $0.01/page).
Tier 3 (Slow path): Regulatory filings requiring nuanced interpretation are routed to GPT-4 (cost: $0.10/page).
Result: Average cost per document dropped from $0.08 to $0.015, an 81% reduction. Quality monitoring showed 97% of outputs met acceptance thresholds, with the 3% failures concentrated in novel document types now flagged for human review.
Monitoring Router Performance
A router is only as good as its feedback loop. Monitor routing decisions to identify systematic quality issues, cost overruns, and opportunities for optimization.
Track these metrics per route: quality scores, latency percentiles, cost per request, and escalation rates. A route with high escalation rates may indicate the quality threshold is set too low or the smaller model is underperforming for that task type.
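The per-route metrics above can be tracked with simple counters. This in-memory sketch is for illustration; a real deployment would export these to a metrics backend rather than hold them in process memory.

```python
from collections import defaultdict


class RouteMetrics:
    """Per-route counters: request count, mean quality, total cost,
    and escalation rate."""

    def __init__(self):
        self.stats = defaultdict(lambda: {"n": 0, "quality": 0.0,
                                          "cost": 0.0, "escalations": 0})

    def record(self, route, quality, cost, escalated=False):
        s = self.stats[route]
        s["n"] += 1
        s["quality"] += quality
        s["cost"] += cost
        s["escalations"] += int(escalated)

    def escalation_rate(self, route):
        s = self.stats[route]
        return s["escalations"] / s["n"] if s["n"] else 0.0

    def mean_quality(self, route):
        s = self.stats[route]
        return s["quality"] / s["n"] if s["n"] else 0.0
```

A rising escalation rate on one route is the signal the text describes: either the threshold is miscalibrated or the small model is underperforming for that task type.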
Watch for Router Degradation
Model capabilities change over time. A model that was adequate for a task six months ago may be inadequate today as task complexity increases or as the model is updated. Re-evaluate routing rules when model versions change or when you observe quality drift.
Cross-References
For architecture patterns that integrate with routing, see Chapter 15.1 Architecture Spectrum. For evaluation frameworks to measure routing quality, see Chapter 24 Evaluation. For cost monitoring and optimization, see Section 16.5 Latency, Cost, and Quality Trade-offs.
Section Summary
Model routers direct requests to appropriate models based on task requirements, cost constraints, and quality targets. Rule-based routing works well for simple distributions; classifier-based routing handles more complexity. Cost-aware routing explicitly optimizes quality per dollar. Cascaded routing minimizes cost through escalation; parallel routing minimizes latency through simultaneous requests. Router performance must be monitored continuously, and routing rules should evolve as models and tasks change.