Part V: Evaluation, Reliability, and Governance
Chapter 24

Cost-Aware Routing

"Not every AI request needs GPT-4. Cost-aware routing matches request complexity to model capability, delivering the right quality at the right cost."

An ML Platform Engineer

The Model Routing Problem

Different requests have different complexity and quality requirements. A simple classification task may not need a frontier model's capabilities, while a complex reasoning task cannot be handled by a smaller model. Model routing solves the problem of matching requests to the most cost-effective model that can handle them.

The naive approach is to use the most capable model for everything. This works but is expensive. The smart approach is to route each request to the model that best fits its complexity.

The Routing Sweet Spot

Cost-aware routing finds the balance between quality and cost. Route too aggressively to cheap models and quality suffers. Route too conservatively and costs spiral. The right routing strategy depends on your quality requirements and cost constraints.
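The tradeoff is easy to quantify. A minimal sketch of the blended-cost curve, using illustrative per-request prices rather than real pricing:


```python
def blended_cost_per_1k_requests(cheap_fraction: float,
                                 cheap_price: float = 0.0002,
                                 strong_price: float = 0.02) -> float:
    """Expected cost of 1,000 requests when `cheap_fraction` of traffic
    goes to the cheap model. Prices are illustrative assumptions."""
    per_request = (cheap_fraction * cheap_price
                   + (1 - cheap_fraction) * strong_price)
    return per_request * 1000

# Routing 75% of traffic to the cheap model cuts spend roughly 4x:
# blended_cost_per_1k_requests(0.0)  -> about 20.00 per 1k requests
# blended_cost_per_1k_requests(0.75) -> about 5.15 per 1k requests
```

Because the strong model dominates the blended price, the first requests moved to the cheap model deliver most of the savings; chasing the last few percent is where quality risk concentrates.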

Routing Strategies

Rule-Based Routing

Use explicit rules to classify requests and select models:


class RuleBasedRouter:
    """
    Route requests based on explicit rules.
    Fast to implement, easy to audit.
    """
    def __init__(self):
        # Rules are checked in order and the first match wins, so
        # high-risk overrides must come before cheap-model rules.
        self.rules = [
            {
                "condition": lambda r: r.domain in ["medical", "legal"],
                "model": "gpt-4o",
                "confidence_boost": 0.0,
                "reason": "High-risk domain requires frontier model"
            },
            {
                "condition": lambda r: r.task_type == "classification",
                "model": "gpt-4o-mini",
                "confidence_boost": 0.1,
                "reason": "Classification handled by small model"
            },
            {
                "condition": lambda r: r.task_type == "extraction" and r.field_count < 5,
                "model": "gpt-4o-mini",
                "confidence_boost": 0.05,
                "reason": "Small extraction handled by small model"
            },
            {
                "condition": lambda r: r.task_type == "reasoning",
                "model": "gpt-4o",
                "confidence_boost": 0.0,
                "reason": "Reasoning requires frontier model"
            }
        ]
    
    def route(self, request: Request) -> RoutingDecision:
        for rule in self.rules:
            if rule["condition"](request):
                return RoutingDecision(
                    model=rule["model"],
                    confidence_adjustment=rule["confidence_boost"],
                    routing_reason=rule["reason"]
                )
        
        # Default fallback: unmatched requests go to the strong model
        return RoutingDecision(
            model="gpt-4o",
            confidence_adjustment=0.0,
            routing_reason="No matching rule, using default"
        )

ML-Based Routing

Train a classifier to predict which model will perform best:


class MLRouter:
    """
    Route requests using a trained classifier.
    Better at handling complex, nuanced routing decisions.
    """
    def __init__(self, model_path: str):
        self.classifier = load_model(model_path)
        # Class indices from the classifier must align with this ordering
        self.models = ["gpt-4o-mini", "gpt-4o"]
    
    def predict_model(self, request: Request) -> tuple[str, float]:
        """
        Predict which model will perform best.
        Returns (model, confidence_score).
        """
        features = self._extract_features(request)
        # Assumes the classifier is a pipeline (e.g., a DictVectorizer
        # followed by an estimator) that accepts a list of feature dicts.
        probabilities = self.classifier.predict_proba([features])[0]
        prediction = int(probabilities.argmax())
        confidence = float(probabilities[prediction])
        
        return self.models[prediction], confidence
    
    def _extract_features(self, request: Request) -> dict:
        return {
            "task_complexity": estimate_complexity(request),
            "prompt_length": len(request.prompt),  # characters, a cheap proxy for tokens
            "has_reasoning": contains_reasoning(request.prompt),
            "domain_risk": domain_risk_score(request.domain),
            "output_length_expected": estimate_output_length(request),
            "conversation_turns": len(request.history),
        }
    
    def route(self, request: Request) -> RoutingDecision:
        model, confidence = self.predict_model(request)
        
        return RoutingDecision(
            model=model,
            confidence=confidence,
            routing_reason=f"ML classifier: {model} at {confidence:.2f} confidence"
        )

Bandit-Based Routing

Treat routing as a multi-armed bandit problem and learn from outcomes:


import random


class BanditRouter:
    """
    Route using an epsilon-greedy bandit that learns from outcomes.
    Balances exploration of new routing choices with exploitation of
    known good ones. (The context argument is accepted so the interface
    can later become contextual, but it is unused in this version.)
    """
    def __init__(self, models: list[str], epsilon: float = 0.1):
        self.models = models
        self.epsilon = epsilon  # Exploration rate
        self.rewards = {m: [] for m in models}  # Rolling reward history
    
    def select_model(self, context: dict) -> str:
        # Exploration: try a random model
        if random.random() < self.epsilon:
            return random.choice(self.models)
        
        # Exploitation: choose the model with the highest expected reward
        expected_rewards = {
            m: self._expected_reward(m) for m in self.models
        }
        return max(expected_rewards, key=expected_rewards.get)
    
    def _expected_reward(self, model: str) -> float:
        """Estimate expected reward from historical performance."""
        if not self.rewards[model]:
            return 0.5  # Prior, so unseen models still get tried
        
        # Exponential moving average weights recent outcomes more heavily
        rewards = self.rewards[model]
        alpha = 0.3
        ema = rewards[0]
        for r in rewards[1:]:
            ema = alpha * r + (1 - alpha) * ema
        return ema
    
    def record_outcome(self, model: str, quality_score: float, latency_ms: float):
        """Record an outcome for learning."""
        # Reward = quality / cost factor - latency penalty.
        # cost_factor should reflect the model's relative price;
        # it is left at 1.0 here as a placeholder.
        cost_factor = 1.0
        reward = quality_score / cost_factor - (latency_ms / 10000)
        
        self.rewards[model].append(max(0, reward))
        
        # Keep a rolling window of the last 100 outcomes
        if len(self.rewards[model]) > 100:
            self.rewards[model] = self.rewards[model][-100:]
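To build intuition for the epsilon-greedy loop, here is a tiny self-contained simulation with made-up success rates: over enough rounds, the router reliably learns to prefer the arm with the higher true reward. The function name and the reward probabilities are illustrative, not taken from the chapter's system.


```python
import random

def simulate_bandit(rounds: int = 5000, epsilon: float = 0.1, seed: int = 0) -> str:
    """Epsilon-greedy simulation over two arms with made-up success rates.
    Returns the most-selected model after `rounds` requests."""
    random.seed(seed)
    true_reward = {"gpt-4o-mini": 0.6, "gpt-4o": 0.9}  # illustrative success rates
    counts = {m: 0 for m in true_reward}
    means = {m: 0.0 for m in true_reward}

    for _ in range(rounds):
        if random.random() < epsilon:
            model = random.choice(list(true_reward))   # explore
        else:
            model = max(means, key=means.get)          # exploit
        reward = 1.0 if random.random() < true_reward[model] else 0.0
        counts[model] += 1
        means[model] += (reward - means[model]) / counts[model]  # running mean

    return max(counts, key=counts.get)
```

With these numbers, the higher-reward arm ends up with the vast majority of pulls; the exploration rate only controls how quickly the estimates separate.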

Routing Accuracy Matters

Routing errors have asymmetric costs. Routing a complex task to a weak model causes quality failure and requires retry. Routing a simple task to a strong model wastes money but still succeeds. Account for this asymmetry when designing routing strategies.
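This asymmetry can be put in numbers. A sketch of the expected per-request cost when cheap-routed failures are retried on the strong model; the prices and failure probability are illustrative assumptions:


```python
def expected_cost_cheap_first(fail_prob: float,
                              cheap_cost: float = 0.0002,
                              strong_cost: float = 0.02) -> float:
    """Expected cost when a request is tried on the cheap model first
    and retried on the strong model if it fails the quality check."""
    return cheap_cost + fail_prob * strong_cost

# Cheap-first routing beats always-strong on cost whenever
#   cheap_cost + fail_prob * strong_cost < strong_cost
# i.e. fail_prob < 1 - cheap_cost / strong_cost (99% at these prices).
# Cost alone rarely rules out cheap-first; the binding constraints
# are the latency and user-visible quality impact of retries.
```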

Quality Gates

Every routing strategy needs a quality gate that catches routing failures:


class QualityGatedRouter:
    """
    Router with quality verification that escalates on failure.
    (_execute_and_verify and _get_stronger_model are helper hooks
    implemented elsewhere.)
    """
    def __init__(
        self,
        base_router: Router,
        quality_evaluator: QualityEvaluator,
        max_retries: int = 2
    ):
        self.router = base_router
        self.evaluator = quality_evaluator
        self.max_retries = max_retries
    
    async def route(self, request: Request) -> RoutingDecision:
        # Initial routing
        decision = self.router.route(request)
        
        # Verify quality
        result = await self._execute_and_verify(request, decision.model)
        
        if result.quality_ok:
            return decision
        
        # Retry with progressively stronger models if quality fails
        current_model = decision.model
        for _ in range(self.max_retries):
            current_model = self._get_stronger_model(current_model)
            
            result = await self._execute_and_verify(request, current_model)
            
            if result.quality_ok:
                return RoutingDecision(
                    model=current_model,
                    confidence=0.5,  # Reduced confidence due to retry
                    routing_reason=f"Upgraded from {decision.model} after quality check"
                )
        
        # Final fallback to the best available model
        return RoutingDecision(
            model="gpt-4o",
            confidence=0.9,
            routing_reason="Quality gate: forced to best model"
        )

Practical Example: QuickShip Customer Support Routing

The QuickShip team implemented cost-aware routing for their support AI after discovering that eighty percent of support requests were being routed to gpt-4o at high cost. Simple queries like "where is my package" were costing $0.02 each when gpt-4o-mini could have handled them for $0.0002. The challenge was to route accurately without degrading the customer experience.

The team decided to implement ML-based routing with quality gates to balance cost savings with service quality. They trained a classifier on ten thousand labeled support tickets to predict which model was appropriate for each query. The features used included query complexity, customer tier, and time sensitivity. They implemented a quality gate with a ninety-five percent accuracy threshold on gpt-4o-mini responses, and when the quality gate fails, the request escalates to gpt-4o.

After implementation, seventy-five percent of requests now use gpt-4o-mini while the quality gate catches the twenty-five percent that need gpt-4o. This achieved a net cost reduction of sixty-eight percent while maintaining customer satisfaction scores. The lesson is that quality gates are essential for conservative routing. They prevent cost savings from coming at the expense of customer experience.
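The reported savings can be sanity-checked against the scenario's own prices. A sketch assuming every request is tried on gpt-4o-mini first and escalated requests pay for both calls; it lands somewhat above the reported sixty-eight percent, and the gap is plausibly quality-gate evaluation overhead, which is not modeled here:


```python
MINI = 0.0002    # per-request prices from the scenario
STRONG = 0.02

def routed_cost(mini_share: float, escalated_share: float) -> float:
    """Per-request cost; escalated requests pay for both the failed
    gpt-4o-mini call and the gpt-4o retry."""
    return mini_share * MINI + escalated_share * (MINI + STRONG)

baseline = STRONG                     # everything on gpt-4o
routed = routed_cost(0.75, 0.25)      # 0.0052 per request
savings = 1 - routed / baseline       # 74% before gate overhead
```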

Routing Analytics

Key Routing Metrics

Track key routing metrics to evaluate whether your routing strategy is achieving its goals:

- Routing accuracy: the percentage of requests correctly routed to the appropriate model. Target: above ninety percent.
- Cost savings: cost compared to always using the best model. Target: above fifty percent savings.
- Quality retention: quality compared to always using the best model. Target: above ninety-eight percent, to ensure minimal quality degradation.
- Escalation rate: the percentage of requests requiring an upgrade to a stronger model. Target: below twenty-five percent; a high escalation rate means the initial routing is too aggressive.

Analyzing Routing Decisions

Log routing decisions to enable analysis:


from dataclasses import dataclass
from datetime import datetime


@dataclass
class RoutingDecisionLog:
    request_id: str
    timestamp: datetime
    request_features: dict
    selected_model: str
    alternative_models: list[str]
    confidence: float
    quality_score: float | None    # None until the response is evaluated
    cost_saved: float              # Versus always using the best model
    was_escalated: bool

Research Frontier

Research on "speculative routing" explores using small, fast models to predict whether larger models are needed. By running both in parallel for uncertain cases, speculative routing can achieve both low latency and high quality for complex queries.
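The idea can be sketched with plain asyncio. Everything here is an assumed shape, not a published API: `call_small`, `call_large`, `is_uncertain`, and `passes_gate` are caller-supplied stand-ins.


```python
import asyncio

async def speculative_route(prompt, call_small, call_large,
                            is_uncertain, passes_gate):
    """Speculative routing sketch: the small model always runs; for
    uncertain prompts the large model is launched in parallel so its
    answer is ready if the small one fails the quality gate."""
    small_task = asyncio.create_task(call_small(prompt))
    large_task = (asyncio.create_task(call_large(prompt))
                  if is_uncertain(prompt) else None)

    small = await small_task
    if passes_gate(small) or large_task is None:
        if large_task is not None:
            large_task.cancel()  # Small answer passed; drop the speculative call
        return small
    return await large_task
```

The tradeoff is paying for two calls on uncertain prompts in exchange for hiding the large model's latency behind the small model's attempt.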