Part IV: Engineering AI Products
Section 16.5

Latency, Cost, and Quality Trade-offs

"Every AI product operates in a triangle of constraints: latency, cost, and quality. You can optimize any two, but the third will suffer. Understanding which constraint is negotiable for each request is the art of AI engineering."

A Platform Architect Who Has Made Every Mistake

Introduction

AI products are fundamentally constrained by three resources: latency (how fast responses arrive), cost (how much each request costs), and quality (how good the outputs are). These constraints are in tension: reducing latency often increases cost; reducing cost often degrades quality; improving quality often increases both latency and cost.

This section covers real-time versus batch trade-offs, caching strategies, quality degradation options, and unit economics for AI products.

Real-Time Versus Batch Trade-offs

Not every request needs an immediate response. Understanding when real-time processing is necessary versus when batch processing suffices is a fundamental architectural decision.

Real-Time Processing

Real-time processing returns responses immediately, synchronously. Users wait for the response before proceeding. This pattern is essential for interactive applications: chatbots, coding assistants, document editors.

When required: Interactive applications where latency directly affects user experience, any case where users are waiting for results, any case where downstream actions depend on the response.

Trade-offs: Real-time processing cannot take advantage of batch optimization. You pay a premium for immediacy.

Batch Processing

Batch processing accumulates requests and processes them together. This enables optimization: sharing prompt templates across requests, reusing computed context, and scheduling during off-peak hours.

When appropriate: Non-interactive use cases, report generation, bulk document processing, any case where results are needed but immediacy is not required.

Trade-offs: Batch processing introduces delay. For some use cases, this delay is acceptable; for others, it is not.

+------------------------------------------------------------------+
|                   REAL-TIME VS BATCH DECISION                    |
+------------------------------------------------------------------+
|                                                                  |
|  Is user waiting for response?                                   |
|   |                                                              |
|   YES--+---> Real-time processing                                |
|   |    |                                                         |
|   |    |     User experience critical                            |
|   |    |     Latency budget < 5s                                 |
|   |                                                              |
|   NO --+---> Batch processing                                    |
|        |                                                         |
|        |     Results needed within hours?                        |
|        |      |                                                  |
|        |      YES--+---> Scheduled batch (overnight)             |
|        |      |                                                  |
|        |      NO --+---> On-demand batch (when triggered)        |
|                                                                  |
+------------------------------------------------------------------+

Hybrid Approaches

Many applications benefit from hybrid approaches. Provide fast, lower-quality responses immediately, then refine in the background. This pattern is common in search: show results instantly, then improve ranking as more computation becomes available.

Streaming: Return partial results as they become available rather than waiting for complete generation. This reduces perceived latency even when actual processing time is unchanged.
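The streaming pattern can be sketched with an async generator. This is a minimal illustration, not a specific provider's API: the `model.stream()` method and the shape of its chunks are assumed for the example.

```javascript
// Sketch of streaming delivery. The model client and its stream() method
// are hypothetical; real provider SDKs differ in details.
async function* streamComplete(model, prompt) {
  for await (const chunk of model.stream(prompt)) {
    yield chunk; // hand each partial result to the caller immediately
  }
}

// Consumer: render tokens as they arrive instead of waiting for the
// complete generation. Total processing time is unchanged, but the
// user sees output almost immediately.
async function renderStreaming(model, prompt, onChunk) {
  let full = "";
  for await (const chunk of streamComplete(model, prompt)) {
    full += chunk;
    onChunk(chunk); // e.g. append to the UI
  }
  return full;
}
```

The consumer accumulates the full response for logging or caching while still surfacing each chunk as soon as it is available.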

Caching Strategies

Caching eliminates redundant computation by storing and reusing previous results. Effective caching dramatically reduces cost and latency for repeat requests.

Caching by Request

The simplest caching strategy stores results keyed by request hash. Identical requests return cached results immediately without model invocation.

Best for: Requests with high repetition (FAQs, common queries), scenarios where exact duplicates are common.

Limitation: Most production requests are not exact duplicates. Even small variations in phrasing produce different outputs.

// Request hash caching
async function cachedComplete(prompt, options) {
  const cacheKey = hash(prompt + JSON.stringify(options));
  const cached = await cache.get(cacheKey);
  if (cached) {
    metrics.recordCacheHit();
    return cached;
  }
  metrics.recordCacheMiss();
  const result = await model.complete(prompt, options);
  await cache.set(cacheKey, result, { ttl: 3600 });
  return result;
}

Semantic Caching

Semantic caching stores embeddings of requests and returns cached results for semantically similar requests. This dramatically increases cache hit rates compared to exact matching.

How it works: Compute embedding for incoming request, find nearest cached embedding above similarity threshold, return cached result if found.

Best for: Request distributions with many near-duplicate or paraphrased queries, including long-tail distributions where exact duplicates are rare but semantically similar requests are common.

// Semantic caching
async function semanticCachedComplete(prompt, options) {
  const embedding = await embedder.embed(prompt);
  const cached = await semanticCache.lookup(embedding, {
    threshold: 0.95,
    maxResults: 1
  });
  if (cached) {
    metrics.recordCacheHit();
    return { ...cached.result, cacheHit: true };
  }
  metrics.recordCacheMiss();
  const result = await model.complete(prompt, options);
  await semanticCache.store(embedding, result, { ttl: 86400 });
  return { ...result, cacheHit: false };
}

Cache Invalidation

Caches must be invalidated when underlying data changes. Without proper invalidation, stale cached results reflect outdated information.

Time-based: Expire cached entries after a fixed duration. Simple but may serve stale content before expiration.

Event-based: Invalidate cache entries when relevant data changes. More complex but ensures freshness.

Version-based: Include version identifiers in cache keys. When versions change, old cache entries become effectively invalid.

The Cache Hit Rate Illusion

High cache hit rates are not always good. If your cache is too aggressive, semantically different requests may receive inappropriate cached results. Monitor not just hit rates but quality of cached results. If cached results are frequently overridden or rejected, your cache threshold is too loose.

Quality Degradation Options

When resources are constrained, you can deliberately degrade quality in controlled ways. The key is understanding which degradations are acceptable for your use case.

Model Downgrade

Route to smaller models when quality constraints allow. This is the most direct way to reduce cost and often reduces latency as well.

When acceptable: Low-stakes tasks, internal tools, cases where users are not paying a premium for quality.

When not acceptable: High-stakes decisions, customer-facing outputs where quality reflects on your brand.
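The acceptability criteria above can be encoded as a simple downgrade rule. This is a sketch under assumed inputs: the request attributes, tier names, and model names are all illustrative, not from the text.

```javascript
// Sketch of a stakes-based model downgrade rule. All names
// (highStakes, customerFacing, tier, model identifiers) are
// hypothetical placeholders for illustration.
function pickModel({ highStakes = false, customerFacing = false, tier = "standard" }) {
  // High-stakes decisions and premium customer-facing output keep the large model.
  if (highStakes || (customerFacing && tier === "premium")) {
    return "large-model";
  }
  // Internal tools and low-stakes tasks take the cheaper, faster model.
  return "small-model";
}
```

In practice this rule would live in a router (see the cross-reference to model routers), with the thresholds tuned against measured quality on each task.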

Context Truncation

Reduce the context sent to models by truncating long inputs. Smaller inputs cost less and generate faster.

When acceptable: When the truncated portion is not critical to the task, when models can reason effectively with limited context.

When not acceptable: Tasks requiring full document understanding, cases where key information may be in any part of the document.
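One common truncation compromise is to keep the head and tail of a long input within a fixed budget. The sketch below is character-based for simplicity; a production version would count tokens. It illustrates exactly the risk noted above: anything in the middle of the document is dropped.

```javascript
// Sketch of head-and-tail context truncation under a character budget.
// A real implementation would use a tokenizer; character counts are a
// simplifying assumption for illustration.
function truncateContext(text, budget) {
  if (text.length <= budget) return text;
  const marker = "\n...\n";
  const half = Math.floor((budget - marker.length) / 2);
  // Keep the beginning (often instructions) and the end (often the
  // most recent content); drop the middle.
  return text.slice(0, half) + marker + text.slice(-half);
}
```

The head/tail split works well for conversations, where instructions sit at the start and the latest turns at the end, and poorly for documents where key information may appear anywhere.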

Output Length Limits

Limit the length of generated outputs to reduce cost and latency.

When acceptable: Summaries, brief responses, cases where conciseness is acceptable.

When not acceptable: Detailed analysis, comprehensive outputs, cases where completeness is required.

+------------------------------------------------------------------+
|                  QUALITY DEGRADATION SPECTRUM                    |
+------------------------------------------------------------------+
|                                                                  |
|  High Quality                                                    |
|       ^                                                          |
|       |   Full models   |  Full context   |  No limits           |
|       |                 |                 |                      |
|       |   Smaller       |  Truncated      |  Length              |
|       |   models        |  context        |  limits              |
|       |                 |                 |                      |
|       |   Fallback      |  Summarized     |  Concise             |
|       |   models        |  context        |  outputs             |
|       v                                                          |
|  Low Quality                                                     |
|                                                                  |
+------------------------------------------------------------------+

Unit Economics

Unit economics tracks the cost and revenue per unit of value delivered. For AI products, understanding unit economics guides pricing, architecture, and optimization priorities.

Cost Components

AI product costs have multiple components:

Token costs: Input and output tokens processed by models. This is often the largest cost component but is not the only one.

Infrastructure costs: Compute, storage, networking for serving infrastructure. These scale differently than token costs.

Engineering costs: Development and maintenance of the AI system. These are fixed costs that amortize over volume.

Evaluation costs: Human evaluation, automated testing, quality monitoring. Often overlooked but essential.

Cost Component   How to Reduce                           Risk of Reduction
---------------  --------------------------------------  -------------------------------
Token costs      Smaller models, caching, batching       Quality degradation
Infrastructure   Right-sizing, spot instances, regions   Reliability, latency
Engineering      Reduce features, simplify architecture  Technical debt, maintainability
Evaluation       Reduce testing, automate more           Quality blind spots
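Putting the components together, per-request unit cost can be modeled with a few lines of arithmetic. All rates below are assumed for illustration; the point is the structure: cache hits skip the model call, so token cost applies only to misses, while amortized infrastructure cost applies to every request.

```javascript
// Sketch of a per-request unit-cost model. All rates are illustrative
// assumptions, not real provider prices.
function costPerRequest({ inputTokens, outputTokens, cacheHitRate }) {
  const INPUT_RATE = 3 / 1e6;   // $ per input token (assumed)
  const OUTPUT_RATE = 15 / 1e6; // $ per output token (assumed)
  const INFRA = 0.0004;         // amortized infrastructure $ per request (assumed)
  const tokenCost = inputTokens * INPUT_RATE + outputTokens * OUTPUT_RATE;
  // Cache hits skip the model call entirely; only misses pay token cost.
  return (1 - cacheHitRate) * tokenCost + INFRA;
}
```

Even this toy model makes the leverage visible: doubling the cache hit rate halves the dominant token term, while infrastructure cost is untouched by caching.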

Optimization Priorities

Focus optimization efforts where they have the most impact. Use cost-driver analysis to identify where changes have the largest effect on unit economics.

High-leverage optimizations: Switching to smaller models for appropriate tasks, improving cache hit rates, eliminating unnecessary context.

Low-leverage optimizations: Negotiating model provider rates (marginal gains), micro-optimizations in serving infrastructure.

Practical Example: HealthMetrics Unit Economics

Who: A healthcare analytics startup with three product tiers

Situation: HealthMetrics charges per-query pricing but costs were higher than revenue for most customers

Analysis: The team analyzed cost drivers across 100,000 requests:

Token costs: 65% of total cost.
Cache hit rate: 23% (much lower than target).
Context inflation: An average of 40% of input tokens were unnecessary context from long conversation histories.
Model selection: 45% of requests were routed to GPT-4 when smaller models would suffice.

Interventions: (1) Implemented semantic caching; hit rate improved to 71%. (2) Added context truncation for conversations over 10 turns. (3) Retuned the router to be more aggressive with smaller models. (4) Added usage tiers that matched model selection to customer tier.

Result: Cost per query decreased 68%. Unit economics flipped from negative to positive at current pricing. The free tier remained negative but was subsidized by paid tiers.

Pricing Implications

Understanding unit economics directly informs pricing strategy. Do not price based on costs alone, but costs set the floor below which you lose money.

Cost-plus pricing: Price above unit cost by target margin. Simple but may be above customer willingness to pay.

Value-based pricing: Price based on value delivered to customer, not cost to serve. Higher margins for high-value use cases.

Tiered pricing: Different tiers for different usage levels or quality requirements. Allows capturing more surplus while serving price-sensitive segments.

The Margin Cascade

AI product margins cascade through the system. A 10% improvement in cache hit rates may improve margins by more than 10% because it affects every request. A 10% improvement in routing accuracy may improve margins by 20% if it consistently moves expensive requests to cheaper models. Focus on the leverage points, not the obvious costs.
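The routing claim is easy to verify with illustrative numbers (all assumed, not from the text): blended cost is a traffic-weighted average of an expensive and a cheap model, so shifting even a modest share of traffic moves the total disproportionately.

```javascript
// Illustrative margin-cascade arithmetic. Traffic shares and per-request
// costs are assumed numbers for demonstration.
function blendedCost(expensiveShare, expensiveCost, cheapCost) {
  return expensiveShare * expensiveCost + (1 - expensiveShare) * cheapCost;
}

// Before: 45% of traffic on a $0.02/request model, the rest at $0.002.
const before = blendedCost(0.45, 0.02, 0.002);
// After: a better router moves 10 points of traffic to the cheap model.
const after = blendedCost(0.35, 0.02, 0.002);
```

With these numbers, a 10-point shift in routing cuts blended cost from about $0.0101 to about $0.0083 per request, roughly an 18% reduction, which is why routing accuracy is a leverage point rather than a marginal tweak.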

Cross-References

For caching in RAG systems, see Section 17.1, Vector Databases and Retrieval. For routing strategies, see Section 16.2, Model Routers. For evaluation of cost-quality trade-offs, see Chapter 24, Evaluation.

Section Summary

Latency, cost, and quality form a trade-off triangle. Real-time processing is required for interactive applications; batch processing enables optimization for non-interactive cases. Caching dramatically reduces cost and latency for repeat and semantically similar requests. Quality can be deliberately degraded through model downgrade, context truncation, or output limits. Unit economics analysis reveals high-leverage optimization opportunities. Pricing should account for but not be constrained by unit economics.