"A model that can only generate text is a curiosity. A model that can generate valid JSON, call tools, and interact with external systems is a platform. Structured outputs transform language models from text generators into reliable components."
An Engineer Who Wasted Six Months on Regex Parsing
Introduction
Raw language model outputs are unstructured text. Production AI systems need structured outputs: JSON with specific schemas, function calls with typed arguments, and tool invocations that external systems can process. This section covers JSON mode, function calling, tool schema design, and reliability across models.
JSON Mode and Structured Generation
JSON mode instructs the model to emit syntactically valid JSON; stricter structured-output modes additionally constrain generation to a specific schema. Both enable programmatic consumption of model outputs without fragile parsing logic.
How JSON Mode Works
When you request JSON mode, you provide a schema describing the expected output structure. The model generates JSON matching that schema, and the system validates the output; providers differ in how strictly the schema is enforced during decoding, so validation is still required. If validation fails, you can retry or fall back to a different approach.
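This generate-validate-retry loop can be sketched in a few lines. The model call below is a hypothetical stand-in (`call_model_json_mode` is not a real API); a real implementation would call your provider's JSON-mode endpoint.

```python
import json

def call_model_json_mode(prompt: str, attempt: int) -> str:
    # Hypothetical stub simulating a model that truncates its first
    # response, then returns valid JSON on retry.
    if attempt == 0:
        return '{"name": "Alice", "age": '  # truncated output
    return '{"name": "Alice", "age": 30}'

def generate_validated(prompt: str, required: set, max_retries: int = 3) -> dict:
    """Generate JSON, validate it, and retry on failure."""
    for attempt in range(max_retries):
        raw = call_model_json_mode(prompt, attempt)
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue  # structural failure: retry
        if required.issubset(obj):  # minimal schema check: required keys exist
            return obj
    raise ValueError("No valid output after retries")

result = generate_validated("Extract the person.", {"name", "age"})
```

In production the minimal key check would be replaced by full schema validation, and the retry would typically feed the parse error back into the prompt.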
JSON Schema Design
Design schemas that are strict enough to be reliable but flexible enough to accommodate model limitations. Overly strict schemas cause high retry rates; overly loose schemas fail to provide structure.
Best practices: Use required fields sparingly; prefer nullable optional fields over strict unions; use enum fields for known values; include descriptions to guide the model; avoid deeply nested structures when possible.
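A schema following these practices might look like the hypothetical invoice-extraction schema below: few required fields, a nullable optional field, an enum for known values, a description on every field, and no deep nesting.

```python
# Hypothetical extraction schema illustrating the practices above.
invoice_schema = {
    "type": "object",
    "properties": {
        "vendor": {
            "type": "string",
            "description": "Vendor name exactly as printed on the invoice.",
        },
        "currency": {
            "type": "string",
            "enum": ["USD", "EUR", "GBP"],  # constrain known values
            "description": "Three-letter currency code.",
        },
        "due_date": {
            "type": ["string", "null"],  # nullable optional, not a strict union
            "description": "ISO 8601 date, or null if no due date is stated.",
        },
    },
    "required": ["vendor"],  # keep required fields minimal
}
```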
Function Calling and Tool Use
Function calling (also called tool use) allows models to invoke external tools with structured arguments. This transforms models from text generators into interactive agents that can take actions in the world.
Function Calling Patterns
Zero-shot function calling: The model receives a list of available functions and selects which to call based on the request. The model reasons about intent and chooses appropriate tools.
Forced function calling: You specify which function must be called, and the model generates arguments for that function. Useful when user intent clearly maps to a specific tool.
Parallel function calling: The model generates multiple function calls simultaneously. Useful when a request requires multiple independent actions.
| Pattern | Use Case | Reliability |
|---|---|---|
| Zero-shot | Open-ended tool use | Moderate (requires good function descriptions) |
| Forced | Intent is clear | High |
| Parallel | Independent actions | Moderate-High |
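The three patterns differ mainly at the request level. The sketch below uses an OpenAI-style tool format (other providers use similar but not identical structures); no API call is made, and `get_weather` is a hypothetical tool.

```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

# Zero-shot: the model decides whether and which tool to call.
zero_shot_request = {"tools": tools, "tool_choice": "auto"}

# Forced: the model must call get_weather and only generates its arguments.
forced_request = {
    "tools": tools,
    "tool_choice": {"type": "function", "function": {"name": "get_weather"}},
}

# Parallel: the response typically carries a *list* of tool calls, so the
# dispatcher should iterate rather than assume a single call.
```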
Function Schema Design
Function schemas tell the model what a function does and what arguments it accepts. Well-designed schemas are critical for reliable function calling.
The Description is the Prompt
For function calling, the function description is as important as the parameters schema. Include clear guidance on when to use the function, what kind of requests it handles, and any edge cases to consider. Ambiguous descriptions lead to incorrect function selection.
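As a hypothetical before/after, compare the two descriptions below for the same search tool. Only the description changes, yet it is what the model uses to decide whether this function fits the request.

```python
# Ambiguous: gives the model nothing to reason about.
vague_tool = {
    "name": "search",
    "description": "Search.",
}

# Clear: says what it searches, when to use it, and when not to.
clear_tool = {
    "name": "search_orders",
    "description": (
        "Search the customer's past orders by keyword or date range. "
        "Use this for questions about previous purchases, refunds, or "
        "delivery status. Do NOT use it for product catalog queries; "
        "use search_catalog for those."
    ),
}
```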
Tool Schema Design Best Practices
Tool schemas define the interface between your AI system and external systems. Good tool schemas improve reliability and reduce misuse.
Schema Design Principles
Be specific about types: Use the most specific types possible. Instead of allowing any string for an enum-like field, use enum constraints.
Provide example values: Include examples in descriptions to guide the model toward valid outputs.
Handle missing values explicitly: Decide how the model should indicate missing or unknown values. An explicit marker like "NOT_FOUND" is often more reliable than null or empty strings, which models emit inconsistently.
Limit complexity: Deeply nested parameters increase error rates. Flatten where possible, or break into multiple function calls.
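The principles above combine in the hypothetical parameters schema below: a strict enum, an example value embedded in a description, a flat structure, and an explicit "NOT_FOUND" marker that downstream code checks for.

```python
# Hypothetical parameters schema for a contract-party extraction tool.
extract_party_params = {
    "type": "object",
    "properties": {
        "role": {
            "type": "string",
            "enum": ["buyer", "seller", "guarantor"],  # specific, not free text
            "description": "Role of the party in the contract.",
        },
        "name": {
            "type": "string",
            "description": (
                'Legal name, e.g. "Acme Holdings LLC". '
                'Use "NOT_FOUND" if the document never names the party.'
            ),
        },
    },
    "required": ["role", "name"],
}

def is_missing(value: str) -> bool:
    # Downstream code checks the explicit marker, not null or empty string.
    return value == "NOT_FOUND"
```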
Handling Tool Failures
Tools fail. Network errors, authentication issues, and service outages all happen. Design your system to handle tool failures gracefully.
Retry logic: Implement automatic retries with exponential backoff for transient failures.
Error propagation: When a tool fails, propagate the error to the model with context about what happened. The model can often work around failures or provide better error messages.
Fallback tools: Define fallback tools that provide degraded but functional alternatives when primary tools are unavailable.
Reliability Across Models
Different models have different reliability levels for structured outputs. A model that produces valid JSON 99% of the time may only produce valid function calls 85% of the time. Understanding these differences guides model selection for structured output tasks.
Reliability Benchmarks
Test each model on your specific structured output tasks to measure reliability. Do not assume that benchmark performance translates to your use case.
Measure: JSON validity rate (parses as valid JSON), schema conformance rate (conforms to your schema), functional correctness rate (output is semantically correct).
| Model | JSON Validity | Schema Conformance | Functional Correctness |
|---|---|---|---|
| GPT-4o | 99.5% | 98.0% | 97.5% |
| Claude 3.5 | 99.2% | 97.5% | 97.0% |
| GPT-4o-mini | 98.0% | 94.0% | 92.0% |
| Llama 3.1 70B | 95.0% | 88.0% | 85.0% |
These numbers are illustrative. Your actual reliability will vary based on your schemas, prompts, and task types. Always test empirically.
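The three rates can be measured with a small harness over your own task outputs. The sample below is hypothetical: each entry is a raw model output for a task whose schema requires a string `name` and integer `age`, paired with the ground-truth name.

```python
import json

samples = [
    ('{"name": "Alice", "age": 30}', "Alice"),
    ('{"name": "Bob", "age": "thirty"}', "Bob"),    # schema violation
    ('{"name": "Carol", "age": 41}', "Caroline"),   # semantically wrong
    ('{"name": "Dan", "age": 22', "Dan"),           # invalid JSON
]

valid = conformant = correct = 0
for raw, truth in samples:
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        continue                                    # fails JSON validity
    valid += 1
    if isinstance(obj.get("name"), str) and isinstance(obj.get("age"), int):
        conformant += 1                             # passes schema conformance
        if obj["name"] == truth:
            correct += 1                            # functionally correct

n = len(samples)
rates = (valid / n, conformant / n, correct / n)
```

Note how each rate is computed only over outputs that passed the previous layer, mirroring the validity ≥ conformance ≥ correctness ordering in the table.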
Validation Layers
Never trust model outputs without validation. Even the most reliable models produce invalid outputs occasionally. Build validation layers that check outputs before passing them downstream.
Structural validation: Check that output parses as valid JSON and conforms to schema.
Semantic validation: Check that output values are reasonable and consistent with other known values.
Business logic validation: Check that output satisfies domain-specific constraints.
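The three layers compose naturally into a single validation function. The sketch below assumes a hypothetical contract-extraction output with an `effective_date` and a `termination_days` notice period; each layer raises on failure so bad outputs never reach downstream code.

```python
import json
from datetime import date

def validate_structural(raw: str, required: dict) -> dict:
    """Layer 1: parses as JSON and has the expected field types."""
    obj = json.loads(raw)  # raises on invalid JSON
    for field, ftype in required.items():
        if not isinstance(obj.get(field), ftype):
            raise ValueError(f"field {field!r} missing or wrong type")
    return obj

def validate_semantic(obj: dict) -> dict:
    """Layer 2: values are individually reasonable."""
    d = date.fromisoformat(obj["effective_date"])  # raises on bad dates
    if not (date(1900, 1, 1) <= d <= date(2100, 1, 1)):
        raise ValueError("effective_date out of plausible range")
    return obj

def validate_business(obj: dict) -> dict:
    """Layer 3: domain-specific constraints hold."""
    if obj["termination_days"] < 0:
        raise ValueError("notice period cannot be negative")
    return obj

def validate(raw: str) -> dict:
    required = {"effective_date": str, "termination_days": int}
    return validate_business(validate_semantic(validate_structural(raw, required)))

ok = validate('{"effective_date": "2024-03-01", "termination_days": 30}')
```

Ordering matters: structural checks run first because the later layers assume the fields exist and have the right types.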
Validation is Not Optional
Every production system must validate structured outputs. Model outputs can be invalid for reasons ranging from simple token limits to complex reasoning errors. Without validation, invalid outputs propagate through your system, causing errors, crashes, or data corruption downstream.
Practical Example: DataForge Structured Output Pipeline
Who: A legal tech startup processing contract documents
Situation: DataForge extracts structured information from contracts using function calling
Problem: Initial implementation had 12% failure rate due to invalid function calls and schema mismatches
Solution: The team implemented a multi-layer validation pipeline:
Layer 1: Structural validation rejects non-JSON and schema violations immediately. Layer 2: Semantic validation checks that extracted dates are reasonable and entities are consistent. Layer 3: Business logic validation ensures extracted clauses are internally consistent. Layer 4: Cross-document validation flags inconsistencies across multiple contract sections.
Result: Invalid output rate dropped from 12% to 0.3%. The remaining failures are flagged for human review. Human review queue decreased 85%, making the system economically viable.
Cross-References
For tool use in agentic architectures, see Chapter 15.4 Agentic Workflow Systems. For security considerations with tool calling, see Chapter 20 Security. For evaluation of structured outputs, see Chapter 24 Evaluation.
Section Summary
Structured outputs transform models from text generators into reliable system components. JSON mode constrains output to conform to schemas, enabling programmatic consumption. Function calling enables models to invoke external tools with typed arguments. Tool schema design requires specific types, clear descriptions, and explicit handling of missing values. Validation layers are essential because every model produces invalid outputs occasionally. Measure reliability empirically across your specific tasks and schemas, not just on benchmarks.