"A model that can only generate text is a curiosity. A model that can generate valid JSON, call tools, and interact with external systems is a platform. Structured outputs transform language models from text generators into reliable components."
An Engineer Who Wasted Six Months on Regex Parsing
Introduction
Raw language model outputs are unstructured text. Production AI systems need structured outputs: JSON with specific schemas, function calls with typed arguments, and tool invocations that external systems can process. This section covers JSON mode, function calling, tool schema design, and reliability across models.
JSON Mode and Structured Generation
JSON mode instructs the model to emit syntactically valid JSON; stricter structured-output modes additionally constrain generation to a specific schema. Both enable programmatic consumption of model outputs without fragile parsing logic.
How JSON Mode Works
When you request JSON mode, you provide a schema describing the expected output structure. The model generates JSON matching that schema, and the system validates the output; providers differ in how strictly the schema is enforced during decoding, so validation is still required. If validation fails, you can retry or fall back to a different approach.
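This generate-validate-retry loop can be sketched in a few lines. The model call below is a hypothetical stand-in (`call_model_json_mode` is not a real API); a real implementation would call your provider's JSON-mode endpoint.

```python
import json

def call_model_json_mode(prompt: str, attempt: int) -> str:
    # Hypothetical stub simulating a model that truncates its first
    # response, then returns valid JSON on retry.
    if attempt == 0:
        return '{"name": "Alice", "age": '  # truncated output
    return '{"name": "Alice", "age": 30}'

def generate_validated(prompt: str, required: set, max_retries: int = 3) -> dict:
    """Generate JSON, validate it, and retry on failure."""
    for attempt in range(max_retries):
        raw = call_model_json_mode(prompt, attempt)
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue  # structural failure: retry
        if required.issubset(obj):  # minimal schema check: required keys exist
            return obj
    raise ValueError("No valid output after retries")

result = generate_validated("Extract the person.", {"name", "age"})
```

In production the minimal key check would be replaced by full schema validation, and the retry would typically feed the parse error back into the prompt.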
JSON Schema Design
Design schemas that are strict enough to be reliable but flexible enough to accommodate model limitations. Overly strict schemas cause high retry rates; overly loose schemas fail to provide structure.
Best practices: Use required fields sparingly; prefer nullable optional fields over strict unions; use enum fields for known values; include descriptions to guide the model; avoid deeply nested structures when possible.
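A schema following these practices might look like the hypothetical invoice-extraction schema below: few required fields, a nullable optional field, an enum for known values, a description on every field, and no deep nesting.

```python
# Hypothetical extraction schema illustrating the practices above.
invoice_schema = {
    "type": "object",
    "properties": {
        "vendor": {
            "type": "string",
            "description": "Vendor name exactly as printed on the invoice.",
        },
        "currency": {
            "type": "string",
            "enum": ["USD", "EUR", "GBP"],  # constrain known values
            "description": "Three-letter currency code.",
        },
        "due_date": {
            "type": ["string", "null"],  # nullable optional, not a strict union
            "description": "ISO 8601 date, or null if no due date is stated.",
        },
    },
    "required": ["vendor"],  # keep required fields minimal
}
```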
Function Calling and Tool Use
Function calling (also called tool use) allows models to invoke external tools with structured arguments. This transforms models from text generators into interactive agents that can take actions in the world.
Function Calling Patterns
Zero-shot function calling: The model receives a list of available functions and selects which to call based on the request. The model reasons about intent and chooses appropriate tools.
Forced function calling: You specify which function must be called, and the model generates arguments for that function. Useful when user intent clearly maps to a specific tool.
Parallel function calling: The model generates multiple function calls simultaneously. Useful when a request requires multiple independent actions.
| Pattern | Use Case | Reliability |
|---|---|---|
| Zero-shot | Open-ended tool use | Moderate (requires good function descriptions) |
| Forced | Intent is clear | High |
| Parallel | Independent actions | Moderate-High |
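The three patterns differ mainly at the request level. The sketch below uses an OpenAI-style tool format (other providers use similar but not identical structures); no API call is made, and `get_weather` is a hypothetical tool.

```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

# Zero-shot: the model decides whether and which tool to call.
zero_shot_request = {"tools": tools, "tool_choice": "auto"}

# Forced: the model must call get_weather and only generates its arguments.
forced_request = {
    "tools": tools,
    "tool_choice": {"type": "function", "function": {"name": "get_weather"}},
}

# Parallel: the response typically carries a *list* of tool calls, so the
# dispatcher should iterate rather than assume a single call.
```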
Function Schema Design
Function schemas tell the model what a function does and what arguments it accepts. Well-designed schemas are critical for reliable function calling.
The Description is the Prompt
For function calling, the function description is as important as the parameters schema. Include clear guidance on when to use the function, what kind of requests it handles, and any edge cases to consider. Ambiguous descriptions lead to incorrect function selection.
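As a hypothetical before/after, compare the two descriptions below for the same search tool. Only the description changes, yet it is what the model uses to decide whether this function fits the request.

```python
# Ambiguous: gives the model nothing to reason about.
vague_tool = {
    "name": "search",
    "description": "Search.",
}

# Clear: says what it searches, when to use it, and when not to.
clear_tool = {
    "name": "search_orders",
    "description": (
        "Search the customer's past orders by keyword or date range. "
        "Use this for questions about previous purchases, refunds, or "
        "delivery status. Do NOT use it for product catalog queries; "
        "use search_catalog for those."
    ),
}
```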
Tool Schema Design Best Practices
Tool schemas define the interface between your AI system and external systems. Good tool schemas improve reliability and reduce misuse.
Schema Design Principles
Be specific about types: Use the most specific types possible. Instead of allowing any string for an enum-like field, use enum constraints.
Provide example values: Include examples in descriptions to guide the model toward valid outputs.
Handle missing values explicitly: Decide how the model should indicate missing or unknown values. An explicit marker like "NOT_FOUND" is often more reliable than null or empty strings, which models emit inconsistently.
Limit complexity: Deeply nested parameters increase error rates. Flatten where possible, or break into multiple function calls.
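The principles above combine in the hypothetical parameters schema below: a strict enum, an example value embedded in a description, a flat structure, and an explicit "NOT_FOUND" marker that downstream code checks for.

```python
# Hypothetical parameters schema for a contract-party extraction tool.
extract_party_params = {
    "type": "object",
    "properties": {
        "role": {
            "type": "string",
            "enum": ["buyer", "seller", "guarantor"],  # specific, not free text
            "description": "Role of the party in the contract.",
        },
        "name": {
            "type": "string",
            "description": (
                'Legal name, e.g. "Acme Holdings LLC". '
                'Use "NOT_FOUND" if the document never names the party.'
            ),
        },
    },
    "required": ["role", "name"],
}

def is_missing(value: str) -> bool:
    # Downstream code checks the explicit marker, not null or empty string.
    return value == "NOT_FOUND"
```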
Handling Tool Failures
Tools fail. Network errors, authentication issues, and service outages all happen. Design your system to handle tool failures gracefully.
Retry logic: Implement automatic retries with exponential backoff for transient failures.
Error propagation: When a tool fails, propagate the error to the model with context about what happened. The model can often work around failures or provide better error messages.
Fallback tools: Define fallback tools that provide degraded but functional alternatives when primary tools are unavailable.
Reliability Across Models
Different models have different reliability levels for structured outputs. A model that produces valid JSON 99% of the time may only produce valid function calls 85% of the time. Understanding these differences guides model selection for structured output tasks.
Reliability Benchmarks
Test each model on your specific structured output tasks to measure reliability. Do not assume that benchmark performance translates to your use case.
Measure: JSON validity rate (parses as valid JSON), schema conformance rate (conforms to your schema), functional correctness rate (output is semantically correct).
| Model | JSON Validity | Schema Conformance | Functional Correctness |
|---|---|---|---|
| GPT-4o | 99.5% | 98.0% | 97.5% |
| Claude 3.5 | 99.2% | 97.5% | 97.0% |
| GPT-4o-mini | 98.0% | 94.0% | 92.0% |
| Llama 3.1 70B | 95.0% | 88.0% | 85.0% |
These numbers are illustrative. Your actual reliability will vary based on your schemas, prompts, and task types. Always test empirically.
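The three rates can be measured with a small harness over your own task outputs. The sample below is hypothetical: each entry is a raw model output for a task whose schema requires a string `name` and integer `age`, paired with the ground-truth name.

```python
import json

samples = [
    ('{"name": "Alice", "age": 30}', "Alice"),
    ('{"name": "Bob", "age": "thirty"}', "Bob"),    # schema violation
    ('{"name": "Carol", "age": 41}', "Caroline"),   # semantically wrong
    ('{"name": "Dan", "age": 22', "Dan"),           # invalid JSON
]

valid = conformant = correct = 0
for raw, truth in samples:
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        continue                                    # fails JSON validity
    valid += 1
    if isinstance(obj.get("name"), str) and isinstance(obj.get("age"), int):
        conformant += 1                             # passes schema conformance
        if obj["name"] == truth:
            correct += 1                            # functionally correct

n = len(samples)
rates = (valid / n, conformant / n, correct / n)
```

Note how each rate is computed only over outputs that passed the previous layer, mirroring the validity ≥ conformance ≥ correctness ordering in the table.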
Validation Layers
Never trust model outputs without validation. Even the most reliable models produce invalid outputs occasionally. Build validation layers that check outputs before passing them downstream.
Structural validation: Check that output parses as valid JSON and conforms to schema.
Semantic validation: Check that output values are reasonable and consistent with other known values.
Business logic validation: Check that output satisfies domain-specific constraints.
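The three layers compose naturally into a single validation function. The sketch below assumes a hypothetical contract-extraction output with an `effective_date` and a `termination_days` notice period; each layer raises on failure so bad outputs never reach downstream code.

```python
import json
from datetime import date

def validate_structural(raw: str, required: dict) -> dict:
    """Layer 1: parses as JSON and has the expected field types."""
    obj = json.loads(raw)  # raises on invalid JSON
    for field, ftype in required.items():
        if not isinstance(obj.get(field), ftype):
            raise ValueError(f"field {field!r} missing or wrong type")
    return obj

def validate_semantic(obj: dict) -> dict:
    """Layer 2: values are individually reasonable."""
    d = date.fromisoformat(obj["effective_date"])  # raises on bad dates
    if not (date(1900, 1, 1) <= d <= date(2100, 1, 1)):
        raise ValueError("effective_date out of plausible range")
    return obj

def validate_business(obj: dict) -> dict:
    """Layer 3: domain-specific constraints hold."""
    if obj["termination_days"] < 0:
        raise ValueError("notice period cannot be negative")
    return obj

def validate(raw: str) -> dict:
    required = {"effective_date": str, "termination_days": int}
    return validate_business(validate_semantic(validate_structural(raw, required)))

ok = validate('{"effective_date": "2024-03-01", "termination_days": 30}')
```

Ordering matters: structural checks run first because the later layers assume the fields exist and have the right types.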
Validation is Not Optional
Every production system must validate structured outputs. Model outputs can be invalid for reasons ranging from simple token limits to complex reasoning errors. Without validation, invalid outputs propagate through your system, causing errors, crashes, or data corruption downstream.
Practical Example: DataForge Structured Output Pipeline
Who: A legal tech startup processing contract documents
Situation: DataForge extracts structured information from contracts using function calling
Problem: Initial implementation had 12% failure rate due to invalid function calls and schema mismatches
Solution: The team implemented a multi-layer validation pipeline:
Layer 1: Structural validation rejects non-JSON and schema violations immediately. Layer 2: Semantic validation checks that extracted dates are reasonable and entities are consistent. Layer 3: Business logic validation ensures extracted clauses are internally consistent. Layer 4: Cross-document validation flags inconsistencies across multiple contract sections.
Result: Invalid output rate dropped from 12% to 0.3%. The remaining failures are flagged for human review. Human review queue decreased 85%, making the system economically viable.
Cross-References
For tool use in agentic architectures, see Chapter 15.4 Agentic Workflow Systems. For security considerations with tool calling, see Chapter 20 Security. For evaluation of structured outputs, see Chapter 24 Evaluation.
Section Summary
Structured outputs transform models from text generators into reliable system components. JSON mode constrains output to conform to schemas, enabling programmatic consumption. Function calling enables models to invoke external tools with typed arguments. Tool schema design requires specific types, clear descriptions, and explicit handling of missing values. Validation layers are essential because every model produces invalid outputs occasionally. Measure reliability empirically across your specific tasks and schemas, not just on benchmarks.