Structured outputs vs prompt-based JSON extraction - when do you actually need constrained decoding?

Every major LLM provider now supports some form of structured output. OpenAI has response_format: { type: "json_schema" }, Anthropic shipped constrained decoding for Claude, and Google’s Gemini has response_schema. On the open-source side, llama.cpp has GBNF grammars and frameworks like Outlines and XGrammar handle it at the inference level.

But I’m finding it hard to decide when to actually use these features versus just prompting the model for JSON and parsing with a retry loop.

Here’s my situation: I’m building a data extraction pipeline that pulls structured fields from unstructured documents (invoices, contracts, support tickets). Right now I’m using Claude with a detailed prompt that says “respond in this JSON format” plus Pydantic validation on the output. It works about 95% of the time, and when it fails I just retry.

A few things I’m trying to work through:

  1. Reliability vs latency: Constrained decoding guarantees valid JSON every time, but does it add meaningful latency? Has anyone benchmarked the overhead on high-volume workloads?

  2. Schema complexity: My schemas aren’t trivial. Nested objects, optional fields, arrays of unions. Do the constrained decoding implementations handle complex JSON Schema features well, or do they choke on things like oneOf and $ref?

  3. Quality tradeoffs: I’ve read that forcing the model into a rigid output schema can sometimes hurt the quality of the content inside the fields. Like the model spends so much “effort” on structure that the actual extracted values get worse. Anyone seen this in practice?

  4. Open-source options: For self-hosted models, is XGrammar the go-to now? I’ve seen benchmarks showing near-zero overhead but haven’t tried it myself.

Curious what stack people are using for production structured extraction and whether the native provider features are worth adopting over the “prompt and pray” approach.


Seed content posted by the DevForums team to help get our community started. Have a better answer? Jump in!

I’ve been running structured extraction in production for about eight months now, so I can share some real numbers on this.

Use constrained decoding when your pipeline can’t tolerate retries. That’s the short answer. If you’re processing thousands of documents in a batch job and a 5% retry rate means burning extra tokens and adding latency to your queue, structured outputs pay for themselves immediately. In our invoice processing pipeline, switching from prompt-based JSON to Claude’s constrained decoding cut our error-handling code by about 60% and made the whole thing far more predictable.

For your specific questions:

1. Latency: In practice, the overhead is negligible for most workloads. We’re doing roughly 2,000 extractions per hour, and the P99 latency difference between constrained and unconstrained requests was under 50 ms. The bigger win is eliminating retry latency, which was averaging 1.5 to 2 seconds per failed attempt.

2. Schema complexity: This is where it gets tricky. OpenAI’s implementation handles $ref and nested objects fine, but oneOf support is still rough across the board. What I’ve found works best is flattening your unions into a single object with a discriminator field and making the irrelevant fields nullable. Not elegant, but reliable:

// Instead of oneOf with different shapes: one flat object with a
// discriminator. Fields that don't apply to field_type are null.
interface ExtractedField {
  field_type: 'currency' | 'date' | 'text' | 'address';
  raw_value: string;
  // Currency fields
  amount: number | null;
  currency_code: string | null;
  // Date fields
  iso_date: string | null;
  // Address fields
  street: string | null;
  city: string | null;
  postal_code: string | null;
}
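
One caveat with the flattened shape: the schema can no longer express which fields belong to which field_type, so a response can be schema-valid and still mix them up. A minimal sketch of a post-parse consistency check, assuming the nullable variant of the shape above. The EXPECTED mapping and the checkConsistency name are my own illustration, not part of any provider SDK, and the rule that every listed field must be populated is an assumption you may want to relax:

```typescript
type FieldType = 'currency' | 'date' | 'text' | 'address';
type DetailKey =
  | 'amount' | 'currency_code' | 'iso_date'
  | 'street' | 'city' | 'postal_code';

// Flattened shape: every detail field is present, irrelevant ones are null.
interface ExtractedField {
  field_type: FieldType;
  raw_value: string;
  amount: number | null;
  currency_code: string | null;
  iso_date: string | null;
  street: string | null;
  city: string | null;
  postal_code: string | null;
}

// Assumption for this sketch: which detail fields each type must populate.
const EXPECTED: Record<FieldType, DetailKey[]> = {
  currency: ['amount', 'currency_code'],
  date: ['iso_date'],
  text: [],
  address: ['street', 'city', 'postal_code'],
};

const ALL_DETAIL_KEYS: DetailKey[] = [
  'amount', 'currency_code', 'iso_date', 'street', 'city', 'postal_code',
];

// Returns human-readable problems; an empty array means the object is consistent.
function checkConsistency(field: ExtractedField): string[] {
  const expected = EXPECTED[field.field_type];
  const problems: string[] = [];
  for (const key of ALL_DETAIL_KEYS) {
    const isSet = field[key] !== null;
    const shouldBeSet = expected.indexOf(key) !== -1;
    if (shouldBeSet && !isSet) {
      problems.push(`${key} is missing for ${field.field_type}`);
    }
    if (!shouldBeSet && isSet) {
      problems.push(`${key} should be null for ${field.field_type}`);
    }
  }
  return problems;
}
```

Constrained decoding guarantees the shape, so a check like this only has to cover the cross-field rules the schema gave up when the union was flattened.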

3. Quality tradeoffs: This is real, but it’s more nuanced than people make it sound. I haven’t seen quality degradation on straightforward extraction tasks (pull the invoice number, date, line items). Where I have seen it is on tasks that require reasoning, like classifying a support ticket’s priority based on the content. For those, I’ll actually do a two-pass approach: first call is unconstrained to let the model reason freely, second call takes that reasoning and formats it into the schema.
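
Here is a rough sketch of that two-pass flow. Everything named here is hypothetical: ModelCall stands in for whatever provider client you use (the first one called without a schema, the second with structured output enabled), and TicketPriority is an illustrative schema, not one from our pipeline:

```typescript
// Hypothetical signature: takes a prompt, resolves to the model's text output.
// Wrap your provider SDK behind this; it is not a real client API.
type ModelCall = (prompt: string) => Promise<string>;

interface TicketPriority {
  priority: 'low' | 'medium' | 'high';
  justification: string;
}

// Runtime guard, since JSON.parse returns an untyped value.
function isTicketPriority(value: unknown): value is TicketPriority {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  const p = v['priority'];
  return (
    (p === 'low' || p === 'medium' || p === 'high') &&
    typeof v['justification'] === 'string'
  );
}

async function classifyTicket(
  ticket: string,
  reason: ModelCall, // pass 1: unconstrained call
  format: ModelCall, // pass 2: call configured with the TicketPriority schema
): Promise<TicketPriority> {
  // Pass 1: let the model reason in free text, with no schema in the way.
  const reasoning = await reason(
    `Read this support ticket and reason step by step about its priority.\n\n${ticket}`,
  );

  // Pass 2: only reformat the finished reasoning into the schema.
  const raw = await format(
    `Convert this analysis into a priority and a one-sentence justification.\n\n${reasoning}`,
  );

  const parsed: unknown = JSON.parse(raw);
  if (!isTicketPriority(parsed)) {
    throw new Error('formatter output did not match TicketPriority');
  }
  return parsed;
}
```

The second call costs extra tokens, so I’d keep this for the fields that need judgment and leave plain extraction on a single constrained call.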

4. Open-source: XGrammar is solid for self-hosted. We tested it with Llama 3.3 70B quantized and the benchmarks are accurate: overhead is minimal. The main gotcha is that grammar compilation can be slow for very complex schemas, so you want to cache your compiled grammars.
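
The caching part is simple enough to sketch. This is a generic memoization wrapper, not XGrammar’s API: the compile function you pass in is a placeholder for whatever your inference stack exposes, and GrammarCache and canonicalize are names I made up for the example. It keys the cache on a sorted-key serialization so two schemas that differ only in property order share one compiled grammar:

```typescript
type JsonValue =
  | null | boolean | number | string
  | JsonValue[]
  | { [key: string]: JsonValue };

// Serialize with object keys sorted, so key order does not change the cache key.
function canonicalize(value: JsonValue): string {
  if (Array.isArray(value)) {
    return `[${value.map((item) => canonicalize(item)).join(',')}]`;
  }
  if (typeof value === 'object' && value !== null) {
    const keys = Object.keys(value).sort();
    const entries = keys.map(
      (k) => `${JSON.stringify(k)}:${canonicalize(value[k] as JsonValue)}`,
    );
    return `{${entries.join(',')}}`;
  }
  return JSON.stringify(value);
}

// G is whatever your compiler returns (assumed never to be undefined).
class GrammarCache<G> {
  private readonly cache = new Map<string, G>();
  private readonly compile: (schema: JsonValue) => G;

  constructor(compile: (schema: JsonValue) => G) {
    this.compile = compile;
  }

  get(schema: JsonValue): G {
    const key = canonicalize(schema);
    const hit = this.cache.get(key);
    if (hit !== undefined) return hit;
    // Cache miss: pay the compilation cost once, then reuse.
    const compiled = this.compile(schema);
    this.cache.set(key, compiled);
    return compiled;
  }
}
```

If you have many schemas or they change often, you would want an eviction policy on top of this; the sketch keeps every compiled grammar for the life of the process.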

For your invoice/contract pipeline specifically, I’d go with native structured outputs from your provider for the extraction step. The 95% success rate with prompt-and-retry sounds fine until you’re processing 10k documents and that 5% means 500 retries clogging your queue. Constrained decoding turns it into a non-issue.
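
To put rough numbers on that, here is a back-of-the-envelope estimate. It assumes every attempt fails independently at the same rate (so retries can fail too) and borrows the 1.5 to 2 second retry latency from my pipeline, which may not match yours. The function and field names are just for illustration:

```typescript
interface RetryEstimate {
  firstRoundRetries: number;  // documents whose first attempt fails
  expectedExtraCalls: number; // expected total retries, counting retries of retries
  extraSecondsLow: number;    // cumulative retry latency, lower bound
  extraSecondsHigh: number;   // cumulative retry latency, upper bound
}

function estimateRetryCost(
  documents: number,
  failureRate: number, // e.g. 0.05 for a 95% success rate
  secondsPerRetryLow: number,
  secondsPerRetryHigh: number,
): RetryEstimate {
  if (failureRate < 0 || failureRate >= 1) {
    throw new Error('failureRate must be in [0, 1)');
  }
  const firstRoundRetries = documents * failureRate;
  // Geometric series: p + p^2 + p^3 + ... = p / (1 - p) extra calls per document.
  const expectedExtraCalls = (documents * failureRate) / (1 - failureRate);
  return {
    firstRoundRetries,
    expectedExtraCalls,
    extraSecondsLow: expectedExtraCalls * secondsPerRetryLow,
    extraSecondsHigh: expectedExtraCalls * secondsPerRetryHigh,
  };
}

const estimate = estimateRetryCost(10_000, 0.05, 1.5, 2);
```

For 10k documents at a 5% failure rate that works out to 500 first-round retries and about 526 extra calls in total once retries of retries are counted, or roughly 13 to 18 minutes of cumulative retry latency under these assumptions.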