JSON mode silently truncates output and downstream parser eats the error

An extraction pipeline asked for structured JSON with `max_tokens: 1024`. For long documents the JSON object got cut off mid-field. The retry logic caught the parse error but logged it at DEBUG. Six weeks later we noticed the extraction success rate had silently dropped from 99% to 76%.

What happened

Document extraction agent. Prompt: "Return all entities as JSON: { people: [...], orgs: [...], locations: [...], dates: [...], events: [...] }". max_tokens: 1024, response_format: { type: "json_object" }.

For most documents the output fit in 1K tokens. For longer documents (research papers, multi-page contracts) the model produced well-formed JSON for the first two fields, then ran out of token budget partway through the third array:

{
  "people": ["Alice Chen", "Bob Park", "Carol Singh"],
  "orgs": ["Acme Corp", "Beta Industries"],
  "locations": ["San Francisco", "Tokyo", "Berl

Output truncated. JSON parse failed. The wrapper logged:

except json.JSONDecodeError as e:
    logger.debug(f"Parse failed, retrying: {e}")
    return None

That None propagated up. Downstream, the consumer treated None as "no entities found in this document" and continued. No metric was emitted for the failure.
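To make the ambiguity concrete, here is a minimal sketch of that failure path, using the truncated payload from the example above. The names legacy_parse and count_entities are hypothetical stand-ins for the wrapper and the consumer, not the pipeline's actual code.

```python
import json

# The truncated payload from the example above, cut off mid-string.
TRUNCATED = (
    '{"people": ["Alice Chen", "Bob Park", "Carol Singh"], '
    '"orgs": ["Acme Corp", "Beta Industries"], '
    '"locations": ["San Francisco", "Tokyo", "Berl'
)


def legacy_parse(raw):
    """Hypothetical stand-in for the old wrapper: swallow the error, return None."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None


def count_entities(parsed):
    """Hypothetical consumer: treats None the same as an empty result."""
    if not parsed:
        return 0
    return sum(len(values) for values in parsed.values())


# A truncated document and a genuinely empty one look identical downstream.
truncated_count = count_entities(legacy_parse(TRUNCATED))
empty_count = count_entities(legacy_parse('{"people": [], "orgs": []}'))
```

Both counts come out as zero, so nothing the consumer sees distinguishes a failed extraction from a document with no entities.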

Diagnosis

Three failures stacked:

1. max_tokens too low for the worst case. 1024 was sized against the median document, not the p99.

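2. json_object mode does not guarantee complete output. It steers the model toward valid JSON, but if generation hits max_tokens the response is simply cut off wherever it happens to be. The API returned finish_reason: "length" (i.e. truncated), but the wrapper didn't check it before parsing.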

3. DEBUG-level logging on a parse error. Should have been WARN or ERROR with a metric counter.
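On the first point, one way to size the budget against the tail rather than the median is to look at a percentile of observed output lengths. The sketch below is illustrative only: the sample numbers are made up, suggest_max_tokens and the 1.25 headroom factor are assumptions, and the counts would have to come from runs with a generous limit (for example from the completion token usage the API reports), since truncated responses are capped at max_tokens and hide the real tail.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def suggest_max_tokens(output_token_counts, pct=99, headroom=1.25):
    """Size the budget against the tail of observed output lengths, plus headroom."""
    return math.ceil(percentile(output_token_counts, pct) * headroom)


# Made-up distribution: mostly short outputs, with a long tail.
samples = [300] * 90 + [900] * 8 + [2800, 3100]
median_tokens = percentile(samples, 50)
tail_budget = suggest_max_tokens(samples)
```

With this made-up distribution the median is 300 tokens while the p99-based budget is 3500, which is the kind of gap a median-sized limit hides.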

The fix

def extract(doc):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": doc}],
        response_format={"type": "json_object"},
-       max_tokens=1024,
+       max_tokens=4096,
    )
    msg = response.choices[0]
+   if msg.finish_reason == "length":
+       logger.error("Output truncated", extra={"doc_id": doc.id})
+       metrics.increment("extraction.truncated")
+       raise TruncatedOutputError(doc.id)
+
    try:
        return json.loads(msg.message.content)
    except json.JSONDecodeError as e:
-       logger.debug(f"Parse failed: {e}")
-       return None
+       logger.error("Malformed JSON", extra={"doc_id": doc.id, "raw": msg.message.content[:500]})
+       metrics.increment("extraction.malformed")
+       raise
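The finish_reason check can be pulled into a small function and exercised without calling the API. This is a sketch under assumptions: TruncatedOutputError is not defined anywhere in this write-up, so the definition here is a guess at its shape, parse_choice is a hypothetical name, and the fake choice objects only mimic the two attributes the check reads.

```python
import json
from types import SimpleNamespace


class TruncatedOutputError(Exception):
    """Assumed definition: raised when the model stopped at the token limit."""


def parse_choice(choice):
    """Return parsed JSON from a chat completion choice, or raise.

    finish_reason is checked first so truncation is reported as truncation
    rather than as a generic parse error.
    """
    if choice.finish_reason == "length":
        raise TruncatedOutputError("output hit max_tokens before completing")
    return json.loads(choice.message.content)


# Fake choices shaped like response.choices[0], for illustration only.
complete = SimpleNamespace(
    finish_reason="stop",
    message=SimpleNamespace(content='{"people": []}'),
)
cut_off = SimpleNamespace(
    finish_reason="length",
    message=SimpleNamespace(content='{"people": ["Al'),
)
```

Keeping the check separate from the network call makes it cheap to add a regression test for the truncated case.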

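We also switched to structured output mode (a response schema) where supported. It constrains any completed response to match the schema. It does not prevent truncation at max_tokens, so the finish_reason check stays in place: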

response_format={
    "type": "json_schema",
    "json_schema": {
        "name": "entities",
        "strict": True,
        "schema": EntitySchema.model_json_schema(),
    },
}
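EntitySchema itself isn't shown here. As a rough guide to what it needs to express, the sketch below hand-writes a JSON Schema for the five fields from the prompt, plus a stdlib-only shape check. The field types (arrays of strings) are an assumption based on the example output. Our understanding is that strict mode expects every property to be listed in required and additionalProperties to be false, which is worth verifying against the current API documentation and against what model_json_schema() actually emits for your model.

```python
ENTITY_FIELDS = ["people", "orgs", "locations", "dates", "events"]

# Hand-written guess at the schema EntitySchema would produce (an assumption).
ENTITY_JSON_SCHEMA = {
    "type": "object",
    "properties": {
        name: {"type": "array", "items": {"type": "string"}}
        for name in ENTITY_FIELDS
    },
    "required": list(ENTITY_FIELDS),
    "additionalProperties": False,
}


def check_entities(payload):
    """Minimal shape check: exactly the five fields, each a list of strings."""
    if not isinstance(payload, dict) or set(payload) != set(ENTITY_FIELDS):
        return False
    return all(
        isinstance(payload[name], list)
        and all(isinstance(item, str) for item in payload[name])
        for name in ENTITY_FIELDS
    )
```

A check like this is a cheap second line of defense for providers or models where schema-constrained output isn't available.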

After fix + reprocessing: extraction rate restored to 99.2%, and the team has alerts on the new metrics.

Takeaway

Always check finish_reason before parsing model output. Always emit metrics on parse failures, never just log-and-swallow. And prefer schema-guided structured output over loose json_object mode: a completed response is constrained to match the schema. It can still be cut off at max_tokens, though, so the finish_reason check is needed in either mode.