Current structured-output benchmarks only validate pass rates for JSON schema and types. More often, though, the real issue is inaccurate JSON values: for example, a hallucinated total_price when extracting values from an invoice, or an array ordered incorrectly because of inaccurate date mapping.
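To make the failure mode concrete, here is a minimal illustrative sketch (not the benchmark's code, and the numbers are made up): a response can satisfy the schema and type checks yet still contain a hallucinated value.

```python
# Toy schema check: total_price just has to be a number for the
# "JSON pass rate" style of validation to succeed.
def schema_valid(d):
    return isinstance(d.get("total_price"), (int, float))

ground_truth = {"total_price": 1250.00}   # value printed on the invoice
model_output = {"total_price": 1520.00}   # hallucinated: digits transposed

assert schema_valid(model_output)  # schema/type validation passes
value_correct = model_output["total_price"] == ground_truth["total_price"]
print(value_correct)  # False: schema pass rate hides the wrong value
```

A schema-only benchmark scores this output as a success; a value-accuracy benchmark scores it as a failure.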
The Structured Output Benchmark measures 7 key metrics instead of JSON schema validity alone:
Value Accuracy (primary): exact leaf-value match against verified ground truth
JSON Pass Rate, Type Safety, Path Recall, Structure Coverage (structural)
Faithfulness: are values grounded in context or hallucinated?
Perfect Response: every single leaf value correct
Modalities: text, image and audio
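The primary metric, exact leaf-value matching, can be sketched roughly as follows (a hypothetical implementation, not the benchmark's actual scorer): flatten prediction and ground truth into (path, value) pairs, then score the fraction of ground-truth leaves matched exactly.

```python
# Recursively flatten a JSON-like object into (path, leaf_value) pairs.
def leaves(obj, path=()):
    if isinstance(obj, dict):
        for k, v in obj.items():
            yield from leaves(v, path + (k,))
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            yield from leaves(v, path + (i,))
    else:
        yield path, obj

# Fraction of ground-truth leaves whose predicted value matches exactly.
def value_accuracy(pred, truth):
    t = dict(leaves(truth))
    p = dict(leaves(pred))
    correct = sum(1 for path, v in t.items() if p.get(path) == v)
    return correct / len(t)

truth = {"invoice": {"total_price": 1250.0, "items": ["pen", "ink"]}}
pred  = {"invoice": {"total_price": 1520.0, "items": ["pen", "ink"]}}
print(value_accuracy(pred, truth))  # 2 of 3 leaves match -> 0.666...
```

Metrics like Path Recall (how many ground-truth paths appear at all) and Perfect Response (value_accuracy == 1.0) fall out of the same flattened representation.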
Overall benchmark results
Open source is doing pretty well, with GLM 4.7 coming in at number 2, right below GPT 5.4.
JSON-pass vs Value-Accuracy gap
What’s interesting here is that while most models hit 90%+ on JSON schema pass rate, all of them drop significantly on value accuracy.
Overall best by modality
Full breakdown blog:
https://interfaze.ai/blog/introducing-structured-output-benchmark
Full leaderboard:
https://interfaze.ai/leaderboards/structured-output-benchmark
Paper:
https://interfaze.ai/sob_paper.pdf
(Pending arXiv)
The full breakdown goes deeper into the different modalities, how we designed the dataset, and how we ran the benchmark. All code and the dataset are open source 😄
Our goal is to be the best general model for deterministic tasks, and a key aspect of determinism is controllable, consistent output structure.
Originally posted by u/404llm on r/ArtificialInteligence
