Original Reddit post

Current structured output benchmarks only validate pass rate for JSON schema and types; however, more commonly the issue is inaccurate JSON values. For example, a hallucinated `total_price` number when extracting values from an invoice, or an array ordered wrongly because of inaccurate date mapping.

The Structured Output Benchmark measures 7 key metrics instead of JSON schema alone:

- **Value Accuracy** (primary): exact leaf-value match against verified ground-truth JSON
- **Pass Rate**, **Type Safety**, **Path Recall**, **Structure Coverage** (structural)
- **Faithfulness**: are values grounded in the context or hallucinated?
- **Perfect Response**: every single leaf value correct

Modalities: text, image, and audio.

**Overall benchmark results**

Open source is doing pretty well, with GLM 4.7 coming in at number 2, right below GPT 5.4.

**JSON-pass vs Value-Accuracy gap**

What's interesting here is that while most models hit 90%+ on JSON schema pass, all of them drop significantly on value accuracy.

**Overall best by modality**

Full breakdown blog: https://interfaze.ai/blog/introducing-structured-output-benchmark

Full leaderboard: https://interfaze.ai/leaderboards/structured-output-benchmark

Paper: https://interfaze.ai/sob_paper.pdf (pending arXiv)

The full breakdown goes deeper into the different modalities, how we designed the dataset, and how we ran the benchmark. All code and the dataset are open source 😄

Our goal is to be the best general model for deterministic tasks, and a key aspect of determinism is controllable and consistent output structure.

submitted by /u/404llm
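The post defines Value Accuracy as exact leaf-value match against ground-truth JSON. As a minimal sketch of how such a metric could be computed (the helper names `leaf_paths` and `value_accuracy` are mine, not from the benchmark's code, and the real benchmark may handle type coercion and ordering differently):

```python
import json

def leaf_paths(obj, prefix=()):
    """Yield (path, value) pairs for every leaf in a JSON-like object."""
    if isinstance(obj, dict):
        for key, val in obj.items():
            yield from leaf_paths(val, prefix + (key,))
    elif isinstance(obj, list):
        for idx, val in enumerate(obj):
            yield from leaf_paths(val, prefix + (idx,))
    else:
        yield prefix, obj  # scalar leaf: string, number, bool, or null

def value_accuracy(pred, truth):
    """Fraction of ground-truth leaves whose value the prediction matches exactly."""
    truth_leaves = dict(leaf_paths(truth))
    pred_leaves = dict(leaf_paths(pred))
    if not truth_leaves:
        return 1.0
    correct = sum(1 for path, val in truth_leaves.items()
                  if pred_leaves.get(path) == val)
    return correct / len(truth_leaves)

# A schema-valid extraction can still hallucinate a value:
truth = {"invoice": {"total_price": 120.50, "items": ["widget", "gadget"]}}
pred  = {"invoice": {"total_price": 125.00, "items": ["widget", "gadget"]}}
print(value_accuracy(pred, truth))  # 2 of 3 leaves match
```

This illustrates the JSON-pass vs value-accuracy gap directly: `pred` would pass any schema check (right keys, right types), yet only two of the three leaf values are correct.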

Originally posted by u/404llm on r/ArtificialInteligence