Current structured-output benchmarks only validate pass rates for JSON schema and types. More often, though, the real issue is inaccurate JSON values: for example, a hallucinated total_price when extracting values from an invoice, or an array ordered incorrectly because of inaccurate date mapping.
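To make the failure mode concrete, here is a minimal illustrative sketch (not the benchmark's code, and the numbers are made up): a response can satisfy the schema and type checks yet still contain a hallucinated value.

```python
# Toy schema check: total_price just has to be a number for the
# "JSON pass rate" style of validation to succeed.
def schema_valid(d):
    return isinstance(d.get("total_price"), (int, float))

ground_truth = {"total_price": 1250.00}   # value printed on the invoice
model_output = {"total_price": 1520.00}   # hallucinated: digits transposed

assert schema_valid(model_output)  # schema/type validation passes
value_correct = model_output["total_price"] == ground_truth["total_price"]
print(value_correct)  # False: schema pass rate hides the wrong value
```

A schema-only benchmark scores this output as a success; a value-accuracy benchmark scores it as a failure.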
The Structured Output Benchmark measures 7 key metrics instead of JSON schema validity alone:
Value Accuracy (primary): exact leaf-value match against verified ground truth
JSON Pass Rate, Type Safety, Path Recall, Structure Coverage (structural)
Faithfulness: are values grounded in context or hallucinated?
Perfect Response: every single leaf value correct
Modalities: text, image and audio
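The primary metric, exact leaf-value matching, can be sketched roughly as follows (a hypothetical implementation, not the benchmark's actual scorer): flatten prediction and ground truth into (path, value) pairs, then score the fraction of ground-truth leaves matched exactly.

```python
# Recursively flatten a JSON-like object into (path, leaf_value) pairs.
def leaves(obj, path=()):
    if isinstance(obj, dict):
        for k, v in obj.items():
            yield from leaves(v, path + (k,))
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            yield from leaves(v, path + (i,))
    else:
        yield path, obj

# Fraction of ground-truth leaves whose predicted value matches exactly.
def value_accuracy(pred, truth):
    t = dict(leaves(truth))
    p = dict(leaves(pred))
    correct = sum(1 for path, v in t.items() if p.get(path) == v)
    return correct / len(t)

truth = {"invoice": {"total_price": 1250.0, "items": ["pen", "ink"]}}
pred  = {"invoice": {"total_price": 1520.0, "items": ["pen", "ink"]}}
print(value_accuracy(pred, truth))  # 2 of 3 leaves match -> 0.666...
```

Metrics like Path Recall (how many ground-truth paths appear at all) and Perfect Response (value_accuracy == 1.0) fall out of the same flattened representation.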
Overall benchmark results
Open source is doing pretty well, with GLM 4.7 coming in at number 2, right below GPT 5.4.
JSON-pass vs Value-Accuracy gap
What’s interesting here is that while most models hit 90%+ on JSON schema pass rate, all of them drop significantly on value accuracy.
Overall best by modality
Full breakdown blog:
https://interfaze.ai/blog/introducing-structured-output-benchmark
Full leaderboard:
https://interfaze.ai/leaderboards/structured-output-benchmark
Paper:
https://interfaze.ai/sob_paper.pdf
(Pending arXiv)
The full breakdown goes deeper into the different modalities, how we designed the dataset, and how we ran the benchmark. All code and the dataset are open source 😄
Our goal is to be the best general model for deterministic tasks, and a key aspect of determinism is controllable, consistent output structure.
Originally posted by u/404llm on r/ArtificialInteligence
