Original Reddit post

I was looking at benchmark scores on Humanity's Last Exam on the official website (https://agi.safe.ai/), and I see 34.2% for Opus 4.6. This is quite different from the score Anthropic claims, even without tools: https://www.anthropic.com/news/claude-opus-4-6

The source is the CAIS dashboard (https://dashboard.safe.ai/), where I also see a large gap on ARC-AGI-2: only 44.2% for Opus 4.6, versus the 64% to 69% in Anthropic's reported results. The difference can't come from reasoning effort alone, since even on the low effort setting the minimum Anthropic reports is 64%.

Could the context window or maximum output tokens be the reason? Or the evaluation methodology (zero-shot vs. few-shot)?
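To make those hypotheses concrete, here is a minimal sketch of querying the same model under two evaluation configs using the `anthropic` Python SDK's Messages API. The model ID `claude-opus-4-6`, the task prompt, and the specific token budgets are placeholders for illustration, not the actual CAIS or Anthropic harness settings:

```python
# Minimal sketch (not the CAIS or Anthropic eval harness): send the same
# task under two configs to see whether output-token and thinking-budget
# limits change the answer.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

TASK_PROMPT = "..."  # placeholder: e.g. one serialized ARC-AGI-2 grid task


def run(max_tokens: int, thinking_budget: int | None) -> str:
    """Query the model with a given output cap and optional thinking budget."""
    kwargs = {}
    if thinking_budget is not None:
        # Extended thinking: the model reasons in a scratchpad capped at
        # `budget_tokens` before emitting its final answer.
        kwargs["thinking"] = {"type": "enabled", "budget_tokens": thinking_budget}
    response = client.messages.create(
        model="claude-opus-4-6",  # hypothetical model ID for illustration
        max_tokens=max_tokens,    # hard cap on output; a low cap can truncate answers
        messages=[{"role": "user", "content": TASK_PROMPT}],
        **kwargs,
    )
    # Keep only the text blocks (thinking blocks are returned separately).
    return "".join(b.text for b in response.content if b.type == "text")


# A tight cap vs. a generous one: if the answers diverge mainly because the
# constrained run gets cut off, the dashboard gap could be a truncation
# artifact rather than a model-ability difference.
constrained = run(max_tokens=2048, thinking_budget=None)
generous = run(max_tokens=32000, thinking_budget=16000)
```

A few-shot variant of the same experiment would just prepend worked examples to `TASK_PROMPT`, which is how one could separate the token-limit hypothesis from the zero-shot vs. few-shot one.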

Originally posted by u/Hydrox__ on r/ArtificialInteligence