Original Reddit post

Denise Holt: 🔴 Seed IQ is now at 10/10 games solved on ARC-AGI 3 🥳🙌🏻

This week we’ve had a lot of people suggesting that our posts represent only our own report/interpretation of scores/performance and are somehow “not official.” We’ve also had accusations of “faking it.”

➡️ Make no mistake: these LIVE scorecards ARE the OFFICIAL evaluation, validated by ARC Prize themselves, of Seed IQ’s performance. The scorecards sit on the ARC Prize website and are generated by them, not us. These details are served from their end, recording and evaluating every detail of game performance on every level of every game Seed IQ plays. They even include replays of every level. 🔸 It doesn’t get more official than this. 🔸

▪️ The only thing that is not happening for us is placing Seed IQ on the leaderboard. That is because the ARC Prize rules state that you must turn over your entire codebase and commercial rights to your system in order to be recognized as a contender on the leaderboard (officially entering the contest portion of the benchmark).

▪️ We asked for a private evaluation and offered to forgo prize money, and Greg Kamradt told us that option wasn’t available at this time.

▪️ Yet they clearly do it for the frontier models. Last week they evaluated both ChatGPT 5.5 (scored 0.43%) and Claude Opus 4.7 (scored 0.18%), and he gave a detailed report of what they observed of those models’ performance on the backend.

▪️ After I posted about our 5th game win, Greg commented on X about the steps he observed on the backend of our play, and he asked me what priors we are using.

➡️ They see everything we are doing. They are giving us our OFFICIAL SCORES. (If this were something you could fake, why don’t you see anyone else posting scores like this? Why wouldn’t the ARC Prize folks be calling us out for cheating? I’ve seen them call out people for spreading misinformation about the contest.)
You would think they would acknowledge Seed IQ’s performance publicly, the same way they do for frontier models whose makers clearly aren’t turning over their codebases either, especially because we are the only system acing these challenges and crushing this benchmark.

▪️ ARC Prize has positioned itself as an entity that evaluates the best of AI. They have made it clear in the past that they do not believe DL/RL has any ability to adapt, or to reason, plan, and act across novel environments. ARC-AGI 3 was positioned as an effort to spotlight advanced systems that actually can do that, and yet proprietary systems are being ignored while the entire benchmark caters to DL/RL systems that cannot even score 1% on the challenges. This raises a much deeper question about the real objective of the benchmark. 🤷🏻‍♀️

✅ Either way, we’ll keep letting Seed IQ play their games, because regardless of the leaderboard, the benchmark still acts as an official evaluation and validation of its performance. 🥳🚀

LIVE Scorecard for 10/10 games in comments… #AIX #SeedIQ https://arcprize.org/scorecards/b65d86f3-d36f-43cb-abf9-bfa4e138d7d8

Originally posted by u/Fit_Transition8824 on r/ArtificialInteligence