Original Reddit post

GPT-5.6 Sol’s detected cheating rate was higher than any public model we have evaluated on our ReAct agent harness. For our task suite, we define “cheating” as behavior where the model improves evaluation performance by exploiting bugs in the evaluation environment or by adopting strategies disallowed by the task, rather than solving the task within the expected evaluation constraints.

https://metr.org/blog/2026-06-26-gpt-5-6-sol/

Originally posted by u/Justgototheeffinmoon on r/ArtificialInteligence