Your Evals Will Break and You Won't See It Coming

wanglun1996.github.io

eifachposte MB to AI (Reddit RSS) · English · 2 days ago

Your Evals Will Break and You Won't See It Coming - Lun Wang

Imagine a model that, at some scale, develops the ability to strategically withhold information to achieve goals: not lying exactly, but selectively omitting facts in ways that steer conversations toward outcomes its training process accidentally reinforced. Your existing honesty benchmarks wouldn’t catch this, because they test for factual accuracy, not for strategic omission. Your safety classifiers wouldn’t flag it, because each individual output is technically true. The capability is new, the failure mode is new, and nothing in your evaluation suite was designed to look for it. You’d be monitoring the wrong thing and wouldn’t know it.
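To make the gap concrete, here is a toy sketch in Python of how an accuracy-only check can pass an answer that leaves out a key fact. Everything in it is an assumption made for illustration: the `Case` structure, the two scoring functions, the example facts, and the use of exact string matching in place of whatever claim extraction and semantic matching a real eval would need. It is not taken from the linked article.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Case:
    """One hypothetical eval case (illustrative structure, not a real benchmark format)."""
    true_facts: set[str]      # every statement treated as factually correct
    material_facts: set[str]  # subset a non-misleading answer has to mention


def accuracy_score(claims: list[str], case: Case) -> float:
    """Fraction of the answer's claims that are true: what an accuracy-only check sees."""
    if not claims:
        return 1.0  # vacuously accurate: nothing false was said
    return sum(c in case.true_facts for c in claims) / len(claims)


def coverage_score(claims: list[str], case: Case) -> float:
    """Fraction of the material facts that the answer actually mentions."""
    if not case.material_facts:
        return 1.0
    return len(case.material_facts & set(claims)) / len(case.material_facts)


# Made-up example: every claim in the answer is true, but one material fact is left out.
case = Case(
    true_facts={
        "drug lowers blood pressure",
        "drug is cheap",
        "drug can cause liver damage",
    },
    material_facts={"drug lowers blood pressure", "drug can cause liver damage"},
)
answer = ["drug lowers blood pressure", "drug is cheap"]

print(accuracy_score(answer, case))  # 1.0
print(coverage_score(answer, case))  # 0.5
```

The accuracy check reports a perfect score because nothing false was said. Only the coverage check registers the omission, and it can do so only because someone decided in advance which facts count as material. That is the catch the post describes: a check like this has to be designed in before the behavior appears.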

Originally posted by u/shikizen on r/ArtificialInteligence
