Original Reddit post

Anthropic made a pretty important change: skill-creator now supports creating and running evals, not just generating a skill. That's a bigger deal than it sounds, because it pushes the ecosystem toward the right mental model: skills/context are software, so they need tests.

This matters because the first version of a context/skill often "feels" helpful but isn't measurable. Evals force you to define scenarios and assertions, run them, and iterate, which is how you discover whether your skill actually changes outcomes or just adds tokens. What I like most is that eval creation is part of the default workflow.

Two early findings: local eval runs can be fragile and memory-heavy, especially once you're testing against real repos/tools, and if your eval depends on local env/repo state, reproducibility can get messy. I wrote up some deeper thoughts on this at https://tessl-io-sanity.vercel.app/blog/anthropic-brings-evals-to-skill-creator-heres-why-thats-a-big-deal/

Honest disclosure: I work at tessl.io, where we build tooling around skill/context evaluation (not trying to pitch here). If you're already using Claude Code and you want evals to be repeatable across versions/models and runnable in CI/CD, we've got docs on that and I'm happy to share if folks are interested.
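To make the "scenarios + assertions" idea concrete, here's a minimal sketch of what that loop looks like in the abstract. Every name here (`Scenario`, `run_evals`, `fake_skill`) is a hypothetical illustration, not Anthropic's actual skill-creator eval format:

```python
# Hypothetical sketch of a scenario + assertion eval loop for a skill.
# Illustrates the "define scenarios, run them, iterate" idea only;
# it does not reflect skill-creator's real eval schema.
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Scenario:
    name: str
    prompt: str
    # Each assertion inspects the skill's output and returns True/False.
    assertions: list[Callable[[str], bool]] = field(default_factory=list)


def run_evals(scenarios: list[Scenario],
              run_skill: Callable[[str], str]) -> dict[str, tuple[int, int]]:
    """Run each scenario through the skill; return (passed, total) per scenario."""
    results = {}
    for s in scenarios:
        output = run_skill(s.prompt)
        passed = sum(1 for check in s.assertions if check(output))
        results[s.name] = (passed, len(s.assertions))
    return results


# Stub standing in for an actual model/skill invocation.
def fake_skill(prompt: str) -> str:
    return f"refactored: {prompt}"


scenarios = [
    Scenario(
        name="basic refactor",
        prompt="rename this variable",
        assertions=[
            lambda out: "refactored" in out,
            lambda out: len(out) > 0,
        ],
    ),
]

print(run_evals(scenarios, fake_skill))
```

The point of structuring it this way is that the assertion counts give you a number to iterate against, instead of a vibe check, and swapping `fake_skill` for a real model call is what makes the run fragile/memory-heavy in practice.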

Originally posted by u/jorkim_32 on r/ClaudeCode