Anthropic made a pretty important change: skill-creator now supports creating + running evals (not just generating a skill).
that’s a bigger deal than it sounds, because it pushes the ecosystem toward the right mental model: skills/context are software → they need tests.
this matters because the first version of a context/skill often “feels” helpful but isn’t measurable.
evals force you to define scenarios + assertions, run them, and iterate - which is how you discover whether your skill actually changes outcomes or just adds tokens. what i like most is eval creation being part of the default workflow.
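to make the scenarios + assertions idea concrete, here's a minimal sketch of that loop. this is my own toy illustration, not the actual skill-creator eval format - `Scenario`, `run_evals`, and `fake_skill` are all hypothetical names:

```python
# toy eval loop: define scenarios + assertions, run them, inspect failures.
# hypothetical sketch, NOT the real skill-creator format.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Scenario:
    name: str
    prompt: str                    # input handed to the skill/model
    check: Callable[[str], bool]   # assertion on the output

def run_evals(scenarios: list[Scenario], run_skill: Callable[[str], str]) -> dict:
    """run each scenario through the skill and record pass/fail."""
    return {s.name: s.check(run_skill(s.prompt)) for s in scenarios}

# stand-in for a real model/skill call, so the sketch is runnable
def fake_skill(prompt: str) -> str:
    return "def add(a, b): return a + b"

scenarios = [
    Scenario("returns code", "write an add function",
             lambda out: "def " in out),
    Scenario("handles negatives note", "write an add function",
             lambda out: "negative" in out),
]

print(run_evals(scenarios, fake_skill))
# the failing scenario is the signal: that's what you iterate the skill against
```

the point is just that once assertions exist, "does this skill help" stops being vibes and becomes a pass/fail diff you can track across versions.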
2 early findings:
- local eval runs can be fragile + memory-heavy, especially once you’re testing against real repos/tools.
- if your eval depends on local env/repo state, reproducibility can get messy.
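one small mitigation for the reproducibility point: record the repo/env state alongside each run, so you can at least explain drift between runs. a minimal sketch - `env_fingerprint` is a hypothetical helper i'm making up here, not part of any tool:

```python
# hypothetical helper: capture enough env state to explain eval drift.
# attach the returned dict to each eval report.
import platform
import subprocess

def env_fingerprint() -> dict:
    """record python/platform version and the current git commit, if any."""
    info = {
        "python": platform.python_version(),
        "platform": platform.system(),
    }
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True,
        )
        info["commit"] = result.stdout.strip() or "unknown"
    except (FileNotFoundError, OSError):
        # git not installed / not available
        info["commit"] = "unknown"
    return info
```

it doesn't make runs reproducible by itself, but it turns "why did this eval flip?" from a mystery into a diff you can look at.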
i wrote up some deeper thoughts on this at https://tessl-io-sanity.vercel.app/blog/anthropic-brings-evals-to-skill-creator-heres-why-thats-a-big-deal/
honest disclosure: i work at tessl.io, where we build tooling around skill/context evaluation (not trying to pitch here).
if you’re already using Claude Code and you want evals to be repeatable across versions/models + runnable in CI/CD, we’ve got docs on that and i’m happy to share if folks are interested.
Originally posted by u/jorkim_32 on r/ClaudeCode
