Original Reddit post

Anthropic quietly shipped an updated skill creator plugin, and I wanted to see what it actually does in practice, not just read the blog post. So I pointed it at one of my existing skills: a title generation skill for YouTube videos. I had no idea if this skill was actually helping or if Claude would do just as well without it. Here is what happened.

What the skill creator actually gives you

Six things, and they all matter:

Evals

  • You write a prompt, define what the expected behavior should be, and the system tells you pass or fail. No more “try it twice and hope for the best.”

Benchmarks
  • Track three things over time: eval pass rate, how long the skill takes to run, and token usage. So you can see if changes make things better or worse.

Multi-agent parallel testing
  • It spins up multiple independent agents in clean, isolated contexts. Each runs your skill separately, so there is no cross-contamination between tests.

A/B comparator
  • Runs a blind comparison between two versions of a skill. The grading agents don’t know which version is which, so you get an honest answer about which one performs better.

Description optimization
  • Analyzes your skill’s description against sample prompts and tells you whether it’s triggering correctly. Anthropic says they improved triggering on 5 of their 6 public document skills just by optimizing descriptions. If your skills aren’t firing when they should, the description is probably the issue, not the skill itself.

Four-agent pipeline under the hood
  • An executor, a grader, a comparator, and an analyzer working together.

The two skill types (this changes how you test)

Anthropic defines two categories:

Capability uplift
  • Skills that teach Claude something it’s not good at on its own. Example: front-end design. These can decay when a smarter model comes out, because the newer model has already learned what the skill was teaching. The eval framework catches that.

Encoded preference
  • Skills that tell Claude to follow a specific workflow or order. Claude already knows how to do each step, but you want it done your way every time. These need fidelity testing (is it following all the instructions in the right order?) rather than A/B testing.

What happened when I tested my title skill

I asked the skill creator to test my title generation skill. Here’s the process:

  • It generated eval test cases automatically, with specific assertions (output must contain at least 5 title options, must include weighted scoring, each title must cite a specific source, etc.)
  • It spawned 6 parallel test runs: 3 with my skill loaded and 3 without, as a baseline
  • While waiting, it analyzed the current skill and found 7 specific weaknesses, including: no concrete examples of good vs. bad titles, rigid hook variants, no guidance on title length, and no negative examples showing patterns to avoid
  • The baseline runs showed decent quality but lacked structured scoring, so the skill was adding value, just not as much as it could
  • It rewrote the skill targeting all 7 weaknesses, then re-ran the evals

Results: a 100% pass rate with the improved skill (33/33 assertions) vs. about 60% without it (20/33). You also get a localhost dashboard where you can review outputs side by side and provide feedback.

How to install

Open Claude Code, type /plugin, search for “skill creator”, and install it. Run /reload-plugins to load it. That’s it.

The thing that matters most

You can now continuously eval your skills whenever a new model drops. You stop guessing and start measuring. The description optimization alone is worth the install: fixing wrong triggers was one of my biggest frustrations with skills.

I recorded my test with Claude skills 2.0 here -> https://youtu.be/jaUUoHWXz7Y
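For intuition on the pass/fail evals described above: here is a rough Python sketch of the kind of assertions the plugin generated for my title skill. The function, the numbered-title format, and the string checks are my own illustration, not the plugin’s actual eval format.

```python
import re

def check_title_output(output: str) -> dict:
    """Hypothetical pass/fail assertions against a title skill's output.

    Mirrors the kinds of checks described in the post (at least 5 titles,
    weighted scoring present, every title cites a source); the real plugin
    defines its own eval format.
    """
    # Assume titles come back as numbered lines like "1. Some Title (source: ...)"
    titles = re.findall(r"^\d+\.\s+.+$", output, flags=re.MULTILINE)
    results = {
        "at_least_5_titles": len(titles) >= 5,
        "has_weighted_scoring": "score" in output.lower(),
        "every_title_cites_source": all("source" in t.lower() for t in titles),
    }
    results["passed"] = all(results.values())
    return results

sample = "\n".join(
    f"{i}. Title option {i} (source: transcript) score: {90 - i}"
    for i in range(1, 6)
)
print(check_title_output(sample)["passed"])  # True for this toy sample
```

Once assertions are this explicit, “33/33 vs. 20/33” becomes a number you can re-measure after every skill edit or model update instead of an impression.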
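The blind A/B comparator boils down to one trick: randomize which version the grader sees as “A” so its judgment can’t be biased by knowing which output came from the rewrite. A minimal sketch of that idea (my own illustration, not the plugin’s four-agent implementation; `grader` stands in for a grading agent that returns "A" or "B"):

```python
import random

def blind_compare(output_v1: str, output_v2: str, grader) -> str:
    """Blind A/B comparison: the grader never learns which version is which."""
    pair = [("v1", output_v1), ("v2", output_v2)]
    random.shuffle(pair)  # hide version identity behind positions A and B
    verdict = grader(pair[0][1], pair[1][1])  # grader sees only the two texts
    winner = pair[0] if verdict == "A" else pair[1]
    return winner[0]  # un-blind only after grading

# Toy grader that prefers the longer output, just to exercise the harness.
longer = lambda a, b: "A" if len(a) > len(b) else "B"
print(blind_compare("short", "a much longer answer", longer))  # v2
```

Whichever way the shuffle lands, the grader’s verdict maps back to the right version only after it has committed to an answer.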
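And the benchmarks amount to recording the three metrics mentioned above per run, so a regression shows up when you re-test after a model update. A hypothetical sketch (the record shape is mine, not the plugin’s storage format; the duration and token numbers below are illustrative, only the 20/33 and 33/33 pass counts come from my test):

```python
import time
from dataclasses import dataclass

@dataclass
class BenchmarkRecord:
    """One benchmark data point: the three metrics tracked over time."""
    timestamp: float
    pass_rate: float   # fraction of eval assertions passed
    duration_s: float  # wall-clock time for the skill run
    tokens_used: int

history: list[BenchmarkRecord] = []

def record_run(passed: int, total: int, duration_s: float, tokens: int) -> None:
    history.append(BenchmarkRecord(time.time(), passed / total, duration_s, tokens))

record_run(20, 33, duration_s=41.0, tokens=12000)  # baseline (illustrative cost)
record_run(33, 33, duration_s=39.0, tokens=11500)  # improved skill
print(f"pass rate: {history[0].pass_rate:.0%} -> {history[1].pass_rate:.0%}")
```

Append one record per eval run and the “did the new model break my skill?” question becomes a diff between the last two rows.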

Originally posted by u/hashpanak on r/ClaudeCode