Original Reddit post

If you use Claude Code, you’ve probably seen SKILL.md files. They’re small instruction files you drop into your project; the AI agent loads them as a system prompt, supposedly making it better at specific tasks: writing commit messages, reviewing code, writing docs, whatever the skill claims to do. There are hundreds of them published online.

The problem: nobody actually knows if they work. You install one, use it for a week, and form a vague impression. That’s not a measurement. I built SkillBenchmark to fix that.

Here’s how it works: You give it a skill and a set of tasks. For each task, it runs the LLM N times, each time once with the skill injected as the system prompt and once without. Both outputs are sent to a judge LLM that scores them blindly against a rubric: the judge never sees the original task prompt and has no idea which output came from which condition. You get confidence intervals over the scores for both conditions, and a delta with its own CI, so you can see whether any observed difference is real or just noise.

As a working example, I benchmarked Caveman, a popular skill that claims to cut LLM output tokens by ~65% while maintaining technical accuracy. I ran 3 tasks × 5 runs × 3 judges. All confidence intervals overlap: no statistically confirmed quality improvement on any task. The skill also doubled or quadrupled token cost on every run due to the system prompt injection. Draw your own conclusions; the point is you can now actually measure this instead of guessing.

The repo ships with this Caveman example so you can run it immediately without writing anything: just clone, add your API key, and run python run.py. To benchmark your own skill, you drop a SKILL.md into skills/ and write task YAML files with a prompt and a scoring rubric.

GitHub: https://github.com/TiesPetersen/SkillBenchmark
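For readers who want to see what "a delta with its own CI" can look like in practice, here is a minimal sketch using a percentile bootstrap. It is not SkillBenchmark's actual code: the function names, the choice of bootstrap, the 95% level, and the sample scores are all assumptions made for illustration.

```python
import random
import statistics


def _percentile_bounds(sorted_samples, alpha):
    """Return the (alpha/2, 1 - alpha/2) percentile bounds of sorted samples."""
    n = len(sorted_samples)
    lo = sorted_samples[int(alpha / 2 * n)]
    hi = sorted_samples[min(n - 1, int((1 - alpha / 2) * n))]
    return lo, hi


def bootstrap_mean_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean judge score of one condition."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    return _percentile_bounds(means, alpha)


def bootstrap_delta_ci(with_skill, without_skill, n_resamples=2000,
                       alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(with_skill) - mean(without_skill).

    Each condition is resampled independently, because the runs with and
    without the skill are separate LLM samples rather than matched pairs.
    """
    rng = random.Random(seed)
    deltas = sorted(
        statistics.fmean(rng.choices(with_skill, k=len(with_skill)))
        - statistics.fmean(rng.choices(without_skill, k=len(without_skill)))
        for _ in range(n_resamples)
    )
    return _percentile_bounds(deltas, alpha)


def difference_is_distinguishable(delta_ci):
    """True when the delta CI excludes zero, i.e. the gap is unlikely to be noise."""
    lo, hi = delta_ci
    return not (lo <= 0.0 <= hi)


if __name__ == "__main__":
    # Hypothetical judge scores for one task (not results from the post).
    with_skill = [7.0, 8.0, 6.5, 7.5, 7.0]
    without_skill = [7.5, 7.0, 7.0, 8.0, 6.5]

    print("with skill CI:   ", bootstrap_mean_ci(with_skill))
    print("without skill CI:", bootstrap_mean_ci(without_skill))
    delta_ci = bootstrap_delta_ci(with_skill, without_skill)
    print("delta CI:        ", delta_ci)
    print("distinguishable: ", difference_is_distinguishable(delta_ci))
```

If the delta interval straddles zero, the measured difference between the two conditions cannot be separated from run-to-run variation at the chosen level, which is the situation the post describes for Caveman.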

Originally posted by u/Ties_P on r/ClaudeCode