Last time I tested skill activation hooks I got 84% with Haiku 4.5. That was using the API though, not the actual CLI. So I built a proper eval harness. This time: real `claude -p` commands inside Daytona sandboxes, Sonnet 4.5, 22 test prompts across 5 hook configs, two full runs.

Results:

- No hook (baseline): ~50-55% activation
- Simple instruction hook: ~50-59%
- `type: "prompt"` hook (native): ~41-55% (same as no hook)
- forced-eval hook: 100% (both runs)
- llm-eval hook: 100% (both runs)

Both structured hooks hit 100% activation AND 100% correct skill selection across 44 tests each. But when I tested with 24 harder prompts (ambiguous queries plus non-Svelte prompts where the right answer is "no skill"), the difference showed up:

- forced-eval: 75% overall, 0 false positives
- llm-eval: 67% overall, 4 false positives (hallucinated skill names for React/TypeScript queries)

forced-eval makes Claude evaluate each skill YES/NO before proceeding. That commitment mechanism works both ways: it forces activation when skills match AND forces restraint when they don't. llm-eval pre-classifies with Haiku but hallucinates recommendations when nothing matches.

Other findings:

- Claude does keyword matching, not semantic matching, at the activation layer. Prompts with `$state` or `command()` activate every time; "How do form actions work?" gets missed ~60-80% of the time.
- Native `type: "prompt"` hooks performed identically to no hook. The prompt hook output seems to get deprioritised.
- When Claude does activate, it always picks the right skill. The problem is purely activation, not selection.

Total cost: $5.59 across ~250 invocations.

Recommendation: forced-eval hook. 100% activation, zero false positives, no API key needed.

Full write-up: https://scottspence.com/posts/measuring-claude-code-skill-activation-with-sandboxed-evals

Harness + hooks: https://github.com/spences10/svelte-claude-skills
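For context on the mechanism: a forced-eval style hook can be wired up as a `UserPromptSubmit` hook in `.claude/settings.json`, since whatever a command hook prints to stdout gets injected into context before the model sees the prompt. This is a minimal sketch under that assumption, not the exact hook from the repo, and the file path and skill names in it are hypothetical:

```json
{
  "hooks": {
    "UserPromptSubmit": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "cat .claude/hooks/forced-eval.md"
          }
        ]
      }
    ]
  }
}
```

where `forced-eval.md` would contain the checklist itself, along the lines of "Before responding, answer YES or NO for each installed skill (svelte-runes? sveltekit-routing? ...); if any answer is YES, activate that skill first; if all are NO, proceed without one." The point is the explicit per-skill YES/NO commitment, which is what produces both the 100% activation and the zero false positives.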
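The headline numbers are just counts over labelled prompt sets. As a sketch (my naming, not the harness's), scoring one run might look like this, with `None` as the expected value for prompts where the right answer is "no skill", so that activating on those counts as a false positive:

```python
# Score a batch of eval results. Each result is a pair of
# (skill that activated, skill that should have activated);
# None means "no skill". Activating when None was expected
# is a false positive (the llm-eval failure mode above).

def score(results):
    correct = 0
    false_positives = 0
    for activated, expected in results:
        if activated == expected:
            correct += 1
        elif expected is None and activated is not None:
            false_positives += 1
    return {
        "overall": correct / len(results),
        "false_positives": false_positives,
    }

# Toy run: 3 of 4 prompts handled correctly, one hallucinated activation.
run = [
    ("svelte-runes", "svelte-runes"),  # correct activation
    (None, None),                      # correct restraint on a React prompt
    ("sveltekit-routing", "sveltekit-routing"),  # correct activation
    ("svelte-runes", None),            # false positive
]
print(score(run))  # {'overall': 0.75, 'false_positives': 1}
```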
Originally posted by u/spences10 on r/ClaudeCode
