Original Reddit post

I built a Claude Code plugin that generates ADE-Bench tasks from real dbt projects: https://github.com/typedef-ai/ade-bench-plugin

The problem I’m trying to solve: eval runners are relatively easy to build, but good eval tasks are still mostly hand-written. For coding/data agents, useful tasks need real repo context: dependencies, tests, business logic, edge cases, and failure modes. Otherwise it’s very easy to end up with toy tasks that don’t tell you much.

This plugin generates tasks using a setup / solve / verify loop:

- mutate a working dbt project with a realistic bug
- give the agent only the observable symptom
- verify the fix with dbt tests and table comparisons

Example: introduce a subtle join/aggregation bug in a model, prompt the agent with the downstream data issue, then verify the repaired output against expected tables.

Current commands:

- plan-tasks: Claude Code inspects the project and helps design benchmark tasks
- create-task: generates tasks from known SQL/dbt bug patterns

I’m interested in how other people building with Claude Code are handling evals. Are you writing tasks by hand, replaying real issues from your repos, generating synthetic tasks, or mostly relying on manual testing?

The part I’m trying to understand better is how to create eval tasks that are realistic enough to matter, but still reproducible and automatically verifiable.
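For anyone who wants to see the shape of the setup / solve / verify loop, here is a minimal sketch. It is not the plugin's actual implementation: it uses Python's built-in `sqlite3` as a stand-in for a dbt project, and the tables, model SQL, symptom text, and helper names (`build_db`, `run_model`, `verify`) are all hypothetical. The injected bug is a join fan-out, one example of the join/aggregation bugs mentioned above.

```python
import sqlite3

# Hypothetical "working model": revenue per customer.
CORRECT_MODEL = """
SELECT customer_id, SUM(amount) AS revenue
FROM orders
GROUP BY customer_id
ORDER BY customer_id
"""

# Setup step: a mutated model with a join fan-out. Orders with more than
# one payment row get their amount counted once per payment.
BUGGY_MODEL = """
SELECT o.customer_id, SUM(o.amount) AS revenue
FROM orders o
JOIN payments p ON p.order_id = o.order_id
GROUP BY o.customer_id
ORDER BY o.customer_id
"""

# The agent would only see the observable symptom, not the diff.
SYMPTOM = "Revenue for some customers is higher than the sum of their orders."


def build_db():
    """Create an in-memory database with toy source tables (assumed data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount INTEGER);
        CREATE TABLE payments (payment_id INTEGER, order_id INTEGER);
        INSERT INTO orders VALUES (1, 10, 100), (2, 10, 50), (3, 20, 70);
        INSERT INTO payments VALUES (1, 1), (2, 1), (3, 2), (4, 3);
        """
    )
    return conn


def run_model(conn, sql):
    """Run a model query and return its rows as a list of tuples."""
    return conn.execute(sql).fetchall()


def verify(conn, candidate_sql, expected_rows):
    """Verify step: compare the candidate's output table to the expected one."""
    return run_model(conn, candidate_sql) == expected_rows


conn = build_db()
expected = run_model(conn, CORRECT_MODEL)            # snapshot before mutation
buggy_passes = verify(conn, BUGGY_MODEL, expected)   # mutated project
fixed_passes = verify(conn, CORRECT_MODEL, expected) # stand-in for the agent's fix

print(SYMPTOM)
print("buggy model passes:", buggy_passes)
print("fixed model passes:", fixed_passes)
```

The expected table is snapshotted before the mutation, so verification doesn't depend on anyone hand-writing the right answer. In a real dbt project the comparison would run against materialized tables alongside `dbt test`, rather than against query results in memory.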

Originally posted by u/cpardl on r/ClaudeCode