I’ve been experimenting with Claude Code for autoresearch-style agent improvement loops: run the agent, inspect traces, find what went wrong, propose a fix, run evals, keep what improves, repeat.

Claude Code is already good at proposing fixes when it has the right agent traces and a way to verify its changes. But this didn’t really translate to production environments: I didn’t have the observability or guardrails to trust it, let it run, or understand what was actually changing.

So I built Kyoko, a fully local system for measuring, debugging, and improving agents with Claude Code.

Add telemetry through a skill, run your agent, and Kyoko shows where performance breaks across runs. It groups recurring failures into evidence-backed issues, lets Claude Code draft fixes, and only applies changes after checks and evals pass.

It is built around the manual dev workflow: inspect traces, understand the failure, patch the prompt, context, or harness, rerun evals, and decide what ships. The point is to make that workflow repeatable without turning it into a black box.

The workflow is:

1. Capture agent runs / traces
2. Find failures that repeat across runs
3. Turn them into reviewable issues with evidence
4. Let Claude Code draft a fix
5. Rerun the failing trace, run deterministic checks, compare eval results
6. Apply the fix only if it passes the gate, otherwise park it for review

Everything is local by default: SQLite database, dashboard, traces, issues, proposals, evals. For the analysis and fix-drafting step, Kyoko can use the Claude Code CLI.

Open-sourced it here, please let me know what you think: https://github.com/kayba-ai/kyoko
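
To make the gate step concrete, here is a minimal Python sketch of the "apply only if it passes, otherwise park it for review" decision. This is not Kyoko's actual code: `ProposalResult`, `Decision`, `gate`, and the `min_gain` threshold are hypothetical names chosen for illustration, and the real project may structure this quite differently.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    APPLY = "apply"  # fix passed every step of the gate
    PARK = "park"    # fix is held for human review


@dataclass
class ProposalResult:
    """Outcome of verifying one drafted fix (all fields are hypothetical)."""
    failing_trace_passes: bool   # did rerunning the original failing trace succeed?
    checks_passed: bool          # did the deterministic checks pass?
    baseline_eval_score: float   # eval score before the fix
    candidate_eval_score: float  # eval score with the fix applied


def gate(result: ProposalResult, min_gain: float = 0.0) -> Decision:
    """Apply a fix only if every verification step passes; otherwise park it."""
    if not result.failing_trace_passes:
        return Decision.PARK
    if not result.checks_passed:
        return Decision.PARK
    # Require the eval score to improve by at least min_gain
    # (the default of 0.0 only rules out regressions).
    if result.candidate_eval_score - result.baseline_eval_score < min_gain:
        return Decision.PARK
    return Decision.APPLY


if __name__ == "__main__":
    improved = ProposalResult(True, True, 0.70, 0.80)
    regressed = ProposalResult(True, True, 0.80, 0.70)
    print(gate(improved).value)   # apply
    print(gate(regressed).value)  # park
```

Parking instead of discarding keeps a failed proposal visible for review, which fits the goal of not turning the loop into a black box.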
Originally posted by u/Lucky_Historian742 on r/ClaudeCode
