Hey! I came back with some results :)

Free tool: https://grape-root.vercel.app/ (70+ people using it, 3.7/5 average rating, still improving based on feedback!)

Okay, so: I'm trying to properly validate a tool I've been building around Claude Code, using Claude Code itself. The goal is to reduce redundant repo exploration during multi-turn coding sessions. Instead of letting the model rediscover files every turn, it keeps a lightweight graph/state of the repo so follow-ups don't start from scratch.

I didn't want to rely on "feels faster" claims, so I ran benchmarks.

**Benchmarks tested**

**SWE-bench Lite**

Result: ~25% token cost reduction on average.

The improvement mainly comes from avoiding the typical pattern: grep → wrong file → grep again → explore again. The graph layer front-loads relevant files, so Claude skips some of that exploration loop.

Some instances were much better: astropy-12907 → ~68% cheaper. But trivial bugs were *worse*, because they're mostly single-turn tasks and the graph overhead isn't worth it.

**RepoBench v1.1**

Accuracy stayed roughly the same as baseline (plain Claude Code), and cost was almost identical too, because RepoBench tasks are mostly single-turn completions, so the graph overhead never pays off.

Also: how and where can I present these results so they look properly validated?

**What I realized**

My tool actually performs best when the workflow looks like this:

- Prompt 1 → explore repo
- Prompt 2 → refine bug
- Prompt 3 → adjust fix
- Prompt 4 → edge cases

Basically, follow-up prompts. But most benchmarks seem to measure single-turn tasks, which doesn't really represent real coding sessions.

**My question**

If the thing you're testing is multi-turn repo navigation, what benchmark dataset actually makes sense?
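For anyone curious what I mean by a "lightweight graph/state of the repo": here's a rough sketch of the idea (not my actual implementation, and all the function names here are made up for illustration). It just indexes which files define which symbols, persists that index between turns, and ranks files against the query so follow-up prompts can front-load likely-relevant files instead of grepping from scratch:

```python
import json
import re
from pathlib import Path

def build_repo_graph(root: str) -> dict:
    """Index Python files by the symbols they define and the modules they import."""
    graph = {}
    for path in Path(root).rglob("*.py"):
        text = path.read_text(errors="ignore")
        defs = re.findall(r"^(?:def|class)\s+(\w+)", text, re.MULTILINE)
        imports = re.findall(r"^(?:from|import)\s+([\w.]+)", text, re.MULTILINE)
        graph[str(path.relative_to(root))] = {"defines": defs, "imports": imports}
    return graph

def save_graph(graph: dict, cache_path: str) -> None:
    """Persist the graph so the next turn doesn't re-explore the repo."""
    Path(cache_path).write_text(json.dumps(graph))

def relevant_files(graph: dict, query_terms: list[str]) -> list[str]:
    """Rank files by how many query terms match their defined symbols."""
    scores = {}
    for f, info in graph.items():
        hits = sum(1 for t in query_terms if any(t in d for d in info["defines"]))
        if hits:
            scores[f] = hits
    return sorted(scores, key=scores.get, reverse=True)
```

A real version would obviously need smarter ranking (call graphs, embeddings, etc.), but even this naive form shows why it only pays off across multiple turns: the indexing cost is paid once, the lookups are nearly free afterwards.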
Right now I'm considering two options:

1. Write a custom multi-turn benchmark script that simulates follow-up prompts
2. Use an existing dataset for agent / multi-turn code tasks

Datasets I've looked at:

- SWE-bench
- RepoBench
- Defects4J

But none of them seem designed for persistent repo state across prompts: most of them are single-turn, or at most 4-5 turns!

Curious what people here think. If you were trying to benchmark something like:

- repo navigation
- follow-up prompts
- multi-turn coding agents

what dataset would you trust? Or is the only real option to build a custom benchmark for it?
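If I do go the custom-benchmark route, option 1 could start as something like this toy cost model (all numbers and names here are invented, just to show the shape of the comparison): a session is a list of turns, the baseline pays cold exploration cost on every turn, and the graph variant pays it only on the first turn while later turns use the cached state:

```python
from dataclasses import dataclass

@dataclass
class Turn:
    prompt: str
    cold_exploration_tokens: int  # exploration cost with no prior repo state
    warm_exploration_tokens: int  # cost when the graph front-loads relevant files
    answer_tokens: int            # tokens spent actually producing the fix

def run_session(turns: list[Turn], use_graph: bool) -> int:
    """Total simulated token cost of a multi-turn session."""
    total = 0
    for i, t in enumerate(turns):
        # Without the graph every turn re-explores cold; with it, only turn 0
        # pays the cold cost (that's when the graph gets built).
        if use_graph and i > 0:
            total += t.warm_exploration_tokens
        else:
            total += t.cold_exploration_tokens
        total += t.answer_tokens
    return total

# Example: 4 follow-up turns, mirroring the explore → refine → adjust → edge-cases flow
turns = [Turn(p, 1000, 200, 500) for p in ["explore", "refine", "adjust", "edge cases"]]
baseline = run_session(turns, use_graph=False)  # 4 * (1000 + 500) = 6000
with_graph = run_session(turns, use_graph=True)  # 1500 + 3 * 700 = 3600
```

The real script would replace these hardcoded numbers with measured token counts from actual Claude Code sessions, but even the toy version makes the point visible: savings scale with turn count, which is exactly why single-turn benchmarks can't show them.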

Originally posted by u/intellinker on r/ClaudeCode