I built Artificial General Research (AGR), a Claude Code skill that turns any measurable software problem into an autonomous optimization loop. You define a metric (speed, bundle size, etc.) and a guardrail (tests, checksums); AGR then experiments, measures, commits successes, and discards failures indefinitely. It is heavily inspired by the autoresearch concepts from Andrej Karpathy and Udit Goenka, but running those loops exposed three scaling walls that AGR is built to solve:
- **Context Degradation → Stateless Iterations.** Running 50+ experiments in one conversation destroys the agent's context window. AGR uses a stateless "Ralph Loop": every iteration spins up a fresh Claude Code instance that reconstructs context by reading a persistent STRATEGY.md and results.tsv. Iteration 100 is just as sharp as iteration 1.
- **Measurement Noise → Variance-Aware Acceptance.** High overall benchmark variance (e.g., ±1s) often masks legitimate micro-improvements (e.g., 120ms). AGR evaluates sub-benchmarks independently, accepting any experiment in which a sub-benchmark improves by more than 5% without regressing the others.
- **Speed vs. Correctness → The Rework Phase.** Standard loops discard brilliant algorithmic optimizations over a minor syntax error. AGR separates the metric from the guard: if an experiment improves the metric but fails a test, it triggers a two-attempt "rework" phase to fix the implementation rather than trashing the idea.

**Real-World Results.** Tested on a C++/Python spatial analysis library: execution time dropped from 53.54s to 28.73s (-46.3%) across 14 autonomous experiments (7 kept, 7 discarded). The loop systematically moved from micro-optimizations (replacing `std::pow(x, 2)` with `x*x`) to memory improvements, and finally to architectural changes (vectorizing a kernel density estimation to bypass scikit-learn entirely) once the strategy doc detected a plateau.
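The stateless "Ralph Loop" above boils down to rebuilding the agent's entire context from disk on every iteration. A minimal sketch, assuming the two persistent files named in the post (STRATEGY.md and results.tsv); the prompt wording and file layout here are my own illustration, not AGR's actual internals:

```python
from pathlib import Path

def build_prompt(strategy_path="STRATEGY.md", results_path="results.tsv"):
    """Reconstruct full context for a brand-new agent instance.

    No chat history is carried between iterations; these two files are
    the loop's only memory, so iteration 100 starts as fresh as
    iteration 1 with no context-window degradation.
    """
    strategy = Path(strategy_path).read_text()
    results = Path(results_path).read_text()
    return (
        "You are one iteration of an autonomous optimization loop.\n"
        "## Current strategy\n" + strategy + "\n"
        "## Past results (TSV)\n" + results + "\n"
        "Propose and run ONE new experiment, then update both files."
    )
```

Each iteration would feed this prompt to a fresh Claude Code process, so the conversation never accumulates stale experiment transcripts.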
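The variance-aware acceptance rule can be sketched as a pure function over per-sub-benchmark timings. The 5% improvement threshold comes from the post; the zero-tolerance regression check is an assumption on my part, since the post only says "without regressing others":

```python
def accept(baseline, candidate, improve_thresh=0.05):
    """Accept an experiment if at least one sub-benchmark improves by
    more than improve_thresh while no sub-benchmark gets slower.

    baseline/candidate: dicts mapping sub-benchmark name -> seconds.
    Evaluating sub-benchmarks independently lets a real 120ms win
    survive even when the aggregate benchmark is noisy by +/-1s.
    """
    improved = False
    for name, base in baseline.items():
        delta = (base - candidate[name]) / base  # positive = faster
        if delta < 0:
            return False  # any regression rejects the experiment
        if delta > improve_thresh:
            improved = True
    return improved
```

In practice the thresholds would be tuned to the measured variance of each sub-benchmark, but the shape of the rule is the same.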
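The rework phase separates "is it faster?" from "is it correct?". A sketch of that control flow, where `run_benchmark`, `run_tests`, and `apply_fix` are hypothetical callables standing in for AGR's real hooks:

```python
def evaluate(run_benchmark, run_tests, apply_fix, baseline, max_rework=2):
    """Metric-first evaluation with a bounded rework phase.

    If the metric improves but the guardrail (tests) fails, spend up to
    max_rework attempts repairing the implementation instead of
    discarding the underlying idea over a minor bug.
    """
    metric = run_benchmark()
    if metric >= baseline:
        return "discard"          # no improvement: nothing worth reworking
    for attempt in range(max_rework + 1):
        if run_tests():
            return "keep"         # faster AND correct: commit it
        if attempt < max_rework:
            apply_fix()           # rework: fix correctness, keep the idea
    return "discard"              # still failing after max_rework attempts
```

A standard loop collapses this into a single pass/fail gate, which is exactly how a brilliant-but-typo'd optimization gets thrown away.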
Originally posted by u/RelativeJealous6192 on r/ClaudeCode
