Original Reddit post

Last time I posted about v3.6 and cross-agent intelligence. I skipped posting about v4 entirely (Polyphony — container-isolated multi-agent orchestration, 173 tests) because v5 shipped days later and it's a bigger story.

**The problem: you're burning premium tokens on tasks that don't need them**

Every task goes to Claude. Simple README fix? Claude. Database schema? Claude. CRUD endpoint? Claude. Security audit? Also Claude. You're using the most expensive model for everything, and if you hit the rate limit, you're stuck.

I use Claude Code, Kimi CLI, Codex CLI, and Ollama locally. v3.6 got them sharing skills and hooks. v5 makes Maggy actually decide which one to use per task, based on complexity.

**How it works: blast-score routing**

Every task gets a blast score (1-10 complexity). Maggy routes it:

- Blast 1-3 → ollama (free, local GPU) or kimi (cheap)
- Blast 4-6 → codex/gpt (mid-tier)
- Blast 7-10 → claude (premium, with validator)

The routing isn't hardcoded. It's a YAML config at `~/.maggy/routing-rules.yaml` that Maggy updates herself based on outcomes. If codex keeps failing on frontend tasks, Maggy learns to route those to claude instead.

**The benchmark: Maggy vs Claude Code, head to head**

Built an Expense Tracker (FastAPI + SQLite + vanilla JS) — 6 identical tasks, run through both pipelines. Same machine, same prompts, same acceptance criteria.

| Task | Blast | Maggy model | Maggy (s) | Claude (s) |
|---|---|---|---|---|
| Write product spec | 2 | ollama (local) | 50.4 | 48.6 |
| Design DB schema | 3 | kimi | 86.6 | 67.2 |
| Build CRUD API | 5 | codex | 147.1 | 160.6 |
| Build category API | 5 | codex | 133.9 | 130.8 |
| Build frontend | 6 | codex | 280.1 | 121.9 |
| Security review | 8 | claude | 209.5 | 151.9 |

Results:

| Metric | Maggy | Claude Code |
|---|---|---|
| Success rate | 6/6 (100%) | 6/6 (100%) |
| Total time | 907.6 s | 681.0 s |
| Quality score | 7.4/10 | 7.8/10 |
| Claude usage | 1/6 tasks (17%) | 6/6 tasks (100%) |
| Models used | 4 | 1 |
| Security depth | 7 issues found + fixed | No dedicated review |
| Test generation | None | 3 test files |
| Fallbacks needed | 0 | N/A |

Claude Code was 33% faster and scored slightly higher (it wrote tests and a stronger product spec). Maggy used 83% less Claude and still hit 100% success across 4 different models.
Zero fallbacks — every CLI completed its assigned task. The quality gap (7.4 vs 7.8) came from two routing mistakes: ollama was assigned the docs task (it's code-optimized, not prose-optimized), and no model was told to write tests. Both are now fixed via routing rules — docs and tests force-route to Claude regardless of blast score.

**Post-benchmark: self-correcting routing rules**

After the benchmark exposed those gaps, I built a rules system that learns:

```yaml
# Task types that always go to Claude (from benchmark evidence)
task_type_overrides:
  docs:     {model: claude, reason: "local models are code-optimized, not prose"}
  security: {model: claude, reason: "security review needs deep reasoning"}
  tests:    {model: claude, reason: "only claude generated test files"}

# TDD pipeline phases
pipeline_phases:
  spec:      {model: claude}  # SPEC needs comprehensive docs
  tdd_red:   {model: claude}  # RED phase needs test design
  tdd_green: {model: auto}    # GREEN uses blast-score routing
  review:    {model: claude}  # Review needs security depth
```
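To make the routing concrete, here's a minimal sketch of how the blast-score tiers and the task-type overrides could combine into one routing decision. The names (`route_task`, `RoutingRules`) and the rule structure are my assumptions for illustration, not Maggy's actual internals:

```python
# Hypothetical sketch: combine task-type overrides with blast-score tiers.
# Names (route_task, RoutingRules) are illustrative, not Maggy's real API.
from dataclasses import dataclass, field

@dataclass
class RoutingRules:
    # Mirrors ~/.maggy/routing-rules.yaml: hard overrides first, then tiers.
    task_type_overrides: dict = field(default_factory=lambda: {
        "docs": "claude", "security": "claude", "tests": "claude",
    })
    blast_tiers: list = field(default_factory=lambda: [
        (3, ["ollama", "kimi"]),   # blast 1-3: free/cheap
        (6, ["codex"]),            # blast 4-6: mid-tier
        (10, ["claude"]),          # blast 7-10: premium, with validator
    ])

def route_task(task_type: str, blast: int, rules: RoutingRules) -> str:
    """Pick a model: task-type overrides win, otherwise use the blast tier."""
    if task_type in rules.task_type_overrides:
        return rules.task_type_overrides[task_type]
    for ceiling, candidates in rules.blast_tiers:
        if blast <= ceiling:
            return candidates[0]   # later candidates are fallbacks on failure
    return "claude"

# Example: a blast-5 CRUD task goes to codex, but docs always go to claude.
print(route_task("api", 5, RoutingRules()))   # codex
print(route_task("docs", 2, RoutingRules()))  # claude
```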

Every task outcome feeds back: `record_outcome()` updates rolling success rates per model, and `learn_override()` lets Maggy add new rules when the data supports it. Manual edits are preserved.
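A rough sketch of what that feedback loop could look like — rolling success rates keyed per model and task type, plus a threshold check before a new override is written. The data layout and thresholds here are guesses, not the actual `record_outcome`/`learn_override` implementation:

```python
# Hypothetical sketch of the outcome feedback loop; the storage layout and
# thresholds are assumptions, not Maggy's actual implementation.
from collections import defaultdict

stats = defaultdict(lambda: {"runs": 0, "wins": 0})   # key: (model, task_type)

def record_outcome(model: str, task_type: str, success: bool) -> None:
    """Update rolling success rates after every completed task."""
    s = stats[(model, task_type)]
    s["runs"] += 1
    s["wins"] += int(success)

def learn_override(task_type: str, overrides: dict,
                   min_runs: int = 5, min_rate: float = 0.4) -> None:
    """If the routed model keeps failing a task type, force-route it to claude.
    Manual entries already present in `overrides` are left untouched."""
    if task_type in overrides:
        return
    for (model, ttype), s in stats.items():
        if ttype == task_type and model != "claude" and s["runs"] >= min_runs:
            if s["wins"] / s["runs"] < min_rate:
                overrides[task_type] = "claude"
```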

**Team conventions in every prompt**

One thing I noticed: kimi and codex don't know about our team's coding standards. Claude gets them from `CLAUDE.md`, but the other CLIs don't. Now every prompt sent to any CLI — kimi, codex, ollama, claude — gets the same conventions injected:
- Build minimum wowable product (mWP). No feature flags, no premature abstractions.
- TDD: RED → GREEN → VALIDATE. Coverage >= 80%.
- No secrets in code. Parameterized SQL. Validate input at boundaries.
- Quality gates: 20 lines/fn, 3 params, 2 nesting levels, 200 lines/file.

This standardizes quality expectations across all models.
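As a rough illustration of the injection step (the wrapper name and the conventions source are my guesses, not the actual code), every outgoing prompt could simply be prefixed with the same conventions block before it reaches any CLI:

```python
# Hypothetical sketch: prepend shared team conventions to every prompt,
# regardless of which CLI adapter (claude, codex, kimi, ollama) runs it.
from pathlib import Path

CONVENTIONS = """\
Team conventions (apply to all output):
- Build minimum wowable product (mWP). No feature flags, no premature abstractions.
- TDD: RED -> GREEN -> VALIDATE. Coverage >= 80%.
- No secrets in code. Parameterized SQL. Validate input at boundaries.
- Quality gates: 20 lines/fn, 3 params, 2 nesting levels, 200 lines/file.
"""

def build_prompt(task_prompt: str, conventions_file: Path | None = None) -> str:
    """Inject the shared conventions ahead of the task prompt."""
    conventions = (conventions_file.read_text()
                   if conventions_file and conventions_file.exists()
                   else CONVENTIONS)
    return f"{conventions}\n---\n{task_prompt}"

print(build_prompt("Build the CRUD API for expenses."))
```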
**What else shipped in v4-v5 (that I never posted about)**

v4.0 — Polyphony: container-isolated multi-agent orchestration. Each agent runs in its own Docker container with independent git branches. 5-dimension complexity scoring, SQLite task state machine, pure-function router, adapters for Claude/Codex/Kimi. 173 tests.

v5.0 — Everything above, plus:

- CLI auto-discovery engine (probes `--help`, extracts flags — no hardcoded CLI knowledge; see the sketch after this list)
- Pi RPC adapter (unified interface for spawning any CLI as a subprocess)
- Dual-model planning (Claude plans, Codex counter-checks for blast >= 7)
- Checkpoint manager for model handoffs during fallback chains
- Fatigue tracking (Mnemos detects context degradation, triggers compression)
- Lock manager (prevents two agents editing the same file)
- Escalation protocol (3+ failures → auto-escalate to a human)
- Rollback/recovery (git savepoints before risky steps)
- Calibration tracker (penalizes models with poor prediction accuracy)
- Reward heatmap (visualizes which model wins per task type × complexity tier)
- Budget tracking across providers (per-provider spend, daily limits)
- Interactive chat with `--resume` session takeover
- Auto-bootstrap (seeds all services on startup — no empty dashboard)
- 596 tests across 50 test files
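A minimal sketch of what `--help` probing could look like: run each candidate CLI with `--help` and pull flag names out of the output with a regex. The candidate list and flag pattern are assumptions for illustration, not the actual discovery engine:

```python
# Hypothetical sketch of CLI auto-discovery: probe `--help` and extract flags.
# The candidate list and regex are illustrative; the real engine may differ.
import re
import shutil
import subprocess

CANDIDATES = ["claude", "codex", "kimi", "ollama"]

def discover_cli(name: str) -> dict | None:
    """Return {name, path, flags} if the CLI exists and answers --help."""
    path = shutil.which(name)
    if path is None:
        return None
    try:
        out = subprocess.run([path, "--help"], capture_output=True,
                             text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return None
    help_text = out.stdout + out.stderr
    flags = sorted(set(re.findall(r"(--[a-zA-Z][\w-]*)", help_text)))
    return {"name": name, "path": path, "flags": flags}

adapters = [cli for name in CANDIDATES if (cli := discover_cli(name))]
for cli in adapters:
    print(cli["name"], len(cli["flags"]), "flags discovered")
```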
**The numbers**

- 115 source modules, 50 test files, 596 tests passing
- 4 CLI adapters (claude, codex, kimi, ollama) auto-discovered at startup
- Self-updating routing rules with outcome-based learning
- 83% reduction in premium model usage on a real project benchmark
- Zero manual config for new CLIs — `--help` probing handles it

Repo: github.com/alinaqi/claude-bootstrap

Install: git clone, run `./install.sh`, then `/initialize-project` in any Claude Code session. Maggy is an optional extension (for now; my focus is shifting entirely to Maggy): run `/maggy-init` to set it up if you want the dashboard + routing.

Originally posted by u/naxmax2019 on r/ClaudeCode