Original Reddit post

I’ve been using Claude Code daily (Opus 4.6, Max plan) and wanted Programmatic Tool Calling (PTC). Quick explanation for those unfamiliar: with normal tool calling, every operation the agent performs (read a file, search for a pattern, read another file…) is a separate round-trip that dumps its full result into the context window. PTC flips this: the agent writes code that runs in an isolated environment, does all the work there, and only the final processed result enters the context. The intermediate steps never touch the context window.

Problem is, PTC isn’t available in Claude Code yet; it’s only on the API. So I built it myself. The repo is private, and I’m not here to promote anything, just sharing data.

What I built

Thalamus is a local MCP server that gives Claude Code a PTC-like capability. The core idea: instead of the agent making 10 separate tool calls (grep, read, grep, read…), it writes a Python block that runs in an ephemeral subprocess with pre-loaded primitives for filesystem, memory, and conversation navigation. Only the processed result comes back into context. Four tools: execute() (runs Python with the primitives), search, remember, context. 143 tests, Python stdlib only, fully local.

Important caveat upfront: this is my own implementation, not Anthropic’s. The architecture decisions I made (how the primitives work, how the subprocess is structured, what’s exposed) directly affect the results. If Anthropic shipped PTC natively in Claude Code, the numbers could look very different. I’m sharing this as one data point from a real user who wanted PTC badly enough to build it, not as a definitive study.

What the industry claims vs what I measured

Anthropic’s blog reports a 98.7% token reduction. Cloudflare says 81% on complex tasks. Both are measured on optimal scenarios (massive APIs, data-heavy pipelines). I parsed the raw JSONL session files from 79 real sessions over a week of daily work. The verdict: meaningful on the right tasks.
But my real-world daily mix is far from 98%.

What the agent actually does inside execute()

This is the part I didn’t expect. I did content analysis on all 112 execute() calls:

- 64% used standard Python (os.walk, open, sqlite3, subprocess), not my PTC primitives at all
- 30% used a single primitive (one fs.read or fs.grep)
- 5% did true batching (2+ primitives combined)

The “replace 5 Reads with 1 execute” pattern? 5% of actual usage. The agent mostly used execute() as a general-purpose compute environment: accessing files outside the project, running aggregations, querying databases. Valuable, but not for the reasons I designed it.

Now, is this because my primitives aren’t well designed enough? Because the system prompt instructions could be better? Because the agent naturally gravitates toward stdlib when given a Python sandbox? Honestly, I don’t know. It could be any or all of these.

Adoption doesn’t happen on its own

First measurement: only 25% of sessions used PTC. The agent defaulted to Read/Grep/Glob every time. I added a ~1,100-token operational manual to my CLAUDE.md, and adoption jumped to 42.9%. Without explicit instructions, the agent won’t use PTC even when it’s available. This matches what I’ve read about Cloudflare’s approach: they expose only 2 tools for 2,500+ endpoints, making code mode the only option.

Edit-heavy sessions don’t benefit

Sessions focused on writing code (Edit + Bash dominant) showed zero PTC usage. PTC seems to shine in analysis, debugging, and cross-file research, not in the most common development workflow. I haven’t seen anyone make this distinction explicitly.

Where I’d genuinely appreciate input

I built this because no one was giving me PTC in Claude Code, and I wanted to see if the hype matched reality. The answer is “partially, and differently than expected.” But I’m one person with one implementation.
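For readers who want the shape of the thing: below is a heavily simplified sketch of the execute() mechanism and the “true batching” pattern I designed for. This is illustrative only, not my real Thalamus code; the primitive names (fs_read, fs_grep) and the subprocess wiring are stand-ins.

```python
# Simplified sketch of a PTC-style execute() tool (illustrative, not the real
# implementation). The agent-written snippet runs in an ephemeral subprocess
# with pre-loaded primitives; only what it prints re-enters the context.
import os
import subprocess
import sys
import tempfile
import textwrap

# Hypothetical pre-loaded primitives, injected ahead of the agent's snippet.
PRIMITIVES = textwrap.dedent("""
    import os, re

    def fs_read(path):
        with open(path, encoding="utf-8") as f:
            return f.read()

    def fs_grep(pattern, path):
        rx = re.compile(pattern)
        return [ln for ln in fs_read(path).splitlines() if rx.search(ln)]
""")

def execute(snippet: str, timeout: int = 30) -> str:
    """Run an agent-written snippet in a fresh subprocess; return stdout only."""
    proc = subprocess.run(
        [sys.executable, "-c", PRIMITIVES + "\n" + snippet],
        capture_output=True, text=True, timeout=timeout,
    )
    return proc.stdout.strip() if proc.returncode == 0 else f"error: {proc.stderr.strip()}"

# "True batching": one execute() call instead of several Read/Grep round-trips.
# Set up a toy project tree as a stand-in for a real repo.
root = tempfile.mkdtemp()
for name, text in {"a.py": "import os\n", "b.py": "import os\nimport sys\n"}.items():
    with open(os.path.join(root, name), "w") as f:
        f.write(text)

# The agent's snippet greps every file and aggregates; the intermediate file
# contents never touch the context window, only the final summary does.
snippet = textwrap.dedent(f"""
    counts = {{}}
    for fn in sorted(os.listdir({root!r})):
        for line in fs_grep(r"^import ", os.path.join({root!r}, fn)):
            mod = line.split()[1]
            counts[mod] = counts.get(mod, 0) + 1
    print(sorted(counts.items()))
""")
print(execute(snippet))  # -> [('os', 2), ('sys', 1)]
```

In the measured sessions the agent rarely wrote snippets like this last one; it mostly ignored the primitives and reached for stdlib directly.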
If you’ve built similar tooling or used Cloudflare’s Code Mode / FastMCP:

- Does the “general-purpose compute” pattern match your experience, or is it specific to my setup?
- Are there architectural choices I might be getting wrong that would explain the low batching rate?
- Has anyone measured PTC on real daily work rather than benchmarks?

I’d love to compare notes. Any feedback, criticism, suggestions: genuinely welcome. This is a solo project and I’d love to improve it with community input.

Originally posted by u/samuel-gudi on r/ClaudeCode