Original Reddit post

Something that’s been bugging me about how AI agent stacks delegate work between models. When a host model delegates to a cheaper specialized one (summarization, extraction, that sort of thing), the dominant pattern is to inline the document directly into the delegation message. The host model pays expensive-tier output rates to compose and emit thousands of tokens of content it isn’t actually reasoning about; it’s a literal courier, billed by the syllable. Concrete example, for a real ~45,000-token document: Before (pass-by-value): { “model”: “summarizer”, “messages”: [ { “role”: “user”, “content”: “Summarize:\n\n[~45,000 tokens of document inlined here]” } ] } The host composes all 45,000 tokens. At Sonnet 4.6’s $15/M output rate, that’s ~$0.67 per delegation paid to do nothing but courier the doc. At Opus 4.6’s $25/M, ~$1.13. Multiplied across the thousands of delegations a busy agent runs in a month, the courier tax stops looking small. After (pass-by-reference, what FlashQuery does): { “resolver”: “purpose”, “name”: “summarization”, “messages”: [ { “role”: “user”, “content”: “Summarize:\n\n{{ref:Some/Doc.md}}” } ] } Host composes ~25 tokens. So I’m resolving the reference and injecting the content into the delegated call only; the host never sees the document. Roughly an 1,800x reduction in what the host has to compose and emit, per delegation. Software figured pass-by-reference out decades ago; the LLM orchestration layer mostly hasn’t. The underlying models can handle references fine; the frameworks above them haven’t been built to provide any. FlashQuery is the open-source data layer I’ve been building for memory, documents, and structured records across any LLM, and this is one of the things it does. Full write-up (counterarguments, the projections pattern, the math): https://flashquery.ai/post/pass-by-reference-llm-harness.html Project: https://github.com/FlashQuery/flashquery submitted by /u/jetstros

Originally posted by u/jetstros on r/ClaudeCode