Original Reddit post

I keep seeing LLM agents wired to tools with basically no app-layer safety. The common failure mode: the agent ingests untrusted text (web/email/docs), that content steers the model, and the model then calls a tool in a way that leaks secrets or performs a destructive action. Model-side "be careful" prompting is not a reliable control once tools are involved.

So I open-sourced **GuardLLM**, a small Python "security middleware" for tool-calling LLM apps:

- **Inbound hardening:** isolate and sanitize untrusted text so it is treated as data, not instructions.
- **Tool-call firewall:** gate destructive tools behind explicit authorization and fail-closed human confirmation.
- **Request binding:** bind tool calls (tool + canonical args + message hash + TTL) to prevent replay and argument substitution.
- **Exfiltration detection:** secret-pattern scanning plus overlap checks against recently ingested untrusted content.
- **Provenance tracking:** stricter no-copy rules for known-untrusted spans.
- **Canary tokens:** generation and detection to catch prompt leakage into outputs.
- **Source gating:** reduce memory/KG poisoning by blocking high-risk sources from promotion.

It is intentionally application-layer: it does not replace least-privilege credentials or sandboxing; it sits above them.

Repo: https://github.com/mhcoen/guardllm/

I'd like feedback on:

- Threat model gaps I missed
- Whether the default overlap thresholds work for real summarization and quoting workflows
- Which framework adapters would be most useful (LangChain, OpenAI tool calling, MCP proxy, etc.)
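To make the request-binding idea concrete, here is a minimal sketch of binding a tool call to canonical args, the triggering message hash, and a TTL via an HMAC ticket. All names (`bind_tool_call`, `verify_tool_call`, `SIGNING_KEY`) are hypothetical illustrations, not GuardLLM's actual API; a real deployment would use a per-session secret, not a hard-coded key.

```python
import hashlib
import hmac
import json
import time

SIGNING_KEY = b"demo-key"  # hypothetical; in practice a per-session secret

def bind_tool_call(tool: str, args: dict, message: str, ttl_s: int = 60) -> dict:
    """Issue a ticket binding tool + canonical args + message hash + TTL."""
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    msg_hash = hashlib.sha256(message.encode()).hexdigest()
    expires = int(time.time()) + ttl_s
    payload = f"{tool}|{canonical}|{msg_hash}|{expires}".encode()
    mac = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return {"tool": tool, "expires": expires, "mac": mac}

def verify_tool_call(ticket: dict, tool: str, args: dict, message: str) -> bool:
    """Fail closed: reject on expiry, arg substitution, or MAC mismatch."""
    if time.time() > ticket["expires"]:
        return False
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    msg_hash = hashlib.sha256(message.encode()).hexdigest()
    payload = f"{tool}|{canonical}|{msg_hash}|{ticket['expires']}".encode()
    mac = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(mac, ticket["mac"])
```

Canonicalizing args with sorted keys means a replayed ticket with substituted arguments (or a different triggering message) fails verification rather than silently executing.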
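The fail-closed firewall behavior could look roughly like this decorator sketch: destructive tools require an explicit confirmation callback, and the absence of one denies the call rather than allowing it. The names here (`firewall`, `DESTRUCTIVE_TOOLS`) are assumptions for illustration, not the repo's interface.

```python
DESTRUCTIVE_TOOLS = {"delete_file", "send_email", "run_shell"}  # hypothetical set

def firewall(tool_name, confirm=None):
    """Gate destructive tools; with no confirmer attached, deny (fail closed)."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            if tool_name in DESTRUCTIVE_TOOLS:
                if confirm is None or not confirm(tool_name, args, kwargs):
                    raise PermissionError(f"{tool_name}: blocked (fail closed)")
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```

The key design choice is the default: an unconfigured or unreachable human-confirmation step blocks the call instead of letting it through.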
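For the overlap check against recently ingested untrusted content, one simple version is word n-gram overlap between the model output and the untrusted text, flagged above a threshold. The function names and the 0.3 default are illustrative assumptions, not GuardLLM's shipped thresholds (which is exactly the part the post asks for feedback on, since legitimate summarization and quoting also produce overlap).

```python
def ngrams(text: str, n: int = 5) -> set:
    """Set of word n-grams, lowercased."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_ratio(output: str, untrusted: str, n: int = 5) -> float:
    """Fraction of the output's n-grams that also occur in untrusted input."""
    out = ngrams(output, n)
    if not out:
        return 0.0
    return len(out & ngrams(untrusted, n)) / len(out)

def flags_exfiltration(output: str, untrusted: str,
                       threshold: float = 0.3, n: int = 5) -> bool:
    return overlap_ratio(output, untrusted, n) >= threshold
</antml_code_unused>
```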
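Canary generation and detection can be sketched in a few lines: plant a random token in the system prompt, then scan outputs for it. The token format and function names below are hypothetical, not the library's.

```python
import secrets

def make_canary(prefix: str = "cnry") -> str:
    """Random token to embed in the system prompt; format is illustrative."""
    return f"{prefix}-{secrets.token_hex(8)}"

def leaked_canaries(output: str, canaries: set) -> set:
    """Return any planted canary tokens that appear in the model output."""
    return {c for c in canaries if c in output}
```

A hit means some portion of the prompt was echoed into an output channel, which is a strong, low-false-positive signal compared with overlap heuristics.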

Originally posted by u/MapDoodle on r/ClaudeCode