Original Reddit post

I’ve been working on logreduce, a small static binary that takes a noisy log file and produces a much smaller summary for feeding into an LLM. The goal is to keep every distinct error/event while cutting the repetitive noise that eats up context and tokens.

The core idea is TF-IDF ranking over masked templates: timestamps, UUIDs, IPs and numbers get replaced with placeholders, so thousands of near-identical lines collapse into one template instead of looking unique. Every distinct template is guaranteed at least one representative line. Likely secrets/PII (JWTs, API keys, emails) are redacted on a best-effort basis before any scoring happens.

Benchmarked against LogDx’s 35-case real CI-incident dataset with human-labelled ground truth: roughly 99% critical-signal recall at both 8K and 2K token budgets.

It also has optional integrations for Claude Code (hook + skill), Cursor (project rule), and GitHub Copilot (custom instructions), so a coding agent can run logs through it before reading them directly. No Node or MCP server needed, just the one binary.

Written in Rust (tiktoken-rs, regex, rayon, clap, serde). Install with cargo install logreduce or grab a prebuilt binary from releases.

GitHub: https://github.com/alexcpn/log_tfidf_reducer

Curious what people think of the masking/scoring heuristics, and happy to take real-world log samples that trip it up.
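To make the masked-template idea concrete, here is a minimal std-only Rust sketch. It is not logreduce's actual code: the function names (`mask_line`, `summarize`), the `Entry` struct, the token rules and the `<TS>`/`<UUID>`/`<IP>`/`<NUM>` placeholders are all assumptions made for illustration. It also swaps the real TF-IDF scoring for a much simpler IDF-style rarity score, ln(total lines / lines sharing the template), and leaves out redaction and token budgeting entirely.

```rust
use std::collections::HashMap;
use std::net::Ipv4Addr;

/// One summary row: a masked template plus the first raw line that produced it.
#[derive(Debug)]
struct Entry {
    template: String,
    representative: String,
    count: usize,
    score: f64,
}

// Hypothetical token classifiers; a real tool would likely use regexes.
fn is_uuid(t: &str) -> bool {
    t.len() == 36
        && t.chars().enumerate().all(|(i, c)| match i {
            8 | 13 | 18 | 23 => c == '-',
            _ => c.is_ascii_hexdigit(),
        })
}

fn is_timestamp(t: &str) -> bool {
    t.len() >= 8
        && t.starts_with(|c: char| c.is_ascii_digit())
        && t.contains(':')
        && t.chars().all(|c| c.is_ascii_digit() || "-:.TZ+".contains(c))
}

fn is_number(t: &str) -> bool {
    !t.is_empty() && t.chars().all(|c| c.is_ascii_digit())
}

/// Replace a variable-looking token with a placeholder, else keep it as-is.
fn mask_token(tok: &str) -> &str {
    if is_uuid(tok) {
        "<UUID>"
    } else if tok.parse::<Ipv4Addr>().is_ok() {
        "<IP>"
    } else if is_timestamp(tok) {
        "<TS>"
    } else if is_number(tok) {
        "<NUM>"
    } else {
        tok
    }
}

/// Mask every whitespace-separated token and rejoin with single spaces.
fn mask_line(line: &str) -> String {
    line.split_whitespace().map(mask_token).collect::<Vec<_>>().join(" ")
}

/// Collapse lines into templates, keep the first line of each template as its
/// representative, and rank the rarest templates first.
fn summarize(lines: &[&str]) -> Vec<Entry> {
    let mut index: HashMap<String, usize> = HashMap::new();
    let mut entries: Vec<Entry> = Vec::new();
    for line in lines {
        let template = mask_line(line);
        match index.get(&template).copied() {
            Some(i) => entries[i].count += 1,
            None => {
                index.insert(template.clone(), entries.len());
                entries.push(Entry {
                    template,
                    representative: line.to_string(),
                    count: 1,
                    score: 0.0,
                });
            }
        }
    }
    let total = lines.len() as f64;
    for e in entries.iter_mut() {
        // Simplified rarity score (assumption): rarer templates score higher.
        e.score = (total / e.count as f64).ln();
    }
    // Stable sort, so ties stay in first-seen order.
    entries.sort_by(|a, b| b.score.partial_cmp(&a.score).unwrap());
    entries
}
```

Calling `summarize` on three health-check lines that differ only in timestamp, IP and latency, plus one error line carrying a UUID, yields two entries: the error template first (it is the rarest), then a single representative health-check line with a count of 3. Because every template keeps its first raw line, nothing distinct is dropped at this stage; trimming to a token budget would be a separate step.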

Originally posted by u/alexcpn on r/ClaudeCode