I’ve been working on logreduce, a small static binary that takes a noisy
log file and produces a much smaller summary for feeding into an LLM —
the goal being to keep every distinct error/event while cutting the
repetitive noise that eats up context and tokens.
The core idea is TF-IDF ranking over masked templates: timestamps,
UUIDs, IPs and numbers get replaced with placeholders so that thousands
of near-identical lines collapse into one template instead of looking
unique. Every distinct template is guaranteed at least one
representative line, and likely secrets/PII (JWTs, API keys,
emails) are redacted on a best-effort basis before any scoring happens.
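To make the masking idea concrete, here is a minimal std-only sketch. It is not logreduce's actual code: the token rules, the placeholder names (<TS>, <IP>, <UUID>, <NUM>), the Group struct and the reduce function are all made up for illustration, and the score is a simple rarity measure, ln(total lines / template count), in the spirit of IDF rather than full TF-IDF. Redaction and token budgets are left out.

```rust
use std::collections::HashMap;

// Hypothetical token classifiers; a real masker would likely use regexes.
fn is_uuid(v: &str) -> bool {
    v.len() == 36
        && v.bytes().enumerate().all(|(i, b)| match i {
            8 | 13 | 18 | 23 => b == b'-',
            _ => b.is_ascii_hexdigit(),
        })
}

fn is_ip(v: &str) -> bool {
    let parts: Vec<&str> = v.split('.').collect();
    parts.len() == 4 && parts.iter().all(|p| p.parse::<u8>().is_ok())
}

fn is_timestamp(v: &str) -> bool {
    v.starts_with(|c: char| c.is_ascii_digit())
        && v.contains(':')
        && v.chars().all(|c| c.is_ascii_digit() || "-:.TZ+".contains(c))
}

fn is_number(v: &str) -> bool {
    v.chars().any(|c| c.is_ascii_digit())
        && v.chars().all(|c| c.is_ascii_digit() || c == '.')
}

// Map a variable-looking value to a placeholder, or None to keep it as-is.
fn mask_value(v: &str) -> Option<&'static str> {
    if is_uuid(v) {
        Some("<UUID>")
    } else if is_ip(v) {
        Some("<IP>")
    } else if is_timestamp(v) {
        Some("<TS>")
    } else if is_number(v) {
        Some("<NUM>")
    } else {
        None
    }
}

// Turn a raw line into its masked template, token by token.
// For key=value tokens only the value part is considered for masking.
fn template(line: &str) -> String {
    line.split_whitespace()
        .map(|tok| match tok.split_once('=') {
            Some((k, v)) => match mask_value(v) {
                Some(p) => format!("{k}={p}"),
                None => tok.to_string(),
            },
            None => mask_value(tok).unwrap_or(tok).to_string(),
        })
        .collect::<Vec<_>>()
        .join(" ")
}

#[derive(Debug)]
pub struct Group {
    pub template: String,
    pub representative: String, // first raw line seen for this template
    pub count: usize,
    pub score: f64, // ln(total lines / count): rarer templates score higher
}

// Collapse lines into templates, keep one representative per template,
// and rank templates so the rarest come first.
pub fn reduce(lines: &[&str]) -> Vec<Group> {
    let mut index: HashMap<String, usize> = HashMap::new();
    let mut groups: Vec<Group> = Vec::new();
    for line in lines {
        let t = template(line);
        match index.get(&t).copied() {
            Some(i) => groups[i].count += 1,
            None => {
                index.insert(t.clone(), groups.len());
                groups.push(Group {
                    template: t,
                    representative: line.to_string(),
                    count: 1,
                    score: 0.0,
                });
            }
        }
    }
    let total = lines.len() as f64;
    for g in &mut groups {
        g.score = (total / g.count as f64).ln();
    }
    // Stable sort: ties keep first-seen order.
    groups.sort_by(|a, b| b.score.total_cmp(&a.score));
    groups
}
```

Feeding reduce three health-check lines that differ only in timestamp, IP and latency, plus one error line, gives two groups: the error template ranks first and the health-check template carries a count of 3 with its first occurrence as the representative.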
Benchmarked against LogDx’s 35-case real CI-incident dataset with
human-labelled ground truth: roughly 99% critical-signal recall at both
8K and 2K token budgets.
It also has optional integrations for Claude Code (hook + skill), Cursor
(project rule), and GitHub Copilot (custom instructions), so a coding
agent can run logs through it before reading them directly — no Node or
MCP server needed, just the one binary.
Written in Rust (tiktoken-rs, regex, rayon, clap, serde).
Install with cargo install logreduce, or grab a prebuilt binary from releases.
GitHub: https://github.com/alexcpn/log_tfidf_reducer
Curious what people think of the masking/scoring heuristics — and happy
to take real-world log samples that trip it up.
Originally posted by u/alexcpn on r/ClaudeCode
