so i hit this point where i was staring at 4 different ai tools (rootly, incident.io, datadog’s bits ai, and a couple others i won’t name here) all promising to do the exact same thing, and realized i had zero framework for picking between them. i was just going off whatever had the best demo video, Twitter hype, benchmarks, etc., which in hindsight is a dumb way to make infra decisions.

the thing that actually taught me something was throwing one of them at a live incident and watching it generate 47 alerts off a single log line. i was like oh. so yeah, i needed to figure out what i actually wanted out of these before letting them near prod, instead of just going off the hype. so here’s the stuff i landed on, mostly from getting it wrong first.

first one is there’s a real gap between tools that find problems and tools that help you understand them. most of these are great at the finding part: they’ll scan your logs and metrics and just scream at you. the understanding part is way harder. i had one that flagged memory spikes for weeks and never once connected them to the fact that they lined up exactly with our deploy schedule, which was great to figure out on my own.

the second one, and this is the one that changed how i evaluate this stuff, is that context beats accuracy. i kept comparing tools on “how many incidents did it catch” when i should’ve been asking how much each alert actually handed me. one tool caught fewer things, but every alert came with the diff of what changed, a timeline of the related metrics, and a rough guess at cause, and that was WAY more useful than the thing that caught everything and just linked me a log line to go read myself. (which sounds obvious typed out, it was not obvious to me at 2am.)

then there’s the customization angle. the tools that let you actually mess with the logic were the ones that stuck around.
like we use coderabbit for code review, and the part that made it stick was being able to tweak what patterns it flags so it fits our codebase instead of nagging about stuff we don’t care about. same idea on the sre side. if you can’t tell a tool “ignore this metric between 2 and 4am because that’s just batch jobs”, it’s going to bury your team in noise until everyone quietly stops looking at it.

which is sort of the whole game. everyone optimizes for catching everything and nobody prices in alert fatigue. i’d rather miss something minor than have the whole team start ignoring the alerts, which is exactly what happens once the noise crosses some line. the tool that let me set a confidence threshold was the one people actually left turned on.

also nobody warns you how much it matters that the thing fits your existing setup. i tried one that wanted its own dashboard and its own slack integration and its own pagerduty config, and by the time i’d wired all that up i could’ve just written the alert myself. the ones that worked just plugged into what we already had.

anyway, the part i’m still stuck on is how you even measure roi on any of this. the oncall team seems calmer but i can’t exactly put “vibes improved” on a slide for my manager. maybe it’s just that if your team isn’t ignoring the alerts then the tool is working, but idk
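
for anyone wondering what the “ignore this metric between 2 and 4am” rule plus a confidence threshold could look like in practice, here’s a rough python sketch. everything in it is made up for illustration: the `Alert` and `QuietWindow` shapes, the `should_page` function, the metric names, and the 0.7 cutoff are all assumptions, not the config format or api of any tool mentioned above.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class Alert:
    metric: str
    confidence: float  # hypothetical 0.0-1.0 score attached by the tool
    fired_at: datetime


@dataclass
class QuietWindow:
    metric: str
    start: time
    end: time

    def covers(self, alert: Alert) -> bool:
        # Half-open window [start, end). Assumes the window does not
        # cross midnight, which is enough for a 2am-4am batch window.
        same_metric = alert.metric == self.metric
        in_window = self.start <= alert.fired_at.time() < self.end
        return same_metric and in_window


def should_page(alert: Alert, windows: list[QuietWindow],
                min_confidence: float = 0.7) -> bool:
    # Drop low-confidence alerts first, then anything inside a quiet window.
    if alert.confidence < min_confidence:
        return False
    return not any(w.covers(alert) for w in windows)


# Hypothetical rule: batch jobs make this metric noisy between 2 and 4am.
windows = [QuietWindow("batch_cpu", time(2, 0), time(4, 0))]

noisy = Alert("batch_cpu", 0.9, datetime(2024, 1, 1, 3, 0))
real = Alert("batch_cpu", 0.9, datetime(2024, 1, 1, 5, 0))
print(should_page(noisy, windows))  # False: inside the quiet window
print(should_page(real, windows))   # True: same metric, outside the window
```

the point is less the code and more that both knobs (quiet windows and the threshold) are things the team can change themselves when the noise starts creeping up.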
Originally posted by u/notomarsol on r/ClaudeCode
