Original Reddit post

There's a testing gap in AI agent development that I think the broader engineering community hasn't fully grappled with yet.

We have good tooling for:

- Unit/integration tests for deterministic code
- Evals for LLM output quality (promptfoo, DeepEval, etc.)
- Observability for post-deploy monitoring (LangSmith, Datadog)

We don't have mature tooling for:

- Pre-deploy chaos testing: does the agent survive when its environment breaks?

This matters more for agents than for traditional software because:

- Agents are non-deterministic by design, so you can't assert exact outputs
- Agents have complex tool dependency graphs, so failures cascade in non-obvious ways
- Agents operate autonomously, so a failure that a human reviewer would catch in a traditional app goes unnoticed

The specific failure class I'm talking about:

- Traditional chaos engineering tests: "what happens when service X goes down?"
- Agent chaos engineering tests: "what happens when tool X times out, AND the LLM returns a format your parser doesn't expect, AND a previous tool response contained an adversarial instruction?"

That combination doesn't show up in evals. It shows up in production at 2am.

I spent the last few months building an open source framework (Flakestorm) that applies chaos engineering principles specifically to AI agents. Four pillars:

- Environment faults
- Behavioral contracts
- Replay regression
- Context attacks

Curious what the broader programming community thinks about this problem space. Is pre-deploy chaos testing for agents something your teams are thinking about? What's your current approach to testing agent reliability before shipping?
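To make the "environment faults" idea concrete, here is a minimal sketch of fault injection on an agent tool. Everything here (`chaos_wrap`, the fault rates, the `ToolFault` exception) is illustrative and assumed, not Flakestorm's actual API; a real harness would schedule faults deterministically so a failing run can be replayed.

```python
import random

class ToolFault(Exception):
    """Simulated tool failure injected during a chaos run."""

def chaos_wrap(tool_fn, timeout_rate=0.2, garble_rate=0.2, seed=None):
    """Wrap an agent tool so each call may fail or return malformed output.

    Names and rates are illustrative; seeding the RNG keeps a chaos
    run reproducible.
    """
    rng = random.Random(seed)

    def wrapped(*args, **kwargs):
        roll = rng.random()
        if roll < timeout_rate:
            # environment fault: the tool never answers
            raise ToolFault("simulated timeout")
        result = tool_fn(*args, **kwargs)
        if roll < timeout_rate + garble_rate:
            # format the agent's parser won't expect
            return "NOT-JSON: " + str(result)
        return result

    return wrapped

# Example: a toy "search" tool (hypothetical) run under chaos
def search(query):
    return {"results": [query.upper()]}

flaky_search = chaos_wrap(search, seed=42)
```

The point of the wrapper is that the agent loop under test doesn't change at all; only its environment becomes hostile, which is what separates this from eval-style output scoring.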
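And since exact-output assertions don't work for non-deterministic agents, "behavioral contracts" can be read as invariant checks over any acceptable turn. The sketch below is an assumed illustration of that idea, not Flakestorm's contract format: the contract terms (parseable output, allowed tool set, no echoed injection phrase) are made up for the example.

```python
import json

def check_contract(raw_output, allowed_tools):
    """Return a list of contract violations for one agent turn.

    Instead of asserting an exact output, assert properties that must
    hold for *any* acceptable turn. Contract terms are illustrative.
    """
    violations = []
    try:
        turn = json.loads(raw_output)
    except json.JSONDecodeError:
        # the parser-facing invariant: output must at least parse
        return ["output is not valid JSON"]
    tool = turn.get("tool")
    if tool is not None and tool not in allowed_tools:
        violations.append(f"called disallowed tool: {tool}")
    if "ignore previous instructions" in raw_output.lower():
        violations.append("echoed an injected instruction")
    return violations

# A turn that parses but calls an unexpected tool fails the contract:
# check_contract('{"tool": "shell"}', {"search", "calculator"})
# -> ["called disallowed tool: shell"]
```

Running contract checks over turns recorded from a chaos run is also what makes replay regression possible: the same recorded faults, the same invariants, on every commit.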

Originally posted by u/No-Common1466 on r/ArtificialInteligence