Data pipelines look healthy until they're not: everything green, metrics stable, no alerts. Then you realize downstream data is wrong and nothing ever failed loudly.

Our setup is pretty typical: Spark -> Kafka -> DB, with dashboards and alerts on lag and error rates. That works fine for obvious failures. The problem is the silent ones:

- schema drift that only breaks one consumer
- partition skew that degrades performance gradually
- nodes running unevenly, but not unevenly enough to trigger alerts

Last week a pipeline dropped ~20% of events because a parser started failing on a new data pattern. No alert, nothing obvious in the metrics, and the logs were too noisy to catch it early.

We've tried adding more checks (record counts, validation at different stages), but they quickly turn into noise. How are you catching these kinds of silent failures early without overwhelming the system with alerts? What's actually worked for you?

To make it concrete, two rough sketches of the directions we're weighing are below.
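First, the parser incident. A minimal sketch of a dead-letter wrapper plus a windowed failure ratio, so bad events get counted and kept instead of silently dropped. `parse_event`, `send_to_dead_letter`, and `alert` are hypothetical placeholders (e.g. our parser, a producer to a dedicated DLQ Kafka topic, and our paging hook), not production code:

```python
from collections import deque

class ParseFailureMonitor:
    """Track parse outcomes over a sliding window and flag when the
    failure ratio crosses a threshold, so a misbehaving parser surfaces
    as one clear alert instead of 20% of events quietly vanishing."""

    def __init__(self, window_size=10_000, alert_ratio=0.01):
        self.alert_ratio = alert_ratio
        self.outcomes = deque(maxlen=window_size)  # True = parsed OK

    def record(self, ok):
        self.outcomes.append(ok)

    def failure_ratio(self):
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def should_alert(self):
        # Wait for a full window so a handful of early failures
        # doesn't page anyone.
        return (len(self.outcomes) == self.outcomes.maxlen
                and self.failure_ratio() > self.alert_ratio)


monitor = ParseFailureMonitor()

def process(raw_event):
    try:
        event = parse_event(raw_event)        # placeholder: your parser
        monitor.record(ok=True)
        return event
    except Exception:
        monitor.record(ok=False)
        send_to_dead_letter(raw_event)        # placeholder: DLQ producer
        if monitor.should_alert():
            # One loud signal instead of a noisy log line per bad event.
            alert(f"parse failure ratio {monitor.failure_ratio():.1%} "
                  f"over last {len(monitor.outcomes)} events")
        return None
```

The point of the dead-letter sink is that the raw failing events are retained, so when the alert fires you can diff them against the schema the parser expects instead of grepping noisy logs.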
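Second, the record-count checks that turned noisy. A fixed threshold pages constantly, so the sketch below (again illustrative, not our code) compares each interval's count to a rolling baseline and only alerts on a large deviation. Median/MAD is used because it tolerates the occasional weird interval better than mean/stddev:

```python
import statistics
from collections import deque

class CountBaseline:
    """Flag an interval's record count as anomalous when it deviates
    from the rolling median by more than k median-absolute-deviations."""

    def __init__(self, history=96, k=5.0, min_history=12):
        # e.g. history=96 fifteen-minute intervals ~= one day of baseline
        self.counts = deque(maxlen=history)
        self.k = k
        self.min_history = min_history

    def is_anomalous(self, count):
        anomalous = False
        if len(self.counts) >= self.min_history:
            median = statistics.median(self.counts)
            mad = statistics.median(abs(c - median) for c in self.counts) or 1.0
            anomalous = abs(count - median) > self.k * mad
        self.counts.append(count)  # new observation joins the baseline
        return anomalous


baseline = CountBaseline()
# Fed once per interval from wherever you already count records:
# if baseline.is_anomalous(interval_count): page someone
```

A ~20% drop like ours should clear a 5-MAD bar quickly, while normal wobble shouldn't. The obvious caveat: with strongly seasonal traffic you'd want one baseline per hour-of-day (or a comparison against the same interval a week earlier), otherwise the baseline itself becomes the noise.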
Originally posted by u/Impressive_Film2188 on r/ArtificialInteligence
