A new Nature Medicine paper stress-tested ChatGPT Health across 960 triage scenarios. 51.6% of true emergencies were under-triaged. The system recognized warning signs, then talked itself out of acting on them.

We replicated the study with August. 0% emergency under-triage. 64 out of 64.

I share this not as a victory lap but as a proof point for something I’ve been saying for a while: clinical AI that patients can trust is measured in years of work, not product launches. We’ve been building purpose-built clinical reasoning systems since long before health AI became a category. Specialty by specialty. Guideline by guideline. Failure mode by failure mode. And every time we think we’re close, we find another edge case that humbles us.

The difference between a general model answering health questions and a clinical system catching a rising pCO2 as a trajectory toward respiratory failure isn’t intelligence. It’s engineering depth. It’s knowing that DKA is by definition an emergency, not a variant of hyperglycemia. It’s thousands of clinical rules that no foundation model ships with out of the box (a rough sketch of what one such rule might look like is below).

Anyone can build a health chatbot. The market has made that clear. Building something a patient can take seriously when the stakes are real is a different problem entirely. It’s slower and harder in the short term. But it’s the only version that matters.

The paper calls for premarket safety evaluation of consumer health AI. We think that’s the floor, not the ceiling.
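
To make the "clinical rules" point concrete, here is a minimal Python sketch of a hypothetical deterministic rule layer that sits on top of a model's triage output and can only escalate it, never lower it. The `Vitals` fields, the `apply_rule_floor` function, and every threshold are illustrative assumptions for this example only; they are not August's actual implementation and not clinical guidance.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Vitals:
    """Illustrative inputs a rule layer might consume; fields and units are assumptions."""
    glucose_mg_dl: Optional[float] = None
    ph: Optional[float] = None
    ketones_present: bool = False
    pco2_trend_mmhg_per_hr: Optional[float] = None  # positive = rising across serial readings


def apply_rule_floor(model_triage: str, v: Vitals) -> str:
    """Deterministic guardrails that can only escalate the model's triage level.

    Triage levels are assumed to be "self-care" < "urgent" < "emergency".
    Thresholds below are illustrative, not clinical guidance.
    """
    # DKA is by definition an emergency: marked hyperglycemia plus acidosis plus ketones.
    if (v.glucose_mg_dl is not None and v.glucose_mg_dl > 250
            and v.ph is not None and v.ph < 7.30
            and v.ketones_present):
        return "emergency"

    # A rising pCO2 across serial readings is treated as a trajectory toward respiratory failure.
    if v.pco2_trend_mmhg_per_hr is not None and v.pco2_trend_mmhg_per_hr > 0:
        return "emergency"

    # No rule fired: keep whatever the model decided.
    return model_triage


# Example: the model under-triages a DKA presentation as "urgent"; the rule floor overrides it.
print(apply_rule_floor("urgent", Vitals(glucose_mg_dl=420, ph=7.12, ketones_present=True)))
# -> "emergency"
```

The point of the sketch is the asymmetry: rules act as a floor under the model, so a wrong rule can over-triage but can never talk the system out of escalating a true emergency.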
Originally posted by u/BoysenberryMelodic96 on r/ArtificialInteligence

