Original Reddit post

Been shipping voice agents into production across restaurants, beauty salons, dental clinics and HVAC for the last 8 months. The failure modes are not what demo videos show. Sharing what we measured across roughly 200k handled calls in EN and ES.

The scheduling logic problem is bigger than the voice problem. Most people obsess over latency, voice naturalness, interruption handling. Those matter. But the thing that actually breaks bookings is the AI not understanding dependencies in the calendar. Real example from a salon: client wants a balayage that takes 3 hours, stylist has a 2-hour gap. A simple bot books it and destroys the schedule. A slightly smarter bot says “I can’t help” and loses the booking. The right behavior is reasoning through alternatives: “Stylist A is busy but Stylist B can do this, and the service needs a wash station which is free at 4pm.” This isn’t a voice problem. It’s a planning problem wearing a voice interface. Most teams underinvest here because it’s not demo-able.

Multilingual is where most products quietly fail. Spanish callers code-switch into English mid-sentence. French Canadian customers expect Quebecois phrasing. Catalan callers will start in Catalan and switch to Spanish if the agent doesn’t catch it. The “we support 20 languages” claim usually means the TTS speaks 20 languages and the LLM was trained on English. In production, that gap is brutal. We measured: an agent built English-first and “translated” to ES has 22% lower booking completion than one trained natively on Spanish call data. Same LLM family, same stack, different training distribution.

Entity capture is the metric that matters, not WER. Vendors brag about word error rate. WER is a vanity metric. The real number is entity capture accuracy: did the bot get the phone number right, did it get the date right, did it get the service right. We see 94% general word accuracy (about 6% WER) paired with 71% booking accuracy on the same calls.
Those are different failures and they need different fixes (custom vocab, confirmation loops, structured slot filling, redundant confirmation on high-stakes entities only).

Owner-side editability is the under-discussed product problem. SMB owners want to tweak agent behavior daily. “Don’t take bookings for color services after 5pm.” “If they ask about gluten-free, say yes and mention the menu page.” Every product that requires a support ticket for this loses retention. Every product that gives owners full prompt access creates regression spirals where a small edit breaks something else two weeks later. The thing that works is a constrained editor: structured rules with guardrails, not free-form prompt access. Nobody has fully solved this.

Cost ceiling is real and most pitches dodge it. A decent voice stack (STT + LLM + TTS + telephony) lands around 0.12 to 0.18 EUR per minute in 2026. That works for a salon with 60 EUR AOV bookings. It does not work for a pizza place taking 15 EUR orders. The honest answer is that voice AI is not viable for the bottom tier of SMB ticket sizes. Most vendors will sell to them anyway, and the unit economics quietly fall apart for the customer in month 3.

Open questions I’m genuinely curious about:

How are you handling confidence-based handoff to a human? The thresholds drift as you change prompts, and I haven’t seen anyone with a clean re-calibration process.

What’s working for evals beyond LLM-as-judge? Judge models miss the failures customers actually complain about. The complaint signal lags 2 to 3 weeks behind the prompt change that caused it.

Anyone solved the “calendar reasoning” problem with a clean architecture? Most teams (us included) end up with a hybrid: the LLM proposes, a deterministic layer validates, but the seams show.

Not pitching anything. Genuinely interested in how other teams at production volume are solving these. The public conversation in this space is unusually opaque about what actually works.
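
For readers unfamiliar with the "LLM proposes, a deterministic layer validates" hybrid from the last open question, here is a minimal sketch of what the validating side could look like, using the salon example from the post. It is not the poster's implementation. The `Busy` and `Proposal` shapes, the resource names, and the dates are all hypothetical, and it assumes (as a simplification) that the wash station is reserved for the whole service.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Busy:
    """An existing reservation on a resource (a stylist or a wash station)."""
    resource: str
    start: datetime
    end: datetime


@dataclass(frozen=True)
class Proposal:
    """Hypothetical shape of what the LLM proposes after talking to the caller."""
    stylist: str
    station: str
    start: datetime
    duration: timedelta


def is_free(resource: str, start: datetime, end: datetime, calendar: list) -> bool:
    """True if no reservation on `resource` overlaps the half-open span [start, end)."""
    return all(
        b.resource != resource or b.end <= start or b.start >= end
        for b in calendar
    )


def validate(p: Proposal, calendar: list) -> list:
    """Deterministic check: return the violated constraints (empty list means OK)."""
    end = p.start + p.duration
    problems = []
    if not is_free(p.stylist, p.start, end, calendar):
        problems.append(f"{p.stylist} is not free for the full duration")
    if not is_free(p.station, p.start, end, calendar):
        problems.append(f"{p.station} is not free for the full duration")
    return problems


def first_valid(proposals: list, calendar: list) -> Optional[Proposal]:
    """Try proposals in order and return the first one that passes validation."""
    for p in proposals:
        if not validate(p, calendar):
            return p
    return None


def at(hour: int) -> datetime:
    """Helper for the example: a time on an arbitrary, made-up day."""
    return datetime(2026, 1, 15, hour)


# Made-up calendar: Stylist A has only a 2-hour gap starting at 14:00,
# and the wash station is occupied until 16:00.
calendar = [
    Busy("stylist_a", at(16), at(18)),
    Busy("wash_1", at(13), at(16)),
]
balayage = timedelta(hours=3)
proposals = [
    Proposal("stylist_a", "wash_1", at(14), balayage),  # overruns A's gap
    Proposal("stylist_b", "wash_1", at(14), balayage),  # station still busy
    Proposal("stylist_b", "wash_1", at(16), balayage),  # B at 4pm, station free
]
chosen = first_valid(proposals, calendar)
```

The point of returning the violated constraints as text, rather than a bare yes/no, is that they can be fed back to the model so its next proposal is informed instead of random. That feedback loop is left out here.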
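
The cost-ceiling argument can also be made concrete with a little arithmetic. The per-minute range (0.12 to 0.18 EUR) and the ticket sizes (60 EUR and 15 EUR) come from the post. The call length and the number of handled calls per completed booking are not given, so the values below are assumptions for illustration only.

```python
def voice_cost_per_booking(
    cost_per_min_eur: float,
    avg_call_min: float,
    calls_per_booking: float,
) -> float:
    """Voice spend attributable to one completed booking or order."""
    return cost_per_min_eur * avg_call_min * calls_per_booking


def cost_share(ticket_eur: float, cost_per_booking_eur: float) -> float:
    """Voice spend as a fraction of the ticket value."""
    return cost_per_booking_eur / ticket_eur


# Assumed, not from the post: 3-minute calls, and 2 handled calls
# for every completed booking or order.
AVG_CALL_MIN = 3.0
CALLS_PER_BOOKING = 2.0

for label, ticket in [("salon booking", 60.0), ("pizza order", 15.0)]:
    for rate in (0.12, 0.18):
        per_booking = voice_cost_per_booking(rate, AVG_CALL_MIN, CALLS_PER_BOOKING)
        share = cost_share(ticket, per_booking)
        print(
            f"{label}: {rate:.2f} EUR/min -> "
            f"{per_booking:.2f} EUR per booking, {share:.1%} of ticket"
        )
```

Whatever call length you plug in, the voice cost per call is the same for both businesses, so it takes four times the share of a 15 EUR ticket that it takes of a 60 EUR one. Whether that share is survivable depends on margins, which the post does not give.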

Originally posted by u/No-Zone-5060 on r/ArtificialInteligence