Spent the last 18 months building a voice and conversational AI platform deployed in production for service businesses. Sharing concrete observations because the gap between voice AI demos and voice AI in production is wider than most public discussion admits, and I wish someone had documented this when I started.

Context

Production deployments across restaurants, hospitality, HVAC, dental, and e-commerce support. English and Spanish in production; the architecture supports 20+ languages. Five channels sharing the same orchestration and conversation state: voice calls, WhatsApp, Instagram, web chat, email. Built our own voice pipeline rather than wrapping Vapi or Retell, because the cost structure didn’t survive customer pricing otherwise.

What broke first

Names. Speech-to-text engines that hit 95% accuracy on benchmark datasets dropped to 65-72% on real customer phone calls. Spanish names in California, Eastern European names in trade services, accented English with background noise. Every misheard name was a customer who felt unheard. Rebuilt our name handling pipeline three times before it stopped being the top complaint.

Time references. “Tomorrow morning” means 8am to a contractor and 10am to a customer. “Around 3” gets logged as 3:00 sharp. The number of edge cases in natural time parsing across cultures and trades is much larger than off-the-shelf libraries handle. Every booking error from time misinterpretation cost the operator real money.

Interruptions. When a caller jumps in mid-sentence, the system needs to know whether they’re correcting, agreeing, or asking a new question. Getting this wrong feels worse than slow response time. Operators told us callers prefer waiting an extra half-second to being talked over.

Silence handling. A 4-second silence in a phone call feels eternal. Cutting in too aggressively makes the system feel pushy. The right pause length varies by vertical. Restaurant callers tolerate longer pauses than HVAC emergency callers.
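
A minimal sketch of what per-vertical silence thresholds could look like, assuming a simple lookup keyed by vertical. The names `SilencePolicy`, `policy_for`, and `should_reprompt` are hypothetical, and the second values are placeholders for illustration, not measured numbers from these deployments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SilencePolicy:
    """How long to wait on caller silence before the agent speaks again."""
    reprompt_after_s: float  # seconds of silence before a gentle re-prompt


# Placeholder values only (assumption): the real numbers come from tuning.
DEFAULT_POLICY = SilencePolicy(reprompt_after_s=2.5)
POLICIES = {
    "restaurant": SilencePolicy(reprompt_after_s=3.5),      # tolerates longer pauses
    "hvac_emergency": SilencePolicy(reprompt_after_s=1.5),  # urgent, shorter patience
}


def policy_for(vertical: str) -> SilencePolicy:
    """Return the policy for a vertical, falling back to the default."""
    return POLICIES.get(vertical, DEFAULT_POLICY)


def should_reprompt(vertical: str, silence_s: float) -> bool:
    """True once the caller has been silent at least as long as the threshold."""
    return silence_s >= policy_for(vertical).reprompt_after_s
```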
We tune this per use case.

The economics nobody discusses honestly

Most voice AI platforms advertise a base price per minute somewhere between 5 and 15 cents. What’s hidden: the base rate excludes prompt tokens, conversation context, function calls for business logic, knowledge base retrieval, voice cloning, and routing. By the time you stack what an actual production deployment needs, real cost lands at 15-25 cents per minute.

For a small business doing 1500 minutes of calls per month, that’s $250-400 in raw infrastructure before margin. The business can usually afford $200-300 a month total for the solution. The economics don’t survive contact with the customer.

This is why most voice AI deployments aimed at SMBs quietly die after 6 months. The model worked in the pilot when the founder was eating the cost. It stopped working when someone tried to make money on it.

What surprised me about operators

They care less about the AI sounding human than I expected. They care a lot about the AI being predictable. An operator can train their team around “the AI always asks for callback number before transferring.” They cannot train around “the AI sometimes does X, sometimes Y.”

They want logs, not magic. The operators who renewed were not the ones impressed by the demo. They were the ones who could pull up a transcript at 9pm and understand exactly what happened on a missed call earlier that day.

They quietly modify their own scripts after launch. Within two weeks of deployment, almost every operator was suggesting changes to greetings or specific scenario handling. The product became collaborative whether we designed it that way or not. The ones who got value were the ones we built self-edit tools for. The ones who churned were the ones who waited for us to make changes.

What still keeps me up

How to handle multilingual scenarios where the caller switches mid-call without latency spikes.
How to keep the system useful when STT drops a critical word and the LLM confidently guesses wrong.
How to make voice AI economics work for the bottom 60% of SMBs where the cost floor is currently too high.

Open questions for anyone else building in this space

How are you handling the cost-to-quality tradeoff at the SMB tier? The per-minute infrastructure floor is currently too high for the segment that needs it most.

How are you measuring “the AI is good enough”? Demo metrics like response latency and STT accuracy stop predicting customer satisfaction once you’re in production.

What’s your approach to the operator self-edit problem? Customers want to modify behavior without filing tickets, but giving them full prompt control creates new failure modes.

Curious what others working on voice or any latency-sensitive AI have measured. This space has unusually opaque public conversation about what actually works at production scale, and I think it holds back honest discussion of what’s viable.

(If you’re a builder or agency working in adjacent space, happy to compare notes directly. Not pitching, just genuinely interested in how other teams are solving the same problems.)

submitted by /u/No-Zone-5060
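
For anyone comparing notes on the per-minute floor, here is a rough sketch of the arithmetic behind the economics section. It uses only the figures quoted in the post (1500 minutes a month, 5-15 cents advertised, 15-25 cents fully stacked, a $200-300 monthly budget). The function and variable names are hypothetical, and the model assumes every cost scales linearly per minute, which is a simplification.

```python
def monthly_cost_dollars(minutes: int, cents_per_minute: int) -> float:
    """Monthly cost in dollars; integer cents keep the arithmetic exact."""
    return minutes * cents_per_minute / 100


MINUTES_PER_MONTH = 1500     # call volume quoted in the post
ADVERTISED_CENTS = (5, 15)   # advertised base rate range, from the post
STACKED_CENTS = (15, 25)     # fully stacked rate range, from the post
BUDGET_DOLLARS = (200, 300)  # what the business can pay in total, from the post

advertised = tuple(monthly_cost_dollars(MINUTES_PER_MONTH, c) for c in ADVERTISED_CENTS)
stacked = tuple(monthly_cost_dollars(MINUTES_PER_MONTH, c) for c in STACKED_CENTS)

# Best case: highest budget against the cheapest stack. Worst case: the reverse.
best_case_margin = BUDGET_DOLLARS[1] - stacked[0]
worst_case_margin = BUDGET_DOLLARS[0] - stacked[1]

print(f"advertised: ${advertised[0]:.0f}-{advertised[1]:.0f}")
print(f"stacked:    ${stacked[0]:.0f}-{stacked[1]:.0f}")
print(f"margin:     {worst_case_margin:.0f} to {best_case_margin:.0f} dollars")
```

Straight multiplication of the stacked rate gives $225-375 a month, in the same range as the $250-400 quoted in the post (which does not show its own working). Against a $200-300 budget, that leaves somewhere between a $175 loss and $75 of headroom before anyone takes a margin.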
Originally posted by u/No-Zone-5060 on r/ArtificialInteligence
