building a voice AI for restaurants and salons for the last 6 months. wanted to share some technical reality vs the “800ms latency” demos everyone shows.

what nobody talks about: latency is bimodal, not a single average. demos quote the median; real users churn on the p95. our median is ~800ms, our p95 is 2.4s, and the p95 is what decides whether the agent feels human or broken. it comes from rare edge cases: a model retry after a malformed function-call output, slow tool execution (a calendar lookup against a sluggish third-party API), VAD misfires on background noise. you only find these by timing each stage separately (first sketch at the bottom).

interruption handling breaks more often than the conversation itself. users interrupt the agent constantly, and naive VAD treats every cough or burst of background noise as an interruption. we ended up with a 3-layer system: the raw VAD signal + a semantic check (is what they said actually a continuation rather than a request to stop?) + an acoustic energy threshold (second sketch at the bottom). it's still wrong maybe 5% of the time.

function-calling reliability degrades with prompt length. with the system prompt under 1.5k tokens, function-call accuracy is 96%; above 3k tokens it drops to 84% on the same model. nobody tells you this when you stuff personality, business rules, and few-shot examples into one prompt (third sketch at the bottom shows one way to budget it).

TTS choice matters more than LLM choice for perceived quality. users complain about a robotic voice 10x more than about wrong answers. swapping the LLM from GPT-4 to Claude or Gemini moved business metrics ~2%; swapping TTS from a generic voice to ElevenLabs Flash moved booking conversion 14%.

multilingual is a tax on everything. we support 50+ languages, and each language adds: separate TTS voice tuning, separate VAD calibration (some languages have more sibilants, which confuse the VAD), and separate few-shot examples in the prompt (fourth sketch at the bottom). cost per call in Russian is ~40% higher than in English purely because of these calibrations.

anyone else running voice agents in production? curious what your p95 looks like and how you're handling the multilingual cost explosion.
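the sketches mentioned above, in order. first, per-stage timing, which is how you find out where the p95 actually comes from. minimal python sketch; the stage names and pipeline shape are placeholders, not our actual code:

```python
# hypothetical per-stage latency instrumentation: stage names and the
# pipeline structure below are placeholders, not a real system
import time
from collections import defaultdict
from contextlib import contextmanager

stage_samples = defaultdict(list)  # stage name -> list of latencies (seconds)

@contextmanager
def timed(stage):
    """record wall-clock time for one pipeline stage of one turn."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_samples[stage].append(time.perf_counter() - start)

def report():
    # p95 per stage tells you *where* the tail lives; a single
    # end-to-end median hides it entirely
    for stage, xs in sorted(stage_samples.items()):
        xs = sorted(xs)
        p50 = xs[len(xs) // 2]
        p95 = xs[min(int(len(xs) * 0.95), len(xs) - 1)]
        print(f"{stage:8s} p50={p50*1000:6.0f}ms p95={p95*1000:6.0f}ms n={len(xs)}")

# per turn, wrap each stage (transcribe/generate/run_tool/synthesize
# are stand-ins for real calls):
#   with timed("asr"):  text = transcribe(audio)
#   with timed("llm"):  reply = generate(text)
#   with timed("tool"): result = run_tool(reply)
#   with timed("tts"):  audio_out = synthesize(reply)
```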
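second, the 3-layer interruption gate, roughly. the thresholds are made up, and `looks_like_continuation` is a word-list stand-in for what would really be a small classifier or a cheap LLM call:

```python
# sketch of the 3-layer interruption gate: thresholds and the
# semantic check are illustrative assumptions
import numpy as np

VAD_THRESHOLD = 0.6   # speech probability from the VAD model (made up)
ENERGY_FLOOR = 0.02   # RMS gate, tuned per deployment (made up)

BACKCHANNELS = {"yeah", "ok", "okay", "mhm", "uh huh", "right"}

def looks_like_continuation(partial_transcript: str) -> bool:
    # stand-in for the semantic layer: is this a backchannel
    # ("mhm") rather than a real attempt to take the floor?
    return partial_transcript.lower().strip() in BACKCHANNELS

def is_real_interruption(frame: np.ndarray, vad_prob: float,
                         partial_transcript: str) -> bool:
    # layer 1: VAD must think this is speech at all
    if vad_prob < VAD_THRESHOLD:
        return False
    # layer 2: acoustic energy gate, filters quiet background
    # noise that still fools the VAD
    rms = float(np.sqrt(np.mean(frame.astype(np.float64) ** 2)))
    if rms < ENERGY_FLOOR:
        return False
    # layer 3: semantic check, only reached for plausible speech
    if looks_like_continuation(partial_transcript):
        return False
    return True
```

the ordering is deliberate: the two cheap acoustic checks run first, so the semantic check (the expensive one) only fires on plausible speech.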
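third, one way to handle the prompt-length cliff (a sketch, not necessarily what you should ship): treat the system prompt as a token budget and shed few-shot examples first. this assumes tiktoken's cl100k_base tokenizer; use whatever your model actually tokenizes with:

```python
# sketch of a system-prompt token budget: the 1500-token cap matches
# the accuracy numbers above, but tune it for your own model
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer choice is an assumption

def ntokens(s: str) -> int:
    return len(enc.encode(s))

def build_system_prompt(core: str, rules: str, examples: list[str],
                        budget: int = 1500) -> str:
    # core instructions and business rules are non-negotiable;
    # few-shot examples (ordered most- to least-important) get shed
    # from the tail once the budget runs out
    parts = [core, rules]
    used = sum(ntokens(p) for p in parts)
    for ex in examples:
        cost = ntokens(ex)
        if used + cost > budget:
            break
        parts.append(ex)
        used += cost
    return "\n\n".join(parts)
```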
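and fourth, what the multilingual tax looks like in practice: every field below has to be hand-tuned per language. the names and values are illustrative, not our real configs:

```python
# per-language calibration as data: illustrative fields/values only
from dataclasses import dataclass, field

@dataclass
class LangProfile:
    tts_voice: str           # per-language voice id for the TTS vendor
    vad_threshold: float     # sibilant-heavy languages need a higher bar
    energy_floor: float      # acoustic gate, tuned on per-language audio
    few_shot: list[str] = field(default_factory=list)  # prompt examples

PROFILES = {
    "en": LangProfile(tts_voice="en_voice_1", vad_threshold=0.60,
                      energy_floor=0.020, few_shot=["..."]),
    "ru": LangProfile(tts_voice="ru_voice_1", vad_threshold=0.70,
                      energy_floor=0.025, few_shot=["..."]),
    # ...~50 more of these, each one hand-tuned; this is where the cost goes
}

def profile_for(lang: str) -> LangProfile:
    return PROFILES.get(lang, PROFILES["en"])  # fall back to english
```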
Originally posted by u/No-Zone-5060 on r/ArtificialInteligence
