AI voice cloning has gotten good enough that synthetic audio can fool call center agents into thinking they’re talking to a real person. A recently granted patent from a voice security company breaks down how production deepfake detection actually works at scale, and the approach is more interesting than I expected. The system runs four signals simultaneously on every inbound call.

Signal 1: Your response timing is suspicious even if your voice isn’t

Every time an agent finishes speaking, the system measures how long it takes the caller to respond. A single delayed response doesn’t flag anything. What it’s looking for is the statistical pattern across the entire call — variance, interquartile range, mean response time. AI-generated speech has processing overhead that creates a latency signature humans don’t have. The system also adjusts what counts as “normal” based on what’s being asked — a simple yes/no prompt gets a different baseline than a complex verification question.

Signal 2: Asking you the same question twice

The IVR deliberately repeats questions. A human repeating themselves sounds slightly different each time — different cadence, slight pitch variation, different word choice. Pre-recorded or AI-generated audio played twice is near-identical. The system scores acoustic similarity between both responses. Fraudulent callers cluster around 90-95% similarity; humans around 30-50%.

Signal 3: The background noise doesn’t match

Real phone calls have a consistent ambient signature throughout — the same background noise, the same reverberation, the same signal-to-noise ratio from start to finish. When a fraudster switches from their real voice to playing synthetic audio mid-call, the background profile shifts. Noise type changes. Reverberation changes. The system runs a continuous classifier on non-speech audio to catch exactly that discontinuity.

Signal 4: Fingerprinting how the audio was generated

This is the most technically interesting part.
Just as a camera sensor leaves a unique noise fingerprint on every photo it takes, speech synthesis systems leave artifacts in the audio they produce. The system extracts an embedding specifically trained to capture those synthesis artifacts, then maps it to a liveness score. When a new type of attack gets flagged by a human agent, the model partially or fully retrains depending on how novel the attack is.

Why use all four instead of just the best one?

Because each signal breaks in isolation once attackers know about it. Add pitch variation between repetitions and you beat Signal 2. Record in your actual environment and you beat Signal 3. Use a TTS architecture the model hasn’t seen and you beat Signal 4. Defeating all four simultaneously is a meaningfully harder problem.

The part that’s actually hard to replicate

The architecture here is reproducible. The signals are well documented, the model choices are standard, and you could build a functionally similar system without infringing on this patent. The hard part is the data. The company behind this has processed over a billion calls with continuous human feedback on novel attack types. That training signal isn’t something you can reconstruct from scratch. The patent describes the lock. The data is the key — and it doesn’t appear anywhere in the filing.

The open question

Detection models trained on controlled datasets consistently drop 40-50% in accuracy on real-world audio. The synthesis-fingerprinting approach has the same exposure: it’s trained on artifacts from older vocoder-based TTS systems, and newer codec-based speech generation works through a fundamentally different mechanism. Early evidence suggests codec-based voices are 20-30% harder to detect than older approaches. The patent was filed in November 2023; production codec-based TTS is largely a 2024-2025 development. Whether the detection holds up across that architectural gap is genuinely unclear.

Sharing this because I go through patent filings regularly and this one stood out.
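For anyone who wants to poke at these ideas, here is a minimal sketch of the Signal 1 call-level timing statistics (mean, variance, interquartile range against a prompt-adjusted baseline). The thresholds and baseline values are my own illustrative assumptions, not anything from the patent:

```python
import statistics

def timing_features(latencies_ms):
    """Summarize caller response latencies over a whole call.

    latencies_ms: delay (ms) between each agent prompt ending
    and the caller starting to respond.
    """
    q1, _, q3 = statistics.quantiles(latencies_ms, n=4)
    return {
        "mean": statistics.mean(latencies_ms),
        "variance": statistics.pvariance(latencies_ms),
        "iqr": q3 - q1,  # spread of the middle 50% of responses
    }

def timing_suspicion(latencies_ms, baseline_mean_ms, tol_ms=250):
    """Flag a call whose latency pattern deviates from the human
    baseline for this prompt type: mean shifted by synthesis
    overhead, or responses suspiciously uniform (tiny IQR).
    All cutoffs here are illustrative."""
    f = timing_features(latencies_ms)
    return abs(f["mean"] - baseline_mean_ms) > tol_ms or f["iqr"] < 50
```

Note the per-prompt `baseline_mean_ms` argument — that is where the patent's idea of a different baseline for a yes/no prompt versus a complex verification question would plug in.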
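Signal 2 reduces to a similarity score between two responses to the same repeated question. A real system would compare acoustic features (e.g. MFCCs); this sketch just takes precomputed feature vectors and applies cosine similarity, with the 0.90 cutoff taken from the 90-95% fraud cluster mentioned above:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def repetition_flag(feat_first, feat_repeat, threshold=0.90):
    """Flag near-identical repeats. Replayed or synthetic audio
    clusters around 0.90-0.95 similarity; humans around 0.30-0.50,
    so a threshold anywhere between those bands separates them."""
    return cosine_similarity(feat_first, feat_repeat) >= threshold
```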
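Signal 3's continuous classifier is a trained model in the patent; the simplest stand-in is tracking the noise floor of non-speech frames and flagging a sudden jump. The adaptation rate and the 2x ratio below are assumptions for illustration:

```python
import math

def rms(frame):
    """Root-mean-square level of one audio frame (list of samples)."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def background_discontinuity(noise_frames, ratio=2.0):
    """Detect a shift in the ambient signature across a call.

    noise_frames: non-speech sample frames in call order.
    Flags the call if any frame's noise level departs from a slowly
    adapting baseline by more than `ratio` in either direction --
    the kind of discontinuity a mid-call switch to playback causes.
    """
    baseline = rms(noise_frames[0])
    for frame in noise_frames[1:]:
        level = rms(frame)
        if level > baseline * ratio or level < baseline / ratio:
            return True
        baseline = 0.9 * baseline + 0.1 * level  # slow adaptation
    return False
```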
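Signal 4's artifact-sensitive embedding extractor is the trained network at the heart of the patent and can't be reproduced here; what can be sketched is the last step, mapping an embedding to a liveness score. A logistic scoring head is my assumption of what that mapping could look like — the filing only says the embedding "maps to" a score:

```python
import math

def liveness_score(embedding, weights, bias=0.0):
    """Map a synthesis-artifact embedding to a liveness probability
    via a logistic layer. `weights` and `bias` stand in for whatever
    learned scoring head sits on top of the real embedding model.
    Returns a value in (0, 1); 1.0 means confidently live speech."""
    z = sum(w * x for w, x in zip(weights, embedding)) + bias
    return 1.0 / (1.0 + math.exp(-z))
```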
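Finally, the "run all four simultaneously" idea implies some fusion step. The patent doesn't disclose how the signals combine, so the weighted sum and weights below are purely illustrative — a production system would presumably learn the combination:

```python
def fuse_signals(timing_hit, repetition_hit, background_hit, liveness):
    """Combine the four signals into one risk score in [0, 1].
    Boolean flags from Signals 1-3 plus the Signal 4 liveness
    probability; the weights are invented for illustration."""
    risk = 0.0
    risk += 0.25 if timing_hit else 0.0
    risk += 0.30 if repetition_hit else 0.0
    risk += 0.20 if background_hit else 0.0
    risk += 0.25 * (1.0 - liveness)  # low liveness raises risk
    return risk  # e.g. escalate to a human agent above ~0.5
```

The point of fusing rather than picking a winner is exactly the argument above: an attacker who beats any single signal still has to clear the combined threshold.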
Originally posted by u/Leather_Carpenter462 on r/ArtificialInteligence
