We’ve been messing around with the new OpenAI realtime voice + translation APIs over the last little while and I keep coming back to the same thought… I don’t think people fully get where this is going yet. We wired it into our own website as a test. Nothing fancy. Just wanted to see what actually breaks when you let people talk to a site instead of click through it. At first I thought it would just feel like a slightly better chatbot. It doesn’t. Once I hooked it into tools and gave it the ability to actually do things (we’re using the Agents SDK + Playwright for web browsing and control by a sub-agent), the whole interaction changed. I can literally just talk to the site like I would talk to a person and it can move around, pull info, trigger actions, and respond in context. I wanted a layer that that could navigate and respond by just talking. I know that sounds obvious, but it’s not how websites are designed at all. Ours certainly was not. A few things that have been interesting (and honestly a bit brutal) is how quickly this exposed weak structure. Our content was vague… so if your metadata sucks, if your pages are bloated or unclear… voice didn’t let us hide behind a pretty UI design. The model just struggles or gives bad answers immediately. There’s no masking it with a nice UI. Latency has improved way more than I expected with the new voice model API. Before, when someone was talking, even small delays felt awkward. The new Realtime 2API tolerates those pauses wonderfully. We also started playing with the realtime translation side and that also feels like a bigger deal than it’s getting credit for. Not in a “multi-language support” way, more like… you just speak however you want and the system handles it. No toggles, no switching context. It’s subtle but it completely changes the feel. Our website is language agnostic. (13 supported languages using the Realtime 2 API) The bigger shift for me seems to be changing the way I want to think about websites and interactions. People don’t think in menus. They don’t think in pages. They don’t think in navigation. They think by intent and the second I added voice, i was forced to deal with that reality whether our website system was not ready. Great learning lesson. My Takeaway so far: Right now most of what I’m hearing and reading, people/businesses treats voice like a feature. Like and Add-on. Cool. Nice to have. Unsure if its practical. I don’t think that’s where this ends. I think this starts pushing toward systems you can just interact with directly. Personal assistants that actually execute. Internal tools you can talk to. Intake flows that don’t feel like forms. Stuff like that. Minimal website visuals. More dynamically displayed content based on interpretation of user intent. [Basically a cool wave form that animates differently depending on interaction stage] No direct site content visually. We’re still early and there’s definitely some friction [writing a second voice prompt on top of the text prompt so there is parity between our text chat and voice chat, but I’m pretty bullish on this direction - Guardrails, Rate-limits, Prompt Injection…]. Curious if anyone else here is actually building with it yet and what you’re running into. Feels like we’re right on the edge between “cool demo” and “this changes how software works,” and I’m not sure which way most people are approaching it yet. submitted by /u/Early-Matter-8123
Originally posted by u/Early-Matter-8123 on r/ArtificialInteligence
