Original Reddit post

I knew text-based character chat was already working as a category, especially after seeing Character.AI take off, with founders who came from Google/LaMDA-type work. But it feels like the next step might be moving from text chat into real-time video interaction.

I tried Mel recently, and the interesting part to me wasn’t just that it lets you talk to characters. It was the whole interaction stack: voice input, lip sync, camera-aware responses, facial reactions, and a video character that felt much less static than the usual avatar/chatbot setup. For example, if the user is visibly on a plane, the character can ask if they’re on a plane. If the user is in a bathroom, it can notice that context too.

I’m not sure how much of the video is truly changing in real time vs. using some clever prebuilt animation/rendering system, but the lip sync was surprisingly good and the interaction felt more dynamic than most AI social apps I’ve seen so far.

For people working on multimodal or agentic interfaces, what do you think is technically hardest here?

- low-latency vision understanding
- speech timing
- lip sync
- real-time avatar rendering
- memory/context
- making it feel unscripted instead of like a scripted NPC

My guess is that the challenge is less about any single model and more about orchestration: keeping voice, vision, language, animation, and memory synced without making the whole thing feel delayed or fake.

Do you think real-time video becomes a serious AI interface, or is it mostly a novelty until latency/animation quality improves?
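On the orchestration guess above, here's a minimal Python sketch of what I mean. It is purely illustrative and says nothing about how Mel or any real product is built. The stage names, the latency numbers, and the 500 ms budget are all made-up assumptions. The idea: input stages that don't depend on each other run concurrently under a shared time budget, so the turn costs roughly the slowest stage plus the language step instead of the sum of all of them.

```python
import asyncio
import time

# Hypothetical per-stage latencies in milliseconds (made-up, not measured).
INPUT_STAGES_MS = {"vision": 30, "speech_to_text": 20, "memory_lookup": 10}
LANGUAGE_MS = 40   # hypothetical language-model step, needs the inputs first
BUDGET_MS = 500    # assumed budget for gathering inputs in one turn


async def run_stage(name: str, delay_ms: int) -> str:
    """Stand-in for a real model call; it only sleeps for delay_ms."""
    await asyncio.sleep(delay_ms / 1000)
    return name


async def perceive() -> list:
    """Run the independent input stages concurrently."""
    stages = [run_stage(n, d) for n, d in INPUT_STAGES_MS.items()]
    return await asyncio.gather(*stages)  # results keep the input order


async def respond_turn():
    """One conversational turn: gather inputs within budget, then reply."""
    start = time.perf_counter()
    try:
        context = await asyncio.wait_for(perceive(), timeout=BUDGET_MS / 1000)
    except asyncio.TimeoutError:
        context = []  # over budget: reply without fresh context
    await run_stage("language", LANGUAGE_MS)
    return context, time.perf_counter() - start


def sequential_latency_ms() -> int:
    """Cost if every stage waited for the previous one."""
    return sum(INPUT_STAGES_MS.values()) + LANGUAGE_MS


def concurrent_latency_ms() -> int:
    """Cost if the input stages overlap: slowest input plus language."""
    return max(INPUT_STAGES_MS.values()) + LANGUAGE_MS


if __name__ == "__main__":
    context, elapsed = asyncio.run(respond_turn())
    print("stages completed:", context)
    print(f"measured turn time: {elapsed * 1000:.0f} ms")
    print("sequential estimate:", sequential_latency_ms(), "ms")
    print("concurrent estimate:", concurrent_latency_ms(), "ms")
```

A real pipeline would also have to stream audio and animation frames rather than wait for whole results, which this sketch does not attempt. The timeout fallback is one possible design choice: replying slightly less informed may feel better than replying late.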

Originally posted by u/DonutRare5633 on r/ArtificialInteligence