Original Reddit post

I’m interested in building apps where voice, sound, or listening is a core part of the experience, not just an add-on. For people who have experimented with this: how are you getting high-quality audio output in vibe-coded or AI-assisted apps? A lot of current LLM workflows seem to rely heavily on TTS engines, and that feels like a bottleneck: even when the text generation is strong, the final voice/audio experience can still sound flat, unnatural, or low quality.

I’m curious about:

- What models or engines are people using for voice-first apps?
- Are there better approaches than simply connecting an LLM to a TTS API?
- How do you prompt or structure the system to get more natural, expressive, or context-aware audio?

I know a lot of LLMs were trained on speech corpora, but their own output doesn’t have the same quality of speech delivery. Would love to hear what people have tried, what works, and where the current limitations are.
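For concreteness, the baseline I mean by “connecting an LLM to a TTS API” is roughly the sketch below (Python with the OpenAI SDK; the model names, voice, and prompt are just placeholders, swap in whatever stack you use). The issue is that the TTS step only ever sees the plain reply text, so any conversational context or intended emotion is gone by the time audio is generated:

```python
from openai import OpenAI

client = OpenAI()

# Step 1: generate the reply text with an LLM (model name is a placeholder)
reply = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Give me a two-sentence weather update, upbeat."}],
).choices[0].message.content

# Step 2: hand the raw text to a TTS endpoint; prosody, emphasis, and
# conversational context are not passed along, which is where the flatness comes from
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=reply,
)
speech.write_to_file("reply.mp3")
```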

Originally posted by u/Only-Vegetable8616 on r/ArtificialInteligence