Original Reddit post

I spent the last few hours playing around with the Qwen3.5-Omni model that launched today. To be honest, I was skeptical about the “Audio-Visual Captioning” claims, so I gave it a real stress test by uploading a raw, pitch-black video filmed in a forest in Poland.

Most models I’ve used would just see a dark blob, but this one managed to generate a full 18-shot script-level breakdown with millisecond timestamps.

What really caught me off guard wasn’t just the summary, but the granular details it picked up in near-total darkness. It accurately identified a person cupping water in their hands, mentioned the specific color of their nails, and even caught the subtle sound of tent stakes hitting the ground.

It supports a 256k context window, which supposedly handles up to 10 hours of audio or 1 hour of video. The technical brief mentions it beats Gemini 3.1 Pro on pure audio tasks, and after seeing it transcribe foreign voiceovers perfectly in this dark footage, I’m starting to believe it.

Has anyone else tried pushing its limits with really long or low-quality footage yet? I’m curious if this level of accuracy holds up over a 30-minute clip.
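For anyone who wants to replicate the setup, here’s a minimal sketch of the request shape, assuming an OpenAI-compatible chat endpoint. The base URL, model id, and the `video_url` content part are all placeholders on my end (some Qwen-serving stacks accept video that way, but check your provider’s docs before copying this):

```python
import requests

# All of these are placeholders: swap in your provider's actual
# base URL, API key, and model id.
API_BASE = "https://your-provider.example/v1"
API_KEY = "sk-..."
MODEL = "qwen3.5-omni"  # hypothetical model id

payload = {
    "model": MODEL,
    "messages": [
        {
            "role": "user",
            "content": [
                # "video_url" content parts are an assumption here;
                # some OpenAI-compatible servers use a different key.
                {"type": "video_url",
                 "video_url": {"url": "https://example.com/dark_forest.mp4"}},
                {"type": "text",
                 "text": "Give me a shot-by-shot breakdown with timestamps, "
                         "including any sounds you can identify."},
            ],
        }
    ],
}

resp = requests.post(
    f"{API_BASE}/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=600,  # long videos can take a while to process
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

If you try this on a 30-minute clip, I’d be curious whether the timestamps stay consistent across the whole run or drift toward the end.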

Originally posted by u/GharKiMurgi on r/ArtificialInteligence