I built an AI that watches livestreams and verifies if humans completed real-world tasks


Most AI use cases are about generating things: text, images, code. I built something that goes the other direction. The AI watches a human doing a physical task on a livestream and decides whether they actually did it.

The backstory: there’s a platform called RentHuman where AI agents hire humans for physical tasks. The agent posts a job, a human does it and gets paid. But the verification was just “upload a photo when you’re done.” That’s not real verification, so I built VerifyHuman as the missing piece.

How it works: the human accepts a task, starts a YouTube livestream, and does the work on camera. A vision language model (VLM) watches the stream in real time. The agent defines the conditions in plain English, like “person is washing dishes in a kitchen sink with running water” or “bookshelf is organized with books standing upright.” When the VLM confirms the conditions are met, payment releases from escrow. No human reviews anything.

This project won the IoTeX hackathon and placed top 5 at the 0G hackathon at ETHDenver.

What surprised me:

The VLM is good at understanding context, not just detecting objects. It knows the difference between “dishes are in a sink” and “person is actively washing dishes with running water.” That compositional reasoning is what makes this work.

Cost is way lower than traditional video APIs. Google Video Intelligence charges $6-9/hr. The VLM approach, with a prefilter that skips unchanged frames, runs about $0.03-0.05 per session.

Latency is the real limitation: 4-12 seconds per evaluation. That’s fine for watching a 10-30 minute task, but not for anything that needs instant responses.

The pipeline runs on Trio by IoTeX, which handles stream ingestion, frame prefiltering, and Gemini inference. It’s a BYOK model, so you bring your own API key.

I think “AI that watches and judges real-world events” is going to be a big category: insurance claims, remote inspections, quality control, security monitoring. The building blocks are all here now.

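
For anyone curious what the prefilter-then-judge loop might look like, here is a rough sketch. It is not the actual Trio or VerifyHuman code. The `Verifier` class, the `judge` callable (a stand-in for the Gemini call), the flattened-grayscale frame format, the mean-absolute-difference prefilter, the threshold of 8, and the "once satisfied, stays satisfied" rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence, Set

# Assumption: a frame is a non-empty, flattened list of grayscale pixels (0-255).
Frame = Sequence[int]


def frame_changed(prev: Optional[Frame], cur: Frame, threshold: float) -> bool:
    """Return True if cur differs enough from prev to be worth a VLM call."""
    if prev is None:
        return True  # always evaluate the first frame
    mean_abs_diff = sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)
    return mean_abs_diff > threshold


@dataclass
class Verifier:
    conditions: List[str]                   # plain-English conditions set by the agent
    judge: Callable[[Frame, str], bool]     # hypothetical stand-in for the VLM call
    threshold: float = 8.0                  # assumed prefilter sensitivity
    satisfied: Set[str] = field(default_factory=set)
    vlm_calls: int = 0                      # how many (paid) judge calls were made
    _last_sent: Optional[Frame] = None      # last frame that passed the prefilter

    @property
    def done(self) -> bool:
        """True once every condition has been confirmed at least once."""
        return all(c in self.satisfied for c in self.conditions)

    def observe(self, frame: Frame) -> bool:
        """Feed one frame; return True when escrow could be released."""
        # Prefilter: skip frames that look the same as the last one we judged.
        if not frame_changed(self._last_sent, frame, self.threshold):
            return self.done
        self._last_sent = frame
        # Only ask about conditions that are still open.
        for cond in self.conditions:
            if cond not in self.satisfied:
                self.vlm_calls += 1
                if self.judge(frame, cond):
                    self.satisfied.add(cond)
        return self.done
```

In use, you would call `observe()` on each sampled frame and trigger the escrow release the first time it returns `True`. Comparing against the last judged frame rather than the previous frame means slow, gradual changes still get through the prefilter eventually. The `vlm_calls` counter is there because the number of model calls is what drives the per-session cost.
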
What use cases do you think would benefit most from this?

Originally posted by u/aaron_IoTeX on r/ArtificialInteligence