I’m trying to write an agent that analyzes a video and writes a detailed narrative of what’s happening, including dialogue.

The dialogue part is easy: just transcribe the audio by timestamp. The video part is hard. Here’s what I’ve done so far:

1. Distill the video into n frames per second. I went with 5 at first.
2. Load each frame and verbally describe what’s going on.
3. Compare the verbal summaries of consecutive frames to see what changed from frame to frame.
4. Take all of the above information and write a detailed narrative of what happens, the way our brain does when we watch a movie or just look around at the world.

(Rough code sketches of steps 1–3 at the bottom of the post.)

The problems I’m running into:

- It takes a ton of processing time to analyze one frame. A picture is worth a thousand words, as they say.
- Even if you do that, it’s hard to identify what materially changed from frame to frame. You have to separate material changes from immaterial ones.

It’s amazing that our brain does those first two steps (per-frame analysis and change detection) something like 60 times a second. What the actual f***, honestly.

I had the idea of replicating whatever Elon does for self-driving, since Tesla has to do the above to “see” the road, but I learned it doesn’t work that way: the system first distills the world into a few polygons and then “sees” those. Completely different thing. A picture is worth a thousand words, but a few polygons definitely are not.

Any ideas, wise AI people? TIA.
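For concreteness, here’s roughly what steps 1–2 look like in code. This is a minimal Python sketch under my assumptions: ffmpeg is installed for the frame sampling, and `describe_frame` is a placeholder for whatever vision model you’d actually call, not a real API.

```python
import subprocess
from pathlib import Path


def extract_frames(video_path: str, out_dir: str, fps: int = 5) -> list[Path]:
    """Step 1: sample the video at `fps` frames per second using ffmpeg."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-vf", f"fps={fps}",
         str(out / "frame_%06d.jpg")],
        check=True,
    )
    return sorted(out.glob("frame_*.jpg"))


def describe_frame(frame_path: Path) -> str:
    """Step 2 (placeholder): send the frame to a vision model, return its description."""
    raise NotImplementedError("call your vision model here")


def describe_video(video_path: str, fps: int = 5) -> list[tuple[float, str]]:
    """Return (timestamp_seconds, description) pairs, one per sampled frame."""
    frames = extract_frames(video_path, "frames", fps=fps)
    return [(i / fps, describe_frame(f)) for i, f in enumerate(frames)]
```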
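And here’s a sketch of step 3, comparing consecutive summaries to keep only the material changes. Using embedding cosine similarity as the change metric is my own assumption (the post above just says “compare the summaries”), `embed` is a placeholder for whatever text-embedding model you have, and the threshold would need tuning.

```python
import numpy as np


def embed(text: str) -> np.ndarray:
    """Placeholder: return a unit-norm embedding of a frame description."""
    raise NotImplementedError("call your text-embedding model here")


def material_changes(
    described: list[tuple[float, str]], threshold: float = 0.9
) -> list[tuple[float, str]]:
    """Keep only frames whose description differs materially from the last
    kept frame, i.e. cosine similarity drops below `threshold`."""
    kept = [described[0]]
    prev_vec = embed(described[0][1])
    for ts, desc in described[1:]:
        vec = embed(desc)
        # unit-norm vectors, so the dot product is the cosine similarity
        if float(np.dot(prev_vec, vec)) < threshold:
            kept.append((ts, desc))
            prev_vec = vec
    return kept
```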
Originally posted by u/Hennen_Crus on r/ArtificialInteligence
