I’m the solo developer; I built this over a weekend for Google’s Gemini Live Agent Challenge hackathon. I uploaded a photo of Mount Kilimanjaro. The AI identified it as a dormant stratovolcano, described its geological history, then generated an image of the volcanic eruption that built it, plus another showing what the mountain might look like after thousands of years of erosion.

Technical breakdown: the pipeline chains three Gemini models sequentially.

- Gemini 2.5 Flash receives the image and a persona prompt. It identifies the location, rock types, flora, and geological era, then writes the narration in a “park ranger storytelling” voice rather than a factual summary. Location identification is grounded in Google Search for accuracy.
- A second Gemini 2.5 Flash call takes the identification data and selects the most visually dramatic geological era for that specific location. It outputs JSON with a scene description. This is the key architectural decision: sending the raw narration (which mentions “magma” and “molten rock”) directly to the image model consistently produced generic lava. Separating era research from image rendering fixed this completely.
- Gemini 3 Pro Image Preview takes the clean scene description and generates a photorealistic landscape using an interleaved TEXT+IMAGE output modality. The same pipeline runs twice in parallel via asyncio.gather, once for the past and once for the future projection. Total latency is ~30-45s for both images.
- Gemini 2.5 Flash TTS converts the narration to natural speech.

Limitations:
- Image generation fails ~10% of the time, so I built a 3-model fallback chain (Pro Image → 3.1 Flash Image → 2.5 Flash Image)
- Geological accuracy depends on Gemini’s knowledge — it occasionally gets specific dates wrong by tens of millions of years
- No offline support — needs a network for all AI calls
- Progressive loading helps, but the full pipeline still takes 30-60 seconds

Lessons learned:
- Two-step generation (text plans the scene, image renders blind to geology terms) dramatically improved image quality
- Persona prompting (“campfire park ranger”) produces far more engaging output than generic instructions (“describe the geology”)
- Progressive disclosure is essential: show the narration at ~15s, load images in the background

Stack: FastAPI on Google Cloud Run, Next.js frontend, Google GenAI SDK (Python)
Repo: https://github.com/KrishnaSathvik/hackathongoogle
Live: https://trailnarrator.com/
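The parallel past/future rendering described in the technical breakdown boils down to running the same pipeline twice with asyncio.gather. Here is a minimal sketch of that pattern; `render_scene` is a stand-in stub for the real Gemini image call, and all names here are illustrative, not from the repo:

```python
import asyncio


async def render_scene(era: str, scene_prompt: str) -> dict:
    """Stub for one pipeline run (era research -> image generation).

    A real implementation would await the Gemini image model here.
    """
    await asyncio.sleep(0)  # simulate the network call
    return {"era": era, "image": f"<image for: {scene_prompt}>"}


async def render_both(past_prompt: str, future_prompt: str) -> list[dict]:
    # Both eras render concurrently, so total latency is roughly
    # max(past, future) rather than their sum.
    return await asyncio.gather(
        render_scene("past", past_prompt),
        render_scene("future", future_prompt),
    )


past, future = asyncio.run(render_both("ancient eruption", "eroded landscape"))
```

asyncio.gather preserves argument order, so the first result is always the past projection and the second the future one, regardless of which call finishes first.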
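The 3-model fallback chain from the limitations section can be sketched as an ordered retry. The model ids and the injected `generate_image` callable are assumptions for illustration; the real project calls the Google GenAI SDK directly:

```python
# Illustrative model ids, ordered best-first (names assumed, not confirmed).
MODEL_CHAIN = [
    "gemini-3-pro-image-preview",
    "gemini-3.1-flash-image",
    "gemini-2.5-flash-image",
]


def generate_with_fallback(scene_description, generate_image):
    """Try each model in order; return (model_id, image) from the first success.

    generate_image(model, prompt) should return image bytes or raise on failure.
    """
    last_error = None
    for model in MODEL_CHAIN:
        try:
            return model, generate_image(model, scene_description)
        except Exception as exc:  # a real implementation would narrow this
            last_error = exc
    raise RuntimeError(f"all image models failed: {last_error}")
```

With a ~10% per-call failure rate, three independent attempts drop the end-to-end failure probability to roughly 0.1% if failures are uncorrelated.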
Originally posted by u/peakpirate007 on r/ArtificialInteligence
