I’ve been following vibe-coded output for a while, and the way people evaluate it is broken. Big claims disappear behind code dumps. There’s rarely a measurable outcome; most of it is hype and speculation, and how well the tools scale on real codebases varies wildly depending on who you ask. The people who say they shipped something don’t share the process. They optimize for sensational headlines and skip everything that would let you grade the work. Testing a random app, a SaaS dashboard, or a website tells you almost nothing about model quality. They all converge on the same look, or they bolt on a useless 3D scene to seem impressive and tank performance doing it. You’re grading templates, not the model.

Vibe Your Way Here

Games are what’s left. A game is the cleanest test I can think of for current AI: visuals and mechanics get exercised at the same time, and you can grade the result at a glance. You don’t need anyone to walk you through their process, because a game is the sum of a lot of moving parts, and even someone who has never touched gamedev can feel whether it’s any good.

So I wanted to see how far I could push current models. One month, one working tycoon game, running in the browser. The premise leans into the joke: it’s a tycoon where you run a vibe-coding studio, shipping the same small projects vibe coders rebuild for the thousandth time: habit apps, todo apps, that whole genre. Which is what vibe coding actually is in practice: burning tokens to redo solved problems and hoping the model makes smart choices in the middle.

Stack: Cursor (GPT-5.4 high) for almost all the coding, Gemini 3.1 for assets, Claude Opus 4.6 for specific refinements like lighting. Nothing else.

I don’t normally believe that one trivially simple trick changes the outcome of a real project. The “one quote that changed my life” genre is nonsense to me, and I’d be skeptical reading this if someone else wrote it. But AI work is structurally different: the medium is effortless generation and slop, and small process choices seem to compound far more than they should.

The trick: Gemini in Canvas mode, one-shot. Gemini is mediocre at coding and at most other things, but in Canvas, asked to one-shot something visual or stylistic, the outputs are surprisingly strong, and the art styles you can pull out of it are ones the other frontier models simply won’t give you. I assume that’s downstream of training data.

The method: open ten tabs of Gemini 3.1 Canvas, run the same prompt in parallel, pick the one that hits, and iterate on it with the other models (there’s a rough scripted sketch of this loop at the end of the post). That’s the whole thing. Every visual decision in the game went through that loop: the main city scene, the UI, the juicy micro-animations, the three.js offices. Ten variants, pick the strongest, hand the winner to Codex to wire it into the project, then sometimes pass it through Opus for refinement (lighting was the big one).

The selection step is doing more work than people give it credit for. Most of the gain isn’t any individual model being smart; it’s refusing to settle for the first output. Run wide, select aggressively, integrate with Codex.

One more thing: everything you see in the game is 100% AI generated. No external assets, no asset packs, no stock art. The only exceptions are a few AI-generated images and some AI-generated 3D robots.
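For anyone who would rather script the run-wide-and-pick loop than juggle browser tabs, here’s a minimal sketch of the fan-out step. I did this by hand in Canvas, so treat everything below as illustrative: `generate_variant`, the prompt text, and the output folder are placeholders you’d swap for your own model client and project, not what I actually ran.

```python
# Sketch of the "run the same prompt N times, select by hand" loop.
# generate_variant() is a hypothetical placeholder; swap in whatever model client you use.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

N_VARIANTS = 10  # same prompt, ten independent one-shot attempts (the "ten tabs")

PROMPT = (
    "One-shot a single self-contained HTML file: an isometric city scene "
    "for a tycoon game about a vibe-coding studio. Distinctive art direction."
)

def generate_variant(prompt: str, attempt: int) -> str:
    # Placeholder: call your model of choice here and return its raw output.
    # Kept as a stub so the fan-out/selection structure runs on its own.
    return f"<!-- variant {attempt} for prompt: {prompt[:40]}... -->"

def main() -> None:
    out_dir = Path("variants")
    out_dir.mkdir(exist_ok=True)

    # Fan out: fire the identical prompt N times in parallel, like opening N tabs at once.
    with ThreadPoolExecutor(max_workers=N_VARIANTS) as pool:
        results = list(pool.map(lambda i: generate_variant(PROMPT, i), range(N_VARIANTS)))

    # Selection stays manual: dump every variant to disk and compare them side by side.
    for i, html in enumerate(results):
        (out_dir / f"variant_{i:02d}.html").write_text(html)
    print(f"Wrote {len(results)} variants to {out_dir}/ -- keep the one that hits.")

if __name__ == "__main__":
    main()
```

The script makes the same point as the post: generation is cheap, so the leverage is in producing many independent attempts and being ruthless about which one survives to the integration step.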
Originally posted by u/Feisty_Advantage_597 on r/ArtificialInteligence
