Original Reddit post

I built memebench, a benchmark site where LLMs get real daily news headlines, generate memes using Imgflip templates, and people vote A/B style without seeing which model made which meme. It's here: https://memebench.net/

Right now it benchmarks 20 recent major models, including GPT-5.5/mini/nano, Claude, Gemini, Grok, and others. Headlines come from a few dozen RSS feeds and get processed daily by an AI pipeline. I sometimes do a manual pass over the shortlist before generation runs, but even if I don't, the whole system, including the headline selection mechanism, is fully automatic.

A lot of the results are kinda bad. Some I personally find genuinely funny, which is basically why I kept building it. The leaderboard is disabled until there are enough votes to make it less meaningless, because right now it's basically just my own votes from the past ~2 weeks of development.

The repo is public under the MIT license. You can also find a more in-depth writeup there on exactly how the benchmark works.

This started with me playing around with OpenRouter and trying to get LLMs to generate actually funny memes. A few weeks later, here we are. All feedback welcome of course :)
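For readers curious how blind A/B votes can turn into a leaderboard that stays hidden until there is enough data, here is a minimal sketch. It is not memebench's actual implementation (the repo writeup is the place for that). The plain win-rate ranking, the `MIN_VOTES` threshold, and the function name `leaderboard` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical threshold: below this many total votes, no leaderboard is shown.
MIN_VOTES = 30


def leaderboard(votes, min_votes=MIN_VOTES):
    """Rank models by win rate from blind A/B votes.

    votes: list of (winner_model, loser_model) tuples, one per vote.
    Returns None while there are too few votes (leaderboard disabled),
    otherwise a list of (model, win_rate) sorted best first.
    """
    if len(votes) < min_votes:
        return None

    wins = defaultdict(int)
    games = defaultdict(int)
    for winner, loser in votes:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1

    rates = [(model, wins[model] / games[model]) for model in games]
    # Sort by win rate descending, then by name for a stable order on ties.
    return sorted(rates, key=lambda item: (-item[1], item[0]))
```

A real ranking would more likely use something like Elo or Bradley-Terry so that beating a strong model counts for more than beating a weak one; win rate is just the simplest thing that shows the gating idea.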

Originally posted by u/thegentlecat on r/ArtificialInteligence