Original Reddit post

I tested Gemini 3.5 Flash, running it through around 10 saved evals I use for model selection decisions in production. So far, the result is not what I expected. On most of my tasks, Gemini 3.5 Flash underperformed older Gemini variants.

The screenshot below shows a vision emotion-detection eval with 5 runs per model. In this eval it ended way down in 13th place, even though 3.1-pro and 3.1 flash lite are top 1 & 2. It's actually even lower than gemini 3 flash. It's 10x more expensive than flash lite for a worse result. It's an average of 5 runs, so it's not a one-time fluke. On top of that, this is 1 of 10 benchmarks with similar outcomes, although admittedly this is one of the worst cases.

https://preview.redd.it/e87e67lm752h1.png?width=2750&format=png&auto=webp&s=93e7820e8d6f5cc832c0b756ed27ff00f2c21ae9

I ran this via an online benchmarking tool.

I'm not claiming this means Gemini 3.5 Flash is bad universally. These are my saved evals, and Gemini, like any model, can be prompt-sensitive. But for my workflows, these benchmarks unfortunately indicate that I can't use it as is. I really hope that this is something that will change, because I had high expectations for this model given their previous release.

To me it just goes to show that Artificial Analysis and other generic benchmarks can really be misleading when it comes to model decisions. From the results they were showing, I was expecting much better…
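For anyone who wants to sanity-check a "5 runs per model" ranking like the one described above, here is a minimal sketch of averaging per-model scores and sorting by the mean, with the standard deviation alongside to show how noisy each model was. The function name `rank_models`, the model labels, and all scores are hypothetical placeholders; they are not the numbers from the screenshot and not the benchmarking tool's actual method.

```python
from statistics import mean, stdev


def rank_models(runs_by_model):
    """Return (model, mean, stdev) tuples sorted by mean score, best first."""
    summary = [
        (model, mean(scores), stdev(scores) if len(scores) > 1 else 0.0)
        for model, scores in runs_by_model.items()
    ]
    return sorted(summary, key=lambda row: row[1], reverse=True)


# Placeholder scores for illustration only (assumed, NOT real eval results).
example_runs = {
    "model-a": [0.80, 0.82, 0.78, 0.81, 0.79],
    "model-b": [0.60, 0.75, 0.55, 0.70, 0.65],
    "model-c": [0.90, 0.88, 0.91, 0.89, 0.92],
}

for place, (model, avg, spread) in enumerate(rank_models(example_runs), start=1):
    print(f"{place}. {model}: mean={avg:.3f} stdev={spread:.3f}")
```

A large standard deviation relative to the gap between two means would suggest that 5 runs may not be enough to separate those two models, while a small one supports the "not a one-time fluke" reading.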

Originally posted by u/Rent_South on r/ArtificialInteligence