I ran a pre-registered study measuring how much MMLU accuracy shifts when you change only the prompt format, not the question, across three open-weight models (Qwen-2.5-7B, Llama-3.1-8B, Gemma-2-2B): ~15,000 queries run locally via Ollama. Ten templates, same items, temperature 0. Posting the methods lesson because it caught me, and I suspect it’s quietly inflating other people’s sensitivity numbers too.

The initial headline was a STEM “flip rate” north of 90%, meaning an item’s correct/incorrect status changed across formats that often. Looked dramatic. It was partly an artifact.

The problem: parse failures. When a model’s output doesn’t match your answer-extraction regex, you have to decide what that counts as, and that decision is rarely pre-registered. Strict parsers fail at different rates across templates, which is partly what “format sensitivity” even is. So if a parse failure counts as a flip, your sensitivity metric is partly measuring your own regex’s brittleness, not the model’s.

What I’d flag for anyone running this kind of eval:

- Pin the parse-failure rule before you run anything. We pre-registered “parse failure = incorrect” (it reflects real-world usability) and reported per-template parse-failure rate as a separate secondary outcome. Deciding this after seeing the data is a researcher degree of freedom that can move the headline number.

- Report a parse-corrected metric alongside the raw one. We added a robustness metric that separates “the model changed its actual answer” from “our parser choked.” The STEM effect largely survived the correction, but it was meaningfully smaller than the raw number, and the raw number is the one that would’ve gone in an abstract.

- Per-template parse-failure rates belong in the paper, not a footnote. If template A fails to parse 4% of the time and template B 22%, that gap is doing visible work in any naive flip metric.

- Temp 0 + a fixed seed does not guarantee bit-identical outputs across hardware/runs in Ollama, so log raw completions as ground truth rather than trusting the parsed labels to reproduce.

The broader point: format sensitivity is real and worth studying, but a chunk of the scariest published numbers may be extraction-pipeline brittleness wearing a model-behavior costume. Separating the two requires committing to the parsing rule in advance.

Happy to get into the template design or the super-category stats in the comments (we used MMLU’s four super-categories for the confirmatory test, not all 57 subjects; cell counts were too small).
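
For concreteness, here is a minimal Python sketch of how a raw flip rate, a parse-corrected flip rate, and per-template parse-failure rates could be computed side by side. It is illustrative only and not the study's code: the regex, the record layout, the function names, and the choice to drop items with fewer than two parseable outputs from the corrected metric are all assumptions.

```python
import re
from collections import defaultdict

# Hypothetical strict extractor: only accepts "Answer: <letter>" style output.
ANSWER_RE = re.compile(r"\bAnswer:\s*([ABCD])\b", re.IGNORECASE)


def parse_answer(completion):
    """Return the extracted answer letter, or None on parse failure."""
    match = ANSWER_RE.search(completion)
    return match.group(1).upper() if match else None


def flip_metrics(records):
    """records: iterable of (item_id, template_id, gold_letter, raw_completion)."""
    by_item = defaultdict(list)
    attempts = defaultdict(int)
    failures = defaultdict(int)

    for item_id, template_id, gold, completion in records:
        parsed = parse_answer(completion)
        attempts[template_id] += 1
        if parsed is None:
            failures[template_id] += 1
        by_item[item_id].append((parsed, gold))

    raw_flips = 0
    corrected_flips = 0
    corrected_items = 0
    for outcomes in by_item.values():
        # Raw metric: parse failure counts as incorrect (None never equals gold).
        raw_status = {parsed == gold for parsed, gold in outcomes}
        if len(raw_status) > 1:
            raw_flips += 1

        # Corrected metric (assumed definition): ignore unparseable outputs and
        # require at least two parseable ones before an item can count at all.
        parsed_only = [(p, g) for p, g in outcomes if p is not None]
        if len(parsed_only) >= 2:
            corrected_items += 1
            if len({p == g for p, g in parsed_only}) > 1:
                corrected_flips += 1

    return {
        "raw_flip_rate": raw_flips / len(by_item),
        "corrected_flip_rate": (
            corrected_flips / corrected_items if corrected_items else float("nan")
        ),
        "parse_failure_rate": {t: failures[t] / attempts[t] for t in attempts},
    }


# Toy data: q1 "flips" only because template t2's output fails to parse.
records = [
    ("q1", "t1", "B", "Answer: B"),
    ("q1", "t2", "B", "The correct option is B."),
    ("q2", "t1", "C", "Answer: C"),
    ("q2", "t2", "C", "Answer: A"),
    ("q3", "t1", "D", "Answer: D"),
    ("q3", "t2", "D", "Answer: D"),
]
metrics = flip_metrics(records)
print(metrics)
```

In the toy data, q1 counts as a flip under the raw rule even though the model picked the right option both times, which is exactly the inflation described above. The denominator choice for the corrected metric is itself a degree of freedom, so it would need to be fixed in advance along with the parse-failure rule.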
Originally posted by u/Magayone on r/ArtificialInteligence
