tl;dr: I’m an independent researcher and this is my first paper. I spent the last couple of months measuring whether a single LLM is actually good enough to review code on its own, or whether you need a few different ones. Anecdotally, I had the sense that I was getting significant returns from running a mixed set of LLMs as parallel code reviewers. I always save the full output of every code review from each individual reviewer, and I also record which findings are legitimate and which are not. Together, that data gave me what I needed to perform the analysis. Short version: one model misses a lot.

Full paper is here: https://doi.org/10.5281/zenodo.20519584

I’d really appreciate people picking apart the methodology, and if anyone here can endorse on arXiv, I’m trying to get this posted to cs.SE and could use a hand.

The setup: a software team ran every code review through 2 to 4 different LLMs separately, then a human went through and reconciled all the findings into one list of what was actually wrong. I used that list as the answer key and scored how many of the real, confirmed defects each model caught. The dataset: 18 code artifacts, 154 confirmed defects, 8 model versions across 5 providers.

What I found:

No single model got above about 64% recall on the confirmed defects, and a typical one caught roughly half.

Over half of the defects (56.5%) were caught by only one of the models.

The models mostly weren’t finding the same bugs (median pairwise Jaccard overlap was about 0.37).

Adding providers one at a time, coverage went from 33.6% with one to 57.1% with two, 74.6% with three, and 88.7% with four. The biggest single gain comes from adding a second model from a different provider.

The practical version: don’t lean on one model for code review. Run two or three different ones independently, have a human reconcile the results and check them against the actual source, and expect any single model to catch only about half to two thirds of the real defects.
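
For anyone who wants to sanity-check the kind of numbers quoted above, here is a minimal sketch of the three metrics: per-model recall with a Wilson interval, pairwise Jaccard overlap, and coverage as reviewers are added. Everything in it is an assumption made for illustration: the toy data is invented, the names (`caught`, `mean_coverage`, `model_a` and so on) are hypothetical, and averaging coverage over every subset of k reviewers is one plausible way to build the curve, not necessarily what the paper does.

```python
from itertools import combinations
from math import sqrt

# Invented toy data, NOT the paper's data: 10 confirmed defects and the
# subset of them that each hypothetical reviewer caught.
confirmed = set(range(1, 11))
caught = {
    "model_a": {1, 2, 3, 4, 5},
    "model_b": {4, 5, 6, 7},
    "model_c": {1, 7, 8},
}


def recall(found, truth):
    """Fraction of confirmed defects that a reviewer caught."""
    return len(found & truth) / len(truth)


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for the proportion hits/n (z=1.96 is about 95%)."""
    if n == 0:
        return (0.0, 0.0)
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (center - half, center + half)


def jaccard(a, b):
    """Size of the intersection over size of the union of two finding sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_coverage(caught_by, truth, k):
    """Average share of confirmed defects caught by the union of k reviewers,
    taken over every possible choice of k reviewers (an assumed definition)."""
    subsets = list(combinations(caught_by, k))
    total = 0.0
    for names in subsets:
        union = set().union(*(caught_by[name] for name in names))
        total += len(union & truth) / len(truth)
    return total / len(subsets)


for name, found in caught.items():
    hits = len(found & confirmed)
    low, high = wilson_interval(hits, len(confirmed))
    print(f"{name}: recall={recall(found, confirmed):.2f} CI=({low:.2f}, {high:.2f})")

for a, b in combinations(caught, 2):
    print(f"Jaccard({a}, {b}) = {jaccard(caught[a], caught[b]):.2f}")

for k in range(1, len(caught) + 1):
    print(f"{k} reviewer(s): mean coverage = {mean_coverage(caught, confirmed, k):.2f}")
```

With only 10 defects the Wilson intervals come out very wide, which is the point of reporting them alongside recall: a small answer key makes any single recall figure soft.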
What I’m hoping for:

Feedback on the method and the stats (recall with Wilson intervals, the Jaccard overlap, the coverage curve). Tell me what’s weak.

An arXiv endorsement. As a first-time submitter I need one already-published author (3+ cs.* papers in the last 5 years) to endorse me for cs.SE. It takes about two minutes, and you’re not vouching for the paper, just that I’m a real person. If you’re open to it, comment or DM and I’ll send my endorsement code privately. Happy to let you read the paper first.
Originally posted by u/qu1etus on r/ArtificialInteligence
