eifachposte

Every model gets the same brief: build one small but complete web app from a single detailed spec, then it is graded the same way. The task deliberately spans several areas at once, so a top score needs all of them working together:

A web service: accept requests and return the correct responses.
Stored data: save information and read it back reliably.
A cache: reuse recent results and refresh them when the data changes.
Activity logs: record what happened, in the required format.
A web page: a working interface people can use in the browser.
Reliability and safety: stay correct under many requests at once, and guard against common security holes.

Scoring is by automated tests plus independent AI judges. Higher scores are better.

How to read this table

Implementer: The AI model that wrote the code.
Helper: A second AI model that reviewed the code and gave feedback between tries.
Evaluator: The AI model that graded this run’s code quality.
Gate: What decided the run was finished. There are three kinds:
  completion-cmd: Stops as soon as the automated tests pass; the helper only steps in if they fail.
  completion-cmd-advisory: Tests must pass and the helper-reviewer must also approve before it stops.
  promise: No tests; the helper-reviewer alone decides when the work is done.
Iters: How many write-then-review rounds the run took.
Walltime: How long the run took, in minutes.
Score: Final quality grade as a percentage (out of 90 points; higher is better).

Run settings

All runs share the same harness setup:

Same task: every model builds the same app from the same detailed spec.
Max rounds: up to 5 write-then-review iterations (a run can stop earlier; see Gate).
Time cap per call: up to ~90 minutes per model call, so slow, heavy-reasoning models can finish.
Pause between rounds: 10 seconds.
Retries: up to 3 attempts per call; the run stops if 3 rounds fail in a row.
Scoring: 4 independent AI judges grade the final code on a 90-point scale; the table shows the lowest (strictest) of the four.

Results submitted by /u/lrsaturnin9
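For anyone who wants the gate and scoring rules spelled out, here is a rough Python sketch of how they could be read. It is not the actual harness code: the function names, the boolean inputs, and the rounding are assumptions made for illustration, and converting the lowest judge grade to a percentage by dividing by 90 is one plausible interpretation of the Score column.

```python
MAX_POINTS = 90  # judges grade on a 90-point scale (stated in the post)


def run_is_finished(gate: str, tests_pass: bool, reviewer_approves: bool) -> bool:
    """Hypothetical stop rule for one round, following the three gate kinds."""
    if gate == "completion-cmd":
        # Stops as soon as the automated tests pass.
        return tests_pass
    if gate == "completion-cmd-advisory":
        # Tests must pass and the helper-reviewer must also approve.
        return tests_pass and reviewer_approves
    if gate == "promise":
        # No tests; the helper-reviewer alone decides.
        return reviewer_approves
    raise ValueError(f"unknown gate: {gate!r}")


def final_score_percent(judge_points: list[float]) -> float:
    """Take the lowest (strictest) judge grade and express it as a percentage.

    Assumption: percentage = lowest grade / 90 * 100, rounded to one decimal.
    """
    if not judge_points:
        raise ValueError("need at least one judge grade")
    return round(100 * min(judge_points) / MAX_POINTS, 1)


if __name__ == "__main__":
    # Made-up judge grades for illustration only, not results from the post.
    print(final_score_percent([81, 78, 84, 72]))  # 72 / 90 -> 80.0
    print(run_is_finished("completion-cmd-advisory", True, False))  # False
```

Taking the minimum rather than the mean means a single harsh judge sets the reported score, which is why the table calls it the strictest of the four.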

Originally posted by u/lrsaturnin9 on r/ClaudeCode

GLM 5.2 personal benchmark. Results comparable with Fable, Opus 4.8, and GPT 5.5
