Original Reddit post

I think we have enough benchmarks that don't really capture the actual experience of using AI while programming. This test isn't standardized by any means, and I would've come up with something better than both approaches if I had the time/patience or if it gave me something good in return. Still, I think people focus too much on benchmarks even though hands-on reviews like this are abundant, so I may have something to contribute.

I sometimes use AI for prototypes of programs I might want to create/maintain in the future, mostly vibe-coded; then I check everything manually and rewrite. It's arduous, but I prefer it when you have to experiment, e.g. change each frontend feature 4 times because they interact in ways that are inconsistent/annoying, or when I want to test another feature on the backend. For other stuff they're good at finding bugs or answering questions like "where does x come from", "where is x", or "look for every component that variable x passes through and add a console log", etc.: classic AI-assistant stuff, as opposed to the vibe-coding of the first case. Most of this review is based on vibe-coding; both are great as assistants, though I'd say Claude pulls ahead when it comes to research, which you'll see is a recurring theme of the review.

As for my experience with prompting, I'm quite proficient: I've been doing it for 4 years, apparently doing inadvertently what are now considered good practices (except for writing stuff in CLAUDE.md or AGENTS.md, a feature I only learned about through posts). I also have a good record of finding vulnerabilities in these models and generating partial or full jailbreaks (mostly partial, because I tend to need specific things for cybersecurity). I've also been programming for more than a decade, so I think my opinion might be relevant to some people.
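For anyone who hasn't run into the CLAUDE.md / AGENTS.md feature mentioned above: both tools read a plain markdown file from the project root as standing instructions at the start of a session. A minimal sketch of what one might look like (the contents here are illustrative assumptions, not from this post):

```markdown
# CLAUDE.md

## Stack
- React frontend, Node.js backend
- Tailwind, Postgres/SQLite, Docker, Redis

## Conventions
- Ask before touching anything under migrations/
- Run the test suite after any backend change
- Prefer small, reviewable diffs over sweeping refactors
```

Keeping this file short matters: everything in it is loaded into context on every session, so it competes with the same context budget discussed later in the review.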
Anyway, I've been using Claude Code (all models) and Codex 5.3 EXTENSIVELY, and by that I mean roughly 2000-4000 prompts depending on how you count. I've used them on 2 projects at different stages of production, sometimes alternating between the two when a feature was particularly annoying to vibe-code. The stack was React with Node.js (and the usual Tailwind, Postgres/SQLite, Docker, Redis, etc., classic webdev stuff). The apps are a social network and a music sample organizer (look up Sononym and XO Sample and you'll get a rough idea; it's like a fusion of both).

Anyway, let's get to the models. Claude overall feels more concise, though it can get stuck in thought loops only to resolve them trivially on the next prompt or in a new conversation. Codex can start hallucinating, or even change the task, if it stays running long enough; context rot is bad for both, but I'd say Codex is hurt the most. It also gets stuck from time to time, though its thinking process isn't as transparent, so I don't know what's going on in there.

Advice: with both models, if they're stuck, stop them and just tell them "continue with your previous instruction" so they get out of the loop. Running /compact from time to time is mandatory, especially for Codex; I'd say about 30% of the context window is the safe limit for both models. Sometimes you need more, and I feel Claude tends to handle that better, though Codex can sometimes be surprisingly good. While Codex has its moments of brilliance, Claude is more consistent (it fails from time to time like all models; it's not bulletproof). Claude (Opus and Sonnet) takes a more cautious approach, makes fewer presuppositions, and feels more "formal" overall.
Thanks to its better research capabilities, Claude works better when you're trying to implement something unorthodox, though every model has places where it gets stuck while others have no problem, and Claude is no exception. Claude also allows a more strategic choice of models, while the models in Codex feel like "nerfed" versions of the prime one (5.3). Sonnet is more useful in scenarios where Opus might overthink or be glacially slow. Haiku is decent with thinking activated, and Sonnet 4.6 with thinking mode should get you through 90% of tasks. I didn't experiment much with this part on Codex because I kept having bad experiences whenever I switched to "lower-quality" models. For riskier, high-effort tasks like huge refactors I'd recommend Opus over anything else; it can "get it" after 2-3 iterations with some small debugging here and there.

Codex's quality can improve substantially with addons, and the same can be said for Claude. With addons/extensions/plugins (however you want to call them), Codex unsurprisingly becomes a beast (some Vercel ones come to mind). I'm undecided when it comes to comparing these capabilities, though, since you can't use them separately from the rest of the model's pipeline; if there's a difference, it isn't significant enough for me to notice. I haven't plugged them into other tools, so no review from me there. My testing on deploying "swarms" of agents from a single prompt feels better with Claude due to its better planning capabilities, though I might be biased. That said, this is going to hurt your wallet if you decide to go with Anthropic. Codex wins cost-wise, which seems to be a recurring theme. Both models work better as a VSCode extension (I'd guess the same holds for whatever editor you use) because the more limited file view makes them more focused on what they need to do; open whichever specific file you need to edit and save yourself a ton of tokens/time.
I wouldn't recommend using either of them with Cline; it tends to get stuck way too frequently. As a final verdict: Claude is better when it comes to planning, and Codex is a lot better when it comes to price, HOWEVER, be careful, because it might use more tokens than it needs to perform a simple task, just like Opus does. We're at a pretty decent state right now: they're good for vibe-coding prototypes that don't need polishing or for small functionalities, and they work very well as assistants...

(rant incoming) ...however, I don't understand why company owners are even considering firing people, because these tools are completely unable to replace a human. They might boost the productivity of one SE who uses them carefully, but a company environment isn't the same as "I'm vibe-coding a prototype for a side project that will make my life easier". I think it's total insanity, and they should put their feet back on the ground. I don't want to give anyone ideas, but given that AI-produced code seems to be less secure, and some of the people being fired might know a thing or two about cybersec (the entry barrier has definitely been lowered), chances are some of them will do not-so-nice things; but that's just my speculation. If the job market is saturated and you have the knowledge to do that, it doesn't take many people being put into that situation for it to escalate and become a problem. Whatever, I just needed to rant about this practice because I think it's unethical, delusional, and completely irresponsible (ignore the revenge-hacking stuff, that's pure speculation). Let's hope for the best that this isn't another scenario where the "good enough" (10 engineers vibe-coding) replaces the "best" (50 engineers with assistants/tools); that's just unfeasible. Have a good one.

Originally posted by u/Velascu on r/ArtificialInteligence