Original Reddit post

So Salesforce built CRMArena-Pro as part of a research study and launched it around this time last year (Salesforce Research, arXiv 2505.18878). They tested leading agents on real CRM work across 4,280 queries and 9 top models. The results were ~58% success on single-turn tasks, dropping to ~35% the moment the task went multi-turn.

I know this seems like old news given how fast the AI space is moving, but I believe the results for AI agents today would be either the same or only marginally better, meaning companies running AI agents will still pay for the errors their agents make.

The expensive realization is that it’s not really the model, it’s the length of the autonomous runs. The longer one agent runs, the more early context it loses and the more small errors compound. Leading AI players are combating this problem by expanding context windows to handle up to 1-2 million tokens, introducing various memory frameworks, and adding techniques like RAG. The problem still persists, especially for long-running agents and workflows that need to run 24/7.

I’m not selling anything here, but what’s working for me in production is pretty boring and reliable. I don’t hand one agent the whole job. Instead I break the overall job into narrow stages, one job per stage, with a plain-text handoff between stages and a checkpoint where I can verify outputs before the next stage runs. A stage that researches doesn’t also write. A stage that writes doesn’t also send. The agent never has to run long enough to drift, and at every stage I can catch bad outputs before they compound.

Has anyone else found that scoping context per stage beats prompt-engineering one mega-agent? Where’s the line for you between “split into stages” and “this can run end-to-end”?
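For anyone who wants to picture the staged setup described above, here is a minimal sketch in Python. It is only an illustration, not the exact production code: the names `Stage`, `run_pipeline`, `CheckpointFailed`, and the three stand-in stage functions are all hypothetical, and the stage bodies just build strings where a real setup would call an agent or model. The checkpoints here are automatic checks; a manual review step could sit in the same place.

```python
from dataclasses import dataclass
from typing import Callable, List


class CheckpointFailed(Exception):
    """Raised when a stage's output does not pass its checkpoint."""


@dataclass
class Stage:
    name: str
    run: Callable[[str], str]     # takes the previous plain-text handoff, returns a new one
    check: Callable[[str], bool]  # checkpoint: True if the output is safe to pass on


def run_pipeline(stages: List[Stage], task: str) -> str:
    """Run stages in order, stopping at the first failed checkpoint."""
    handoff = task
    for stage in stages:
        output = stage.run(handoff)
        if not stage.check(output):
            # Stop here so a bad output never reaches the next stage.
            raise CheckpointFailed(f"stage '{stage.name}' produced a bad output")
        handoff = output  # the only thing the next stage sees
    return handoff


# Hypothetical stand-in stages; each does exactly one job.
def research(task: str) -> str:
    return f"NOTES: key facts about {task}"


def write(notes: str) -> str:
    return "DRAFT: " + notes[len("NOTES: "):]


def send(draft: str) -> str:
    return "SENT: " + draft[len("DRAFT: "):]


STAGES = [
    Stage("research", research, lambda out: out.startswith("NOTES: ")),
    Stage("write", write, lambda out: out.startswith("DRAFT: ")),
    Stage("send", send, lambda out: out.startswith("SENT: ")),
]
```

Calling `run_pipeline(STAGES, "Q3 renewal email")` walks the three stages in order. Each stage only ever sees the previous stage's plain-text handoff, so its context stays small, and a failed check raises `CheckpointFailed` before the next stage runs.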

Originally posted by u/Sensitive_Judge_5502 on r/ArtificialInteligence