Original Reddit post

There’s a gap in most “building agents” tutorials: they show you how to build an agent that works locally and call it done. The gap between “works locally” and “runs reliably in production” is significant and not discussed enough.

Local agents: in-memory state, single process, you’re watching, easy to debug.

Production agents: external state persistence required, need automatic retries on rate limits and timeouts, need observability so you know when they fail, need graceful handling of partial task completion if the process dies.

The code changes are actually not that dramatic. The architecture changes are. Where does state live between steps? What happens when the model returns a malformed response? How do you handle the case where step 3 of a 5-step task succeeds but step 4 fails? You need answers to all of these before you put something in production.

What’s been your biggest unexpected production concern with agents? State management, retry logic, observability, something else?
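To make the state, malformed-response, and "step 3 succeeded but step 4 failed" questions concrete, here is a minimal sketch of one possible answer: checkpoint after every step to external storage, retry transient failures with backoff, and skip already-completed steps on restart. Everything named here is hypothetical and for illustration only: `TransientError` stands in for whatever rate-limit or timeout exception your model client raises, the JSON file stands in for a real store (database, Redis, object storage), and `run_task`, `with_retries`, and `parse_model_json` are made-up helper names. It also assumes steps are safe to re-run if the process dies between finishing a step and saving the checkpoint.

```python
import json
import random
import time
from pathlib import Path


class TransientError(Exception):
    """Hypothetical stand-in for a rate limit, timeout, or malformed reply."""


def parse_model_json(text):
    """Treat unparseable model output as a retryable failure."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        raise TransientError("malformed model output") from exc


def with_retries(fn, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying on TransientError with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up and let the caller see the failure
            sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1))


def load_state(path):
    """Read the checkpoint, or start fresh if none exists."""
    if path.exists():
        return json.loads(path.read_text())
    return {"completed": [], "results": {}}


def save_state(path, state):
    """Write to a temp file, then rename over the old checkpoint,
    so a crash mid-write does not leave a half-written file."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state))
    tmp.replace(path)


def run_task(steps, path, sleep=time.sleep):
    """Run (name, fn) steps in order, skipping steps a previous run finished.

    Each fn receives the dict of earlier results and returns a
    JSON-serializable value.
    """
    state = load_state(path)
    for name, fn in steps:
        if name in state["completed"]:
            continue  # finished before a crash or failure; do not redo it
        result = with_retries(lambda: fn(state["results"]), sleep=sleep)
        state["results"][name] = result
        state["completed"].append(name)
        save_state(path, state)  # checkpoint after every successful step
    return state
```

Calling `run_task(steps, Path("task-123.json"))` again after a failure picks up at the first step that is not in `completed`, so steps 1 to 3 are not repeated when step 4 is what broke. Observability is deliberately left out of the sketch; in practice you would at least log each attempt and each checkpoint write.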

Originally posted by u/EastMove5163 on r/ClaudeCode