Original Reddit post

Genuine question that turned into a small rant, so please bear with me. We’ve been running a few MCP servers feeding tools to an agent in production for a few months, and the debugging experience is the part nobody warned me about. When a normal REST call fails you get a status code, a stack trace, something. When an MCP tool call goes wrong the failure modes are weirder and quieter.

The ones that have bitten us: the tool succeeds with no error but returns something subtly malformed, and the model just incorporates the garbage and keeps going, so you find out three steps later when the final output is wrong. Or the model decides not to call a tool it should have because the description was slightly ambiguous, and there’s no error anywhere because nothing technically went wrong. Or an intermittent timeout hands the model a partial result and it confidently fills in the rest. Or the call is fine but the model passes the wrong arguments because two tools had similar descriptions.

The common thread is that none of these throw. Your normal logging and APM see a successful agent run. The badness is semantic, not in the status codes.

What’s helped us, roughly in order of impact: logging the full tool call, meaning the name, the exact arguments the model chose, the raw response, and the reasoning step around it, because the arguments-the-model-chose part is where most of our bugs actually lived. Treating the whole run as a trace with each tool call as a child span, so I can see the sequence and where it diverged instead of grepping flat logs. We use Langfuse for that mostly because it ingests OpenTelemetry and we didn’t want a parallel logging stack, though anything with span-level tool visibility would do the same job. And an eval that checks tool outputs for shape, not just whether the call returned, so the malformed-but-successful case gets caught at the source. But I still feel like we’re improvising.
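For anyone who wants something concrete, here is a minimal sketch of two of the ideas above: recording the full tool call (name, chosen arguments, raw response) as a span-like record, and checking the response for shape at the source. It is plain standard-library Python, not Langfuse or an MCP SDK; `ToolCallRecord`, `check_shape`, `traced_call`, the `required` schema format, and the fake weather tool are all hypothetical names invented for illustration.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ToolCallRecord:
    """One tool call, captured like a child span of the agent run (hypothetical)."""
    name: str
    arguments: dict
    raw_response: Any = None
    duration_s: float = 0.0
    shape_errors: list = field(default_factory=list)


def check_shape(response: Any, required: dict) -> list:
    """Return a list of shape problems; an empty list means the shape is OK.

    `required` maps field name -> expected Python type (an assumed, simple schema).
    """
    if not isinstance(response, dict):
        return [f"expected dict, got {type(response).__name__}"]
    errors = []
    for key, expected_type in required.items():
        if key not in response:
            errors.append(f"missing field: {key}")
        elif not isinstance(response[key], expected_type):
            errors.append(
                f"field {key}: expected {expected_type.__name__}, "
                f"got {type(response[key]).__name__}"
            )
    return errors


def traced_call(name: str, tool: Callable, arguments: dict,
                required: dict, trace: list) -> Any:
    """Call the tool, record the call in `trace`, and raise on a bad shape."""
    record = ToolCallRecord(name=name, arguments=arguments)
    start = time.perf_counter()
    try:
        record.raw_response = tool(**arguments)
    finally:
        # Record the call even if the tool itself raised.
        record.duration_s = time.perf_counter() - start
        trace.append(record)
    record.shape_errors = check_shape(record.raw_response, required)
    if record.shape_errors:
        # Turn the silent "successful but malformed" case into a loud one.
        raise ValueError(f"{name} returned a malformed result: {record.shape_errors}")
    return record.raw_response


def fake_weather_tool(city: str) -> dict:
    # Succeeds without raising, but temp_c is a string: the silent-wrong case.
    return {"city": city, "temp_c": "unknown"}


trace = []
try:
    traced_call("get_weather", fake_weather_tool, {"city": "Oslo"},
                {"city": str, "temp_c": float}, trace)
except ValueError as exc:
    print(exc)
print(trace[0].arguments, trace[0].shape_errors)
```

The point of raising inside the wrapper is that the malformed result never reaches the model; whether to raise, retry, or hand the model an explicit error message is a design choice this sketch does not settle. In a real setup the record would presumably be emitted as a span to whatever tracing backend is in use rather than appended to a list.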
So the real question for anyone running MCP in anger: what does your tool-call debugging actually look like? Is anyone doing something smarter than “trace everything and read it after the fact”? I’m specifically curious how people catch the silent-wrong-output case before it reaches a user. THANK YOUUU!!

Originally posted by u/Total_Listen_4289 on r/ClaudeCode