Original Reddit post

I’m a heavy Claude Code user, a Max subscriber, and I’ve been using it consistently for about a year, but in the last few days I’ve been running into a clear drop in output quality. I used Claude Code to help implement and revise E2E tests for my Electron desktop app, and I kept seeing the same pattern: it often said it understood the problem, it could restate the bug correctly, and it could even point to the exact wrong code. But after that, it still did not really fix the issue.

Another repeated problem was task execution. If I gave it 3 clear tasks, it often completed only 1. The other 2 were not rejected and were not discussed; they were just dropped. This happened more than once. So the problem was not one bad output. The problem was repeated failure in execution, repeated failure in follow-through, and repeated failure in real verification.

Here are some concrete examples. In one round, it generated a large batch of E2E tests and reported that the implementation had been reviewed. After I ran the tests, many basic errors appeared immediately:

- A selector used getByText('Restricted') even though the page also contained "Unrestricted". That caused a strict mode violation.
- Some tests used an old request shape like { agentId } even though the server had already moved to { targetType, targetId }.
- One test tried to open a Tasks entry that did not exist in the sidebar.
- Some tests assumed a component rendered a data-testid, but the real component did not expose that attribute at all.

These were not edge cases. These were direct mismatches between the test code and the real product code.

Then it moved into repair mode. The main issue here was not only that it made mistakes. The bigger issue was that it often already knew what the mistake was, but still did not resolve it correctly. For example, after the API contract problem was already visible, later code still continued to rely on helpers built on the wrong assumptions.
A helper for conversation creation was using the wrong payload shape from the beginning. That means many tests never created the conversation data they later tried to read. The timeouts were not flaky; the state was never created. So even when the root cause was already visible, the implementation still drifted toward patching symptoms instead of fixing the real contract mismatch.

The same thing happened in assertion design. Some assertions looked active, but they were not proving anything real. Examples:

- expect(response).toBeTruthy() only proves that the model returned some text. It does not prove correctness.
- expect(toolCalls.length).toBeGreaterThanOrEqual(0) is always true.
- Checking JSON by looking for { is not schema validation. That is string matching.

In other words, the suite had execution, but not real verification.

Another serious problem was false coverage. Some tests claimed to cover a feature, but the assertions did not prove that feature at all:

- A memory test stored and recalled data in the same conversation. The model could answer from the current chat context, so that does not prove persistent memory retrieval.
- A skill import test claimed that files inside scripts/ were extracted, but it only checked that the skill record existed. It never checked whether the actual file was written to disk.
- An MCP transport test claimed HTTP or SSE coverage, but the local test server did not even expose real MCP routes. The non-fixme path only proved that a failure-shaped result object could be returned.

So the test names were stronger than the actual validation.

I also saw contract mismatch inside individual tests. One prompt asked for a short output, such as "just the translation", but the assertion required the response length to be greater than a large threshold. That means a correct answer like "Hola." could fail. This is not a model creativity issue. This is a direct contradiction between the prompt contract and the assertion contract.
The review step had the same problem. Claude could produce status reports, summarize progress, and say that files were reviewed. But later inspection still found calls to non-existent endpoints, fragile selectors, fake coverage, weak assertions, and tests that treated old data inconsistency as if it were a new implementation failure.

So my problem is not simply that Claude Code makes mistakes. My real problem is this:

- It can describe the issue correctly, but still fail to fix it.
- It can acknowledge missing work, but still leave it unfinished.
- It can be given 3 tasks, complete 1, and silently drop the other 2.
- It can report progress before the implementation is actually correct.
- It can produce tests that look complete while important behavior is still unverified.

That is the part I find most frustrating. The failure mode is not random; it is systematic. It tends to optimize for visible progress, partial completion, and plausible output structure. It is much weaker at strict follow-through, full task completion, and technical verification of real behavior. That is exactly what I kept running into.

Originally posted by u/takeurhand on r/ClaudeCode