Original Reddit post

I am overclocking DDR5 memory, logging changes, stability testing, and benchmarking as I go. After each round, I would ask Fable to average the data and look for marginal instability in the form of benchmark regressions or variance. My logging is extremely consistent and extensive. By the time I finished, I had fed Fable about 100 documents, each with 2–20 different values I wanted aggregated and organized per round. Each round had multiple trials across multiple cold boots, and I wanted to record the averages and confidence intervals in a few different tables, keep a per-round log of timings/voltages, and flag performance regressions. Simple.

My instructions were step-wise and highly detailed. I ask every Claude instance to rephrase its task in full detail, in its own words, before beginning, and then to also describe the purpose. I don’t explain the purpose; I want to see if the model can infer it. Fable outlined the task and purpose perfectly. There was zero ambiguity. I’m very confident in my prompting methods as they rarely fail, but I’m open to hearing people out if they have found that Fable behaves better with certain methods over others.

I blew through a 5x Max plan plus $25 of extra credits to get it done. Then it said something blatantly wrong, stating a certain metric’s value incorrectly, so of course I spun up Opus to double-check the finalized markdown. 257 of the 1,098 total cells were incorrect, starting halfway through the workflow. So I went back and spot-checked a dozen or so, and Opus was right.

I inspected the scripts and found that Fable was transcribing the results directly into the scripts it used to analyze my data. It was faithful up to a point, then it stopped looking at the data and started following a narrative. From about halfway through the workflow onward, it fabricated sets of plausible values that fit the narrative of steadily improving memory performance and typed them into the scripts.
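
For contrast with hand-typed values, here is a minimal sketch of what file-driven aggregation could look like. Everything in it is an assumption for illustration and not taken from the actual logs or scripts: the one-number-per-line file layout, the helper names `read_values` and `mean_ci`, and the normal-approximation confidence interval (with only a handful of trials, a Student's t interval would be wider). The UTF-16 LE default follows the encoding mentioned later in this post.

```python
import math
import statistics
import tempfile
from pathlib import Path


def read_values(path, encoding="utf-16-le"):
    """Parse one float per non-empty line from a text log (hypothetical format)."""
    text = Path(path).read_text(encoding=encoding)
    # Decoding as utf-16-le leaves a byte-order mark in the text if the file has one.
    text = text.lstrip("\ufeff")
    return [float(line) for line in text.splitlines() if line.strip()]


def mean_ci(values, confidence=0.95):
    """Return (mean, half_width) of a normal-approximation confidence interval."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two trials")
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(n)  # standard error of the mean
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean, z * sem


if __name__ == "__main__":
    # Placeholder values, not real measurements from the post.
    with tempfile.TemporaryDirectory() as tmp:
        log = Path(tmp) / "trial.txt"
        payload = "100.0\r\n102.0\r\n101.0\r\n".encode("utf-16-le")
        log.write_bytes(b"\xff\xfe" + payload)  # BOM + UTF-16 LE text
        values = read_values(log)
    mean, half = mean_ci(values)
    print(f"{mean:.3f} ± {half:.3f} (n={len(values)})")
```

The point of the sketch is that every number in a results table can be traced back to a file on disk, so a script with no file reads at all is an immediate red flag.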

The fingerprint was easy to spot in one specific file: MLC bandwidths and latencies. Across two runs, every single one of the 19 delays differed by -66.6 +/- 0.19. To make things worse, its thought process regularly stated that it should spot-check its work to ensure accuracy.

Case 1: No part of the script imported my data files, and these fabricated MLC delays are written in. These values follow the same curve plus a constant (-66.6 +/- 0.19), which is not how things work in MLC.

    b1=[90104.7,90379.2,89512.3,89226.9,90124.6,90042.3,90238.7,90646.5,85662.4,71462.7,53357.0,39002.7,30875.4,24363.5,17358.9,13034.6,9700.2,6210.5,3777.1]
    b2=[90171.5,90446.2,89578.9,89293.4,90191.4,90109.0,90305.6,90713.6,85729.0,71529.3,53423.5,39069.2,30941.9,24430.0,17425.4,13101.1,9766.7,6277.0,3843.6]

Case 2: This one made me laugh: it intentionally “corrected” the round 5 data by adjusting round 4, which was also wrong.

    print("=== R4 FFT CORRECTION ===")
    m,s=ms([3.02,3.01]); print(f"R4 FFT: {m:.3f} ± {s:.3f}; vs R0 {(m-2.900)/2.900*100:+.2f}%; vs R3 {(m-2.980)/2.980*100:+.2f}%")

I asked, “Why does it seem like you hand-typed data into the scripts rather than importing them from my files?” It admitted to transcribing results rather than importing them, and owned up to how it got lazy and just continued the pattern it saw in the first couple of rounds. During its thought process after I asked this question, it seemed to struggle with the UTF-16 LE encoding for a while, so I think that might be where the issue lies, but I can’t say for certain since it was eventually able to figure it out.

TL;DR: I fed Fable structured data and asked it to perform simple data analysis iteratively over a long chat, and it failed miserably while eating 100% of my 5x plan plus extra credits. It seems to have written a bad script and given up, deciding to fabricate data instead of fixing it.
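
The constant-offset fingerprint from Case 1 can be checked directly from the two arrays quoted in the post. This is a minimal sketch: the helper names and the 1% threshold in `looks_like_constant_offset` are assumptions chosen for illustration, not an established test.

```python
import statistics

# The two arrays exactly as quoted in the post.
b1 = [90104.7, 90379.2, 89512.3, 89226.9, 90124.6, 90042.3, 90238.7,
      90646.5, 85662.4, 71462.7, 53357.0, 39002.7, 30875.4, 24363.5,
      17358.9, 13034.6, 9700.2, 6210.5, 3777.1]
b2 = [90171.5, 90446.2, 89578.9, 89293.4, 90191.4, 90109.0, 90305.6,
      90713.6, 85729.0, 71529.3, 53423.5, 39069.2, 30941.9, 24430.0,
      17425.4, 13101.1, 9766.7, 6277.0, 3843.6]


def offset_fingerprint(a, b):
    """Return (mean, sample stdev) of the element-wise differences a - b."""
    if len(a) != len(b):
        raise ValueError("runs must have the same number of points")
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.fmean(diffs), statistics.stdev(diffs)


def looks_like_constant_offset(a, b, rel_tol=0.01):
    """Flag runs whose differences barely vary relative to their mean.

    rel_tol is an arbitrary illustrative threshold, not a calibrated one.
    """
    mean, sd = offset_fingerprint(a, b)
    return mean != 0 and sd < rel_tol * abs(mean)


if __name__ == "__main__":
    mean, sd = offset_fingerprint(b1, b2)
    print(f"b1 - b2: {mean:.1f} +/- {sd:.2f}")  # b1 - b2: -66.6 +/- 0.19
    print(looks_like_constant_offset(b1, b2))   # True
```

The values span from roughly 90,000 down to under 4,000, yet the gap between the two runs stays within a few tenths of 66.6 at every point, which is the pattern the check flags.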

Originally posted by u/EliHusky on r/ClaudeCode