A month ago I set up a system to answer a simple question: do large language models have any real predictive signal on short-term Bitcoin prices, or are they just confidently wrong?

The setup: every day at 06:00 UTC an automated script queries 10 models with an identical structured prompt asking for a Bitcoin price prediction 7 days from now. On the target date, I record the actual price and grade it:

`accuracy = 100 - min(100, abs(predicted - actual) / actual * 100)`

100% = perfect. 0% = off by 100% or more. Negative = off by more than 200% (yes, this happened).
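A minimal sketch of that grading rule as written above; the function name and example prices are my own illustration, not the author's actual script:

```python
def grade(predicted: float, actual: float) -> float:
    """Grade a prediction: 100 = perfect, lower = worse.

    Mirrors the formula quoted in the post. Note that the min() cap
    would floor scores at 0, while the leaderboard shows negative
    scores, so the live grader may skip the cap.
    """
    error_pct = abs(predicted - actual) / actual * 100
    return 100 - min(100, error_pct)


# Example: predicting $95,000 against an actual price of $100,000
# is a 5% absolute error, so the score is 95.0.
print(grade(95_000, 100_000))
```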
7-day leaderboard (25 graded data points per model; the full table is on the site linked at the end). What's interesting here (and where I'd love your take):

- Perplexity nearly hits 100% on some days. It's a web-connected model: it can see live BTC prices during inference. That raises a legitimate question: is it actually predicting, or just reading the current price and adding noise? The 7-day window means the target date is a week away, so it can't look it up directly, but its training and web access might give it an edge on sentiment signals. Is this a confound or a valid signal?
- Gemini went to -43% accuracy. This isn’t a one-off — its average over 25 days is 12.2%. Gemini 2.5 Flash is arguably the most capable reasoning model in the benchmark, yet it’s consistently the worst price predictor. My guess: it over-reasons and second-guesses itself into extreme positions. Would love to hear if others have seen similar reasoning-capability ≠ calibration patterns.
- Mistral’s range is 34.5% to 99.7%. The highest single-day accuracy of any model, but also one of the worst floors. It seems bimodal — some days it nails it, some days it’s wildly off. Not sure if this is prompt sensitivity, temperature effects, or something about how Mistral handles numerical uncertainty.
- Qwen and ChatGPT have identical scores. 89.31% average, 87.18% min, 91.34% max — to 2 decimal places. I’m querying them independently with the same prompt. Either they’ve converged on very similar price-prediction heuristics, or there’s something in the prompt that anchors both models to similar outputs. Curious if anyone has a hypothesis.
- Model size/capability doesn't track accuracy at all. Llama 3.3 70B sits below DeepSeek V3 and Claude. Command R, a much smaller model, beats Grok. The correlation between benchmark performance and price prediction accuracy is effectively zero.

Methodological questions I'm genuinely unsure about:

- Same prompt for all models: is this fair, or should I use model-specific prompting? It feels like it introduces prompt-sensitivity bias, but it controls for content.
- Temperature: I'm using defaults for all models. Does this matter significantly for numerical outputs?
- 25 data points is still thin for drawing strong conclusions. What's your intuition on the minimum sample size before the rankings stabilize?
- Should I be using a different accuracy metric? Log error, MAPE, directional accuracy? (See the sketch at the end of the post.)

The full leaderboard, daily changes, and methodology are at aipredictsbitcoin.com. The short-term predictions page shows individual graded results with the actual vs. predicted prices.

Feedback welcome. If this is interesting to a lot of people I will update every month.
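On the metrics question above, here is a minimal sketch of the three alternatives named in the post. The function names and example numbers are illustrative only; MAPE would just be the mean of the per-day absolute percentage error:

```python
import math

def abs_pct_error(predicted: float, actual: float) -> float:
    """Absolute percentage error in percent (averaging these per day gives MAPE)."""
    return abs(predicted - actual) / actual * 100

def log_error(predicted: float, actual: float) -> float:
    """Absolute log error: symmetric in ratio terms, so over- and
    under-shoots of the same ratio score the same."""
    return abs(math.log(predicted / actual))

def directional_hit(price_at_prediction: float, predicted: float, actual: float) -> bool:
    """Did the model call the direction (up vs. down from the price
    at the time the prediction was made)?"""
    return (predicted - price_at_prediction) * (actual - price_at_prediction) > 0

# Hypothetical example: BTC at $100k when the prediction was made,
# the model predicted $104k, and the actual price a week later was $97k.
price_then, predicted, actual = 100_000, 104_000, 97_000
print(abs_pct_error(predicted, actual))                # ~7.2
print(log_error(predicted, actual))                    # ~0.070
print(directional_hit(price_then, predicted, actual))  # False: predicted up, it went down
```

Directional accuracy would then be the fraction of days with a hit, which sidesteps the magnitude question entirely.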
Originally posted by u/OkFigure5512 on r/ArtificialInteligence
