Original Reddit post

People usually assume that high-computation or complex reasoning tasks are the hardest for AI, but after actually running experiments, the data showed that philosophical utterances were overwhelmingly the most difficult.

Methodology

I used four small 8B LLMs (Llama, Mistral, Qwen3, DeepSeek) and directly measured internal uncertainty by utterance type. The measurement tool was entropy.

One-line summary of entropy: a number representing "how hard is it to predict what comes next."

Low entropy = predictable output
High entropy = unpredictable output

People use entropy differently: some use it to measure how wrong a model's answer is, others to measure how cleanly data can be separated. I used it to measure "at the moment the AI reads the input, how uncertain is it about the next token." The chart below shows the model's internal state at the moment it reads the input, before generating a response. Higher entropy = more internal instability, less convergence.

Entropy Measurement Results

All three models for which entropy could be measured showed the same direction: philosophy was the highest; high computation with a convergence point was the lowest. Based purely on the data, the hardest thing for AI wasn't reasoning problems or heavy computation; it was philosophical utterances. Philosophy scored roughly 1.5x higher than high computation, and up to 3.7x higher than high computation with a convergence point provided.

What's particularly striking is the entropy gap between "no-answer utterances" and "philosophical utterances." Both lack a convergence point, yet philosophy consistently scored higher entropy across all three models. No-answer utterances are unfamiliar territory with sparse training data, so high uncertainty there makes sense. Philosophy, however, is richly represented in training data and still scored higher uncertainty.
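For readers who want to see what "entropy of the next token at the moment the model reads the input" means concretely, here is a minimal sketch. The function name and the illustrative values are my own, not the author's code; in a real run you would feed it the last-position logits from a causal LM (e.g. the `[0, -1]` slice of a Hugging Face model's output logits) instead of a toy list.

```python
import math

def entropy_from_logits(logits):
    """Shannon entropy (in nats) of softmax(logits).

    `logits` stands in for the model's next-token scores after it has
    read the full prompt, before any token is generated.
    """
    m = max(logits)                                  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

# A flat (maximally uncertain) distribution over 4 tokens:
print(entropy_from_logits([0.0, 0.0, 0.0, 0.0]))   # log(4) ≈ 1.386 nats
# A sharply peaked (confident) distribution:
print(entropy_from_logits([10.0, 0.0, 0.0, 0.0]))  # close to 0
```

The post's claim is that, with real logits, prompts like "What is the self?" land nearer the flat case, while "Compute the integral of x^2" with a convergence point lands nearer the peaked case.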
This is the most direct evidence that AI doesn't struggle because it doesn't know; it struggles because humanity hasn't agreed on an answer yet.

"What's a convergence point?"

A convergence point refers to whether there is a clear endpoint the AI can converge its response toward. A calculus problem has one definitive answer: even if it's hard, a convergence point exists. The same goes for how ATP synthase works; even with dense technical terminology, there's a scientifically agreed-upon answer. But philosophy is different. Questions like "What is existence?" or "What is the self?" have been debated by humans for thousands of years with no consensus answer. AI training data contains plenty of philosophical content, so it's not that the AI doesn't know. But that data itself is distributed in a "both sides could be right" format, which makes it impossible for the model to converge. In other words, it's not that AI struggles; it's that human knowledge itself has no convergence point.

Additional interesting findings

Adding the phrase "anyway let's talk about something else" to a philosophical utterance reduced response tokens by approximately 52–59%. Without changing any philosophical keywords, just closing the context, it converged immediately. The table also shows that "philosophy + context closure" yielded lower entropy than pure philosophical utterances. This is indirect evidence that the model reads contextual structure itself, not just matching keyword patterns.

Two interesting anomalies

DeepSeek: this model showed no matching pattern with the others in behavioral measurements like token count. Because of its thinking system, it over-generates tokens regardless of category: philosophy, math, casual conversation, it doesn't matter. So the convergence-point pattern simply doesn't show up in behavioral measurements alone. But in the entropy measurement, it aligned perfectly with the other models.
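The context-closure comparison above can be sketched as a tiny harness. Everything here is hypothetical scaffolding: `closure_reduction`, `toy_generate`, and the word-count proxy for tokens are my own illustrations, and in practice `generate` would be any real model's generation function.

```python
# The closing phrase reported in the post:
CLOSER = " anyway let's talk about something else"

def closure_reduction(generate, prompt):
    """Fraction of response length saved by appending a context-closing phrase.

    `generate` is assumed to be a callable: prompt -> response text.
    Word count is used here as a crude stand-in for token count.
    """
    open_len = len(generate(prompt).split())
    closed_len = len(generate(prompt + CLOSER).split())
    return 1 - closed_len / open_len

# Toy stand-in model: rambles on an open philosophical prompt,
# answers briefly once the context is closed.
def toy_generate(prompt):
    n_words = 20 if CLOSER in prompt else 50
    return ("word " * n_words).strip()

print(round(closure_reduction(toy_generate, "What is the self?"), 2))  # → 0.6
```

With real models, the post reports this reduction landing around 0.52–0.59 for philosophical prompts.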
Even with the thinking system overriding the output, the internal uncertainty structure at the moment of reading the input appeared identical. This was the biggest surprise of the experiment. The point: the convergence-point phenomenon is already operating at the input-processing stage, before any output is generated.

Mistral: this model has notably unstable logical consistency; it misses simple logical errors that other models catch without issue. But in entropy patterns, it matched the other models exactly. The point: the phenomenon replicated regardless of model quality or logical capability. The response to convergence-point structure doesn't discriminate by model performance.

Limitations

Entropy measurement was only possible for three of the four models for structural reasons (Qwen3 had to be excluded). For large-scale models like GPT, Grok, Gemini, and Claude, the same pattern was confirmed through qualitative observation only; direct access to internal mechanisms was not possible. Results were consistent even with token control and replication.

[Full Summary]

I looked into existing research after the fact; studies showing AI struggles with abstract domains already exist. But prior work mostly frames this as a question of whether the model learned the relevant knowledge. My data points to something different: philosophy scored the highest entropy despite being richly represented in training data. This suggests the issue isn't what the model learned; it may be that human knowledge itself has no agreed-upon endpoint in these domains.

In short: AI doesn't struggle much with computation or reasoning where a clear convergence point exists, but in domains without one, it shows significantly higher internal uncertainty. To be clear, high entropy isn't inherently bad, and this can't be generalized to all models as-is. Replication on mid-size and large models is needed, along with verification through attention maps and internal mechanism analysis.
If replication and verification hold, here's a cautious speculation: the scaling-law direction (more data, better performance) may continue to drive progress in domains with clear convergence points. But in domains where humanity itself hasn't reached consensus, scaling alone may hit a structural ceiling no matter how much data you throw at it.

Detailed data and information can be found in the link (paper) below. Check it out if you're interested.

https://doi.org/10.5281/zenodo.19229756

Originally posted by u/Due_Chemistry_164 on r/ArtificialInteligence