The core math behind the Google Transformer is not symbolic reasoning or logic; it is linear algebra, probability, and calculus arranged in a very specific way.

Everything starts by turning text into numbers. Each word or token is mapped to a vector: a long list of real numbers. These vectors live in a high-dimensional space and are learned during training, so the model slowly shapes where words sit relative to one another.

From each token vector, the model computes three new vectors using matrix multiplication. These are called queries, keys, and values. Mathematically, this is just the original vector multiplied by three different learned matrices. There is nothing mysterious here; it is basic linear algebra. The purpose is to create different representations of the same token so it can ask questions about other tokens, be compared against them, and carry information forward.

The heart of the Transformer is attention. Attention works by taking the dot product between the query vector of one token and the key vectors of all other tokens. A dot product measures similarity in vector space, essentially asking how aligned two vectors are. These similarity scores are then divided by the square root of the key dimension to keep the numbers from growing too large, which is purely a numerical stability trick. After that, a softmax function is applied. Softmax converts the raw similarity scores into probabilities that are all positive and sum to one. This turns similarity into a distribution of attention: how much focus each token gives to every other token.

Once those probabilities are computed, they are used to take a weighted sum of the value vectors. The result is a new vector for each token that mixes information from other tokens, weighted by relevance. This is how context is formed: every token becomes a blend of other tokens rather than being processed in isolation.

Instead of doing this once, the Transformer uses multi-head attention.
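Everything up to this point, queries, keys, scaling, softmax, and the weighted sum, fits in a few lines of NumPy. This is a minimal sketch of a single attention head; the shapes and random weights are purely illustrative, not taken from any real model:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating, for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product attention over token vectors X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v   # queries, keys, values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # dot-product similarity, scaled
    weights = softmax(scores, axis=-1)    # each row is a distribution over tokens
    return weights @ V                    # weighted sum of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))               # 4 tokens, embedding dimension 8
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = attention(X, W_q, W_k, W_v)
print(out.shape)  # (4, 8): one context-mixed vector per token
```

Multi-head attention, described next, simply runs this computation several times in parallel with different projection matrices.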
Multiple attention operations run in parallel, each with its own learned projection matrices. Each head looks at the same input but learns different patterns, such as syntax, long-range dependencies, or local relationships. The outputs of all heads are concatenated and passed through another matrix multiplication to mix them together. This is still just linear algebra applied repeatedly.

Transformers have no built-in sense of word order, so positional information must be added manually. The original design introduced sinusoidal positional encodings using sine and cosine functions at different frequencies. These functions inject position into the vectors in a smooth, continuous way and allow the model to generalize to longer sequences. Mathematically, this is closely related to Fourier features and signal processing.

After attention, each token is passed through a feed-forward neural network independently. This network consists of a linear transformation, a nonlinear activation function like ReLU or GELU, and another linear transformation. This step increases the model's expressive power by letting it reshape information nonlinearly.

To make deep stacks of these layers trainable, residual connections and layer normalization are used. The input to each sublayer is added back to its output, and the result is normalized. This stabilizes gradients and prevents information from degrading as it flows through many layers. Without this, training deep Transformers would fail.

Training the model uses standard optimization math. The model predicts a probability distribution over the next token using a softmax layer. A cross-entropy loss compares this distribution to the correct token. Backpropagation computes gradients of this loss with respect to every parameter in the network, including all attention matrices and embeddings. Gradient descent or its variants then update those parameters slightly.
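The softmax-plus-cross-entropy step just described is compact in code. Here is a minimal sketch; the five-token vocabulary and the logit values are made up for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # shift for numerical stability
    return e / e.sum()

# Raw scores (logits) over a toy 5-token vocabulary; index 2 is the true next token.
logits = np.array([1.0, 0.5, 3.0, -1.0, 0.2])
probs = softmax(logits)               # positive, sums to one
loss = -np.log(probs[2])              # cross-entropy: low probability on the truth
                                      # means high loss
print(probs.round(3), round(float(loss), 3))
```

Gradients of this loss with respect to every weight are what backpropagation computes, and gradient descent nudges the weights to shrink it.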
This process is repeated trillions of times, which is why training is so computationally expensive.

In the end, the Transformer introduced by researchers at Google is not powered by reasoning or understanding in a human sense. It is powered by dot products, matrix multiplications, probability distributions, and gradient descent, scaled to an extreme degree. Its strength comes from structure and scale, not from any hidden symbolic intelligence.

A neural network is not a brain and it does not think. At its core it is a mathematical system that takes numbers in, transforms them through layers of simple operations, and outputs numbers at the other end. Everything people describe as intelligence comes from how those numbers are arranged and adjusted, not from understanding or intent.

The basic unit of a neural network is an artificial neuron. A neuron receives several inputs, where each input is just a numerical value. These inputs might represent pixel brightness, sound amplitudes, sensor readings, or abstract embedding values. On their own these numbers have no meaning; meaning only appears through how the network treats them.

Each input is multiplied by a weight. Weights determine how much influence an input has on the neuron's output. A large positive weight means the input strongly pushes the output higher. A small weight means the input barely matters. A negative weight means the input pushes the output in the opposite direction. Most of what a neural network "knows" is encoded in these weight values.

After multiplying inputs by their weights, the neuron adds all the results together to produce a single number. This is called the weighted sum. At this stage the neuron has not made a decision yet; it has only combined evidence into a raw score.

Next, a bias value is added to the weighted sum. The bias acts like a threshold offset. It allows the neuron to activate even when the inputs are small, or to stay inactive unless the combined signal is strong enough.
Early neural networks used hard thresholds that switched outputs on or off. Modern networks use smoother versions of this idea, but the role is the same.

The result is then passed through an activation function. This step is crucial. The activation function introduces nonlinearity, meaning the output is not just a straight linear combination of inputs. Without activation functions, stacking many layers would be pointless because the entire network would collapse into a single linear equation. Functions like ReLU, sigmoid, tanh, or GELU allow networks to model complex, curved relationships in data.

The output of the activation function becomes the neuron's output. That output can either be passed into neurons in the next layer or, if the neuron is in the final layer, used as the network's prediction. Depending on the task, outputs might be a single number, a probability distribution, or a set of scores representing different options.

Neural networks are built by stacking neurons into layers. The input layer simply passes raw values forward. Hidden layers perform transformations using weights, biases, and activation functions. The output layer produces the final result. Deep networks are just many repetitions of the same simple mathematical structure.

Training a neural network does not involve teaching it rules or concepts. The network makes a prediction, compares it to the correct answer, measures how wrong it was, and then slightly adjusts its weights to reduce that error. This process is repeated millions or billions of times. Over time, the network becomes good at mapping inputs to outputs, but it never understands why those mappings work.

This is why neural networks are excellent at pattern recognition, interpolation, and statistical approximation, but poor at causality, reasoning, and knowing when they are wrong. They do not build internal models of the world. They simply optimize large collections of numbers to reduce error on past data.
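Put together, a single neuron is only a few lines of code: inputs times weights, plus a bias, through an activation. The input and weight values below are made up for illustration:

```python
import numpy as np

def relu(x):
    # ReLU activation: negative raw scores become 0, positives pass through.
    return np.maximum(0.0, x)

def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum + bias, then a nonlinearity."""
    z = np.dot(inputs, weights) + bias   # the raw score (weighted sum + offset)
    return relu(z)

x = np.array([0.5, -1.0, 2.0])   # example inputs (e.g. pixel values)
w = np.array([0.8, 0.2, -0.5])   # learned weights
b = 0.1                          # learned bias

print(neuron(x, w, b))   # 0.0: the raw score is -0.7, and ReLU clips it to zero
```

A whole layer is just this operation done for many neurons at once, which is why it collapses into a single matrix multiplication.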
In short, a neural network is a layered system of weighted sums, thresholds, and nonlinear transformations that statistically maps inputs to outputs. Any appearance of intelligence comes from scale and data, not from comprehension or agency.

So what is backpropagation? It is how a neural network learns: the method used to figure out which internal weights caused a mistake, and how to slightly adjust them so the next answer is a bit better.

In plain terms, a neural network repeats the same cycle over and over. First, there is a forward pass. The input goes in, the network processes it, and it makes a prediction. For example, it might say "this image is a cat" with 70 percent confidence. Then comes the backward pass, which is backpropagation. The prediction is compared to the correct answer, and the system measures how wrong it was. This error is called the loss. That error is then sent backward through the network, assigning responsibility to each weight based on how much it contributed to the mistake. Each weight is adjusted slightly depending on its role in the error. That backward assignment of blame is what backpropagation actually is.

Backpropagation is needed because neural networks can have millions or even billions of weights. There is no way to manually guess which ones to change or by how much. Backpropagation uses calculus, specifically the chain rule, to calculate how much each individual weight affected the final error and the exact direction it should be changed to reduce that error.

The key mathematical intuition is simple even without symbols. If changing a weight increases the error, you push that weight down. If changing a weight decreases the error, you push it up. The size of that push depends on how sensitive the error is to that specific weight. That sensitivity is called a gradient. This is why you'll often hear the phrase that backpropagation plus gradient descent equals learning.
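That push-up, push-down rule can be run end to end on the smallest possible network: one weight, one input, and a squared-error loss, with the chain rule written out by hand. The learning rate, input, and target below are arbitrary:

```python
# Model: prediction = w * x.  Loss = (w * x - target)**2.
# Chain rule: d(loss)/dw = 2 * (w * x - target) * x.

def train_step(w, x, target, lr=0.1):
    error = w * x - target    # forward pass: prediction minus truth
    grad = 2 * error * x      # backward pass: sensitivity of the loss to w
    return w - lr * grad      # push w against the gradient

w = 0.0
for _ in range(50):
    w = train_step(w, x=2.0, target=3.0)

print(round(w, 4))   # approaches 1.5, since 1.5 * 2.0 == 3.0
```

Real backpropagation does exactly this for billions of weights at once, reusing intermediate results layer by layer so the chain rule never has to be recomputed from scratch.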
In one sentence, backpropagation is an efficient way to calculate how every weight in a neural network should change to reduce error, by sending the error backward from the output layer to the input.

Once a model like ChatGPT finishes training, all of its weights are fixed numbers. It cannot modify them during use, store new memories, integrate new facts, or update its world model. Any "learning" you see during a conversation is not learning at all; it is temporary pattern tracking inside the context window, which vanishes after the session. You cannot teach the model new facts without retraining or fine-tuning, both of which are resource-intensive and require massive compute. In-chat learning is illusory: it is just conditioning the output on the provided context, which evaporates afterward.

If you do adjust weights to teach the model something new, here is what happens. Neurons are shared across millions of concepts, so changing one weight affects many unrelated behaviours. New learning overwrites old representations, and the model forgets previous skills or facts. This is called catastrophic forgetting. Unlike human brains, neural networks do not naturally protect old knowledge.

Why is targeted learning nearly impossible? You might think, "just update the weights related to that one fact," but the problem is that knowledge is distributed, not localized. There is no single memory cell for a fact; every concept is encoded across millions or billions of parameters in overlapping ways, so you cannot safely isolate updates without ripple damage. Facts are not stored in isolated memory cells but holistically across the network. A concept like gravity might involve activations across billions of parameters, intertwined with apples, Newton, and physics equations. Targeted updates are tricky. Approaches like parameter-efficient fine-tuning help by tweaking only a small subset of parameters, but they do not fully solve the isolation problem.
A lot of people don't really grasp why training models like ChatGPT keeps getting insanely expensive, so here's the blunt reality.

The core task an LLM performs during training is brute-force statistical compression. It isn't "learning concepts" the way humans do. It's constantly asking one question over and over: given everything I've seen so far, what token is most likely next? To make that work you have to show it trillions of tokens, calculate probabilities across tens or hundreds of thousands of possibilities, and repeat this process while nudging billions of parameters by microscopic amounts. There are no shortcuts here. It's raw numerical grind.

The real compute killer is backpropagation. For every token the model does a forward pass to predict the next token, computes the error, then does a backward pass that adjusts enormous numbers of weights. That backward pass is brutal. It touches billions of parameters, relies on massive matrix multiplications, and requires high numerical precision. This is why GPUs and TPUs are mandatory; CPUs would take centuries.

What actually improved model quality over time wasn't some hidden algorithmic breakthrough. It was scale. More parameters, more data, more compute. That's it. And scale doesn't grow linearly. A ten times bigger model doesn't cost ten times more. Once you include memory limits, interconnect bandwidth, synchronization overhead, and retries, it can easily cost twenty to forty times more.

At these scales, data movement hurts almost as much as the math itself. GPUs spend huge amounts of time waiting on memory. Models are sharded across thousands of accelerators. Just keeping everything synchronized burns enormous amounts of power. Training is no longer compute-bound; it's infrastructure-bound.

Another thing people rarely talk about is how often large training runs fail. Hardware faults happen. NaNs happen. Runs diverge. Hyperparameters turn out wrong.
Massive runs are frequently restarted multiple times, and every restart costs real money.

So when people ask how much future ChatGPT-class models cost to train, here's a realistic order-of-magnitude view, not marketing numbers. Earlier generations were roughly ten to fifty million dollars, around 10²⁴ FLOPs, using thousands of GPUs for weeks. Current frontier models are more like one hundred to three hundred million dollars, around 10²⁵ FLOPs, using ten thousand plus accelerators for months. The next generation is very likely five hundred million to over a billion dollars for a single training run, around 10²⁶ FLOPs: effectively entire data-center-scale operations with power consumption comparable to a small town. And that's before fine-tuning, safety training, red-teaming, and deployment optimization.

The reason costs keep rising instead of falling lines up perfectly with physical reality. Compute lives in matter. Matter wears out. Energy is not free. Chips don't scale the way they used to. Moore's Law is effectively dead, and brute force replaced it. Every new model is basically "spend more money, burn more hardware, hope scaling still works."

The uncomfortable truth is that large language models are extremely expensive to train, moderately expensive to run, and fundamentally limited by physics, not software cleverness. They improve by throwing capital and energy at the problem, not by suddenly understanding anything. That's why skepticism about long-term sustainability isn't irrational; it's grounded in thermodynamics and material reality.

People argue that if we just keep increasing compute, data, and model size, AI capabilities will continue to scale. Others argue large language models are a dead end and will plateau. What does the math actually say? Over the last few years researchers, especially at OpenAI, discovered something called scaling laws.
When you increase model parameters, training data, and total training compute, the training loss decreases in a smooth and predictable way that follows a power law. In simplified form: Loss ∝ Compute^(−α), where the exponent α is small, typically around 0.05 to 0.1. What this means in practice is that every tenfold increase in compute gives a consistent, measurable improvement. Not random improvement. Not chaotic jumps. Smooth gains that follow a curve. This is the mathematical foundation behind the "just keep scaling" argument, and historically it has worked. Each generation of large models improved roughly in line with these scaling predictions.

However, power laws have diminishing returns built into them. Because the exponent is small, every additional tenfold increase in compute produces smaller real-world gains. The curve keeps improving, but it flattens. There is no sharp cliff in the math, no theorem that says intelligence suddenly stops at some number of parameters, but there is a clear pattern of increasingly expensive improvements. You can keep pushing, but the cost grows rapidly compared to the benefit.

There is also the data constraint. High-quality human-generated text is finite. Once models are trained on most of the available internet-scale data, further scaling depends on synthetic data, lower-quality data, or multimodal sources like images, audio, and video. If the quality or diversity of data stops increasing, the original scaling relationships may weaken. The math that predicted smooth improvements assumed certain data conditions. If those change, the curve can shift.

Another limitation comes from the objective itself. Large language models are trained to predict the next token. Backpropagation adjusts billions of weights to reduce prediction error.
Lower loss means better next-token prediction, but that objective may not automatically produce long-term planning, persistent memory, grounded reasoning, or autonomous agency. So even if the loss continues to decrease smoothly, certain kinds of capabilities could plateau because the training objective does not directly optimize for them.

There is also the physical and economic layer. Training compute scales roughly with parameters times data times training steps. If you double model size and double data, compute roughly quadruples. Hardware scaling is not infinite. Transistors cannot shrink forever. Energy costs matter. Memory bandwidth increasingly becomes the bottleneck. At some point the limiting factor is not mathematical possibility but physics and economics. Even if scaling still works in principle, the cost per incremental gain may become extreme.

So what does the math really conclude? It shows that scaling has worked and continues to produce improvements within the tested regime. It shows diminishing returns but not a hard wall. It does not prove that infinite intelligence will emerge from scaling alone, and it does not prove that large language models are a dead end. The current evidence says we are somewhere along a smooth but flattening curve. Whether that curve continues to yield transformative capabilities depends not just on more compute, but on data quality, architecture changes, and the physical limits of hardware.
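The flattening-curve arithmetic is simple enough to check directly. Plugging the exponent range quoted earlier (0.05 to 0.1) into Loss ∝ Compute^(−α) shows exactly how much each tenfold step buys:

```python
# Loss ∝ compute**(-alpha): how much improvement does each 10x of compute buy?
# Alpha values are the 0.05-0.1 range quoted earlier in the text.

for alpha in (0.05, 0.1):
    ratio = 10 ** (-alpha)   # loss multiplier per tenfold compute increase
    print(f"alpha={alpha}: loss falls to {ratio:.3f}x of its previous value "
          f"(about {(1 - ratio) * 100:.0f}% better per 10x compute)")
```

At the optimistic end of the range, an order of magnitude more compute removes roughly a fifth of the current loss; at the pessimistic end, roughly a tenth. The curve never stops improving, but each decade of spending buys the same modest fraction.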
Originally posted by u/LongjumpingTear3675 on r/ArtificialInteligence
