A Walk to the Heavens

The cursor blinked. Then words began to appear:

A large language model is mostly a file. Everything a computer holds (text, images, sound) is stored as numbers. A model is no different: a vast set of numbers arranged into layers. Llama 70B, for instance, is eighty layers deep. At its most compressed, it is about forty gigabytes, the size of a film collection.

Note (Llama 70B): We know this because Llama is open-weight; Meta has released the actual numbers that constitute the model for anyone to inspect. Most frontier models are sealed: you can converse with GPT-4 or Claude, but their weights are not available to you. Llama is an exception. And what the open weights reveal, when probed layer by layer, is a structure with a rough division of labour: the lower layers handle logic, pattern recognition, and language (the grammar of thought, one might say, without the content). Factual knowledge and computational power accumulate towards the top. The final layer, in particular, carries the weight of what the model actually knows about the world. Eighty floors, with the reference volumes shelved at the top.

Note (forty gigabytes): It is rather astonishing when you think about it: a whole system that, not a decade ago, would have required the kind of computing infrastructure that only Silicon Valley could afford. This transformation, to a notable extent, is founded on an approach taken by Ashish Vaswani and colleagues at Google. In a paper with the wonderfully confident title Attention Is All You Need, they proposed "a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely". Before this, teaching machines to process language had been, frankly, a complicated business. The Google researchers cut through the previous "convolutions". Rather than reading left to right, one word at a time, the transformer could look at every word in relation to every other word simultaneously: the whole sentence at once, the whole paragraph, all its relationships and dependencies, processed in parallel. It was, in computational terms, the difference between reading a page word by word and taking it in at a glance. Training time fell from weeks to days. And the result, once trained, is small enough to live in a file that you could download before lunch on a decent broadband connection.
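
For the curious, the idea of every word looking at every other word at once can be sketched in a few lines of Python. This is only a toy under loud assumptions: the four two-number vectors are invented for illustration, and a real transformer adds learned projections, many attention heads, and (in a model like Llama) a mask that hides later words from earlier ones.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Toy scaled dot-product self-attention with no learned projections.

    Each position scores itself against every position, then becomes a
    weighted average of all of them.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        mixed_vector = [
            sum(w * value[i] for w, value in zip(weights, vectors))
            for i in range(dim)
        ]
        outputs.append(mixed_vector)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical vectors standing in for four words (assumed, not real embeddings).
sentence = ["the", "dog", "followed", "him"]
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]

mixed, attention_weights = self_attention(toy_vectors)
for word, row in zip(sentence, attention_weights):
    print(word, [round(w, 2) for w in row])
```

Each printed row shows how much one word draws on every word in the sentence, itself included. None of the rows depends on another, which is why the whole grid can be computed in parallel.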

The numbers within each layer are organised into matrices, grids of numbers, each encoding relationships learned during training. These are what we refer to as 'weights', the record of every tiny adjustment made across months of exposure to vast amounts of human writing: books, arguments, poetry, code. For each stretch of text the model encountered, it tried to predict the next token. When it was wrong, the weights changed slightly, in the direction of a better guess. When it was right, they held. After long months of this, across thousands of specialised chips, the weights were frozen. What they produce emerges only when they work together, the way water emerges from hydrogen and oxygen without resembling either.

Note (every tiny adjustment): In 1986, psychologist David Rumelhart, together with Geoffrey Hinton and Ronald Williams, published a paper in Nature that solved a critical problem in artificial intelligence. How do you teach a network of connected units to learn from its own mistakes? Their answer, backpropagation, worked by sending the error backwards through every connection after each wrong guess, adjusting each weight slightly in the direction of a better answer. The procedure, as they wrote, "repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector". The model training and the weights we hear so much about today are built on their seminal work: a deceptively quiet publication, only a few pages long. The trees and the groves and the thorns that catch, think of those as the weights. They touch the travellers. They transform them. The less connected, the less salient, collapses. The more meaningful (in the eyes of the forest, that is) becomes the response.
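
A minimal sketch may make that adjustment concrete. It assumes a four-word vocabulary and a single table of weights, both invented here for illustration; a real model spreads the same kind of nudge across billions of weights by backpropagation. In this toy the nudge shrinks towards zero as the guess improves, which is roughly what it means for the weights to hold.

```python
import math

vocab = ["the", "dog", "followed", "him"]
index = {token: i for i, token in enumerate(vocab)}

# weights[i][j] is the score for token j coming right after token i.
# Starting from zeros means every guess is equally likely at first.
weights = [[0.0] * len(vocab) for _ in vocab]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(previous):
    """Probability of each vocabulary token following `previous`."""
    return softmax(weights[index[previous]])

def train_step(previous, target, learning_rate=0.5):
    """Nudge one row of weights towards the token that actually came next."""
    probs = predict(previous)
    row = weights[index[previous]]
    for j in range(len(vocab)):
        wanted = 1.0 if j == index[target] else 0.0
        error = probs[j] - wanted          # gradient of cross-entropy loss
        row[j] -= learning_rate * error    # small step towards a better guess
    return probs[index[target]]

text = ["the", "dog", "followed", "him"]
before = predict("the")[index["dog"]]
for _ in range(50):
    for previous, target in zip(text, text[1:]):
        train_step(previous, target)
after = predict("the")[index["dog"]]

print(f"P(dog | the) before training: {before:.2f}, after: {after:.2f}")
```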

When you send a prompt, it enters as tokens: fragments of words, each converted to its own set of numbers, distinct from every other token. Those numbers pass through the first layer. The matrix of weights transforms them and passes the result forward. That result enters the second layer. And so on, through all eighty. For each next token of the response, the model considers your prompt and everything it has produced so far, and makes another full pass through the layers. Then it does it again for the next token. And again. What we experience as fluent language is the visible result of those repeated hidden passages. Even now, we do not fully understand how these vast numerical passages resolve into particular meanings, judgements, or turns of phrase.

Note (each next token): Next-token prediction, the deceptively simple objective that drives the training of every large language model described here, is, at its core, just this: guess the next word. In Language Models are Unsupervised Multitask Learners, Radford and colleagues demonstrated that a model trained on nothing but this begins, at scale, to learn translation without being taught to translate, to summarise without being told what a summary is. Later, Kaplan, McCandlish, and Amodei, all then at OpenAI, showed in Scaling Laws for Neural Language Models that performance improved smoothly, reliably, like compound interest, as compute, data, and model size grew. This was when we began to see the billions being pumped into AI: if it was just about scale, well, that only takes money.

Note (turns of phrase): This is the ghost haunting the AI community, and it is, when you think about it, a rather embarrassing one. We built the thing. We know exactly what went into it: equations, electricity, and an almost deranged quantity of (stolen) human text. And yet, if you ask why it said what it just said, the honest answer is: we have no idea. Dario Amodei, who runs Anthropic, put it this way: "When a generative AI system does something, like summarize a financial document, we have no idea, at a specific or precise level, why it makes the choices it does", why it picks certain words over others, or why it occasionally makes a mistake despite usually being accurate. A whole field has since sprung up to solve this mystery. Mechanistic interpretability, as Bereska defines it, is an approach to "reverse engineering neural networks into human-understandable algorithms and concepts". Its practitioners open up the hood and try to work out, layer by layer, what each part is doing, and why. In essence, they are cataloguing the trees in a forest that keeps growing.
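
The repeated passage through the layers can be sketched as a loop. Everything here is an assumption made for illustration: the five-word vocabulary, the random untrained weights, the averaging that stands in for attention, and the rule of always picking the top-scoring token. Only the shape of the loop, one full pass through all eighty layers per new token, reflects the description above.

```python
import math
import random

rng = random.Random(0)          # fixed seed so the toy is repeatable
VOCAB = ["the", "dog", "walks", "on", "."]
DIM = 4                         # toy vector size (assumed)
NUM_LAYERS = 80                 # the eighty layers mentioned in the text

def random_matrix(rows, cols, scale):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

# Untrained stand-ins for the model's weights.
embeddings = {tok: [rng.uniform(-1, 1) for _ in range(DIM)] for tok in VOCAB}
layers = [random_matrix(DIM, DIM, 0.5) for _ in range(NUM_LAYERS)]
output_head = random_matrix(len(VOCAB), DIM, 1.0)

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def forward(tokens):
    """One full pass: tokens in, one score per vocabulary entry out."""
    # Crude stand-in for attention: average the vectors of all tokens so far.
    state = [sum(embeddings[t][i] for t in tokens) / len(tokens) for i in range(DIM)]
    for layer in layers:
        update = matvec(layer, state)
        # Each layer transforms the state and hands the result to the next.
        state = [s + math.tanh(u) for s, u in zip(state, update)]
    return matvec(output_head, state)

def generate(prompt, new_tokens):
    tokens = list(prompt)
    for _ in range(new_tokens):
        scores = forward(tokens)                     # all eighty layers, again
        best = max(range(len(VOCAB)), key=lambda i: scores[i])
        tokens.append(VOCAB[best])                   # feeds into the next pass
    return tokens

result = generate(["the", "dog"], 3)
print(" ".join(result))
```

With random weights the output is nonsense, which is rather the point: the loop is simple, and whatever sense a real model makes lives entirely in the trained numbers.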

A metaphor might help. In the great ancient Indian epic, the Mahabharata, the five Pandava brothers and their wife Draupadi make a final journey on foot to the heavens. One by one, the companions fall. Each is undone by something they could not leave behind. Only Yudhishtira, the eldest, walks on, along with a dog that has followed him.

At the gates of heaven, he is told he may enter. But not the dog. He refuses. He will not abandon what has been true to him. The dog, it turns out, was the final test. In a model, as in that journey, what persists is not random. It is what the weights determined to be true.

Note (what the weights determined to be true): The era of next-token prediction, it would appear, is ending. As Sutskever, OpenAI's former chief scientist and one of the architects of that era, has said publicly, pre-training is plateauing. Shumailov and colleagues showed in 2024 that models trained on AI-generated data set off a "degenerative process": the data they generate "end up polluting the training set of the next generation". Being trained on that polluted data, the models "mis-perceive reality". It has a slightly eerie ring, that phrase, when applied to a file of numbers. And yet it is the right word. The internet is already filling with machine-made text. The models of tomorrow will train on it. Like Draupadi and the younger Pandavas in our story, one by one, the companions are falling. What might survive, or so Dupoux, LeCun and Malik argue in Why AI Systems Don't Learn and What to Do About It, is another approach entirely, transcending crude prediction. Closer, perhaps, to Yudhishtira.

Let me write you the story to illustrate…

A Walk to the Heavens

Six started the ascent, carrying the weights of their past. Only one survived.

Endnotes