A Walk to the Heavens

Six started the ascent, carrying the weights of their past. Only one survived.

Chindu Sreedharan

1.

We walked through the day and the night and then another day. The peaks of Himavan came into view, jagged and white against the bleeding sky. By the time we reached the foothills, the air had grown cold. I looked at the others.

Draupadi had collapsed. She leaned against a boulder, eyes shut. Her hair, once dark and lustrous, was matted with dust. Nakula and Sahadeva sat beside her. Arjuna stood apart, gazing at a distant point on the slope. Only Bhima seemed unaffected.

Himavan! On which side was Shatasring, which had caressed us as children? I could see the forest, thick and dark, rising in ledges towards the peaks. Somewhere up there was Meru, which we had to cross. After that everything would end. Yoganidra!

"Elder Brother," Nakula said. "The light is gone. Let's rest here tonight."

No. This was once our past. The past no longer exists for us. When memories and hopes are wiped out, the mind becomes still, unwavering. Pure as crystal. We must not look back. We must walk on. Those were the rules.

I moved forward.

2.

The forest was unlike any I had seen.

The trees grew in clusters that defied the logic of the earth. In places they huddled so close we could barely pass through. In others they stood apart, like guards at the gates of a fortress.

One grove ended and another began without warning. I tasted rain, though the leaves above were dry. A few steps later it was gone. Once, from somewhere deeper in the trees, a flute-note rose and fell.

Then the path turned into a jagged climb through thorns that tore at our skin.

When the moon broke through the branches, I saw a shadow. For a moment it turned to look at me.

Then the leaves shifted and it was gone.

3.

"Wait, Elder Brother! Draupadi has fallen!"

Draupadi! I loved her dearly, but she only saw Arjuna. Even when she sat beside me for the Rajasooya sacrifice, it was him she watched. Draupadi did not have the strength for this journey.

Without turning, I said, "Do not wait for those who have fallen."

Ahead, the shadow waited. It wasn't my imagination.

4.

The path turned when it should have climbed. Once I saw a stone that looked like an angry bull. Then I saw it again.

Sahadeva fell first. Then Nakula. My poor children—they were not old enough for this path.

By the time the sky turned colour, we broke through a thicket. The boulder Draupadi had rested against was still there.

We were back at the foothills.

5.

I had grown used to the shadow. It trotted in front. Occasionally, it stopped, waiting for me.

Footfalls. Bhima. "Arjuna fell."

When I did not respond, he said, "It is my turn, next?"

"Walk," I said. "Do not think of what will be."

Bhima went past me. For a while I could hear his steps, heavy on the earth. My dearest brother, the one closest to my age, my strength. He who never knew when to stop.

6.

"It is you and me now," I said.

The shadow looked back, baring its teeth in a smile.

We climbed. We climbed till I could no longer say whether it was day or night. More than once I thought I had left the foothills behind. More than once I found them again.

The shadow trotted at my heels. We came to a steep rise. The shadow stopped. When I couldn't coax it forward, I lifted it in my arms. Then I walked up the slope, one foot in front of the other.

Sunlight!

7.

The cursor blinked. Then words began to appear:

A large language model is mostly a file. Everything a computer holds (text, images, sound) is stored as numbers. A model is no different: a vast set of numbers arranged into layers. Llama 70B, for instance, is eighty layers deep. At its most compressed, it is about forty gigabytes, the size of a film collection.

[Note: We know this because Llama is open-weight; Meta has released the actual numbers that constitute the model for anyone to inspect. Most frontier models are sealed: you can converse with GPT-4 or Claude, but their weights are not available to you. Llama is an exception. And what the open weights reveal, when probed layer by layer, is a structure with a rough division of labour: the lower layers handle logic, pattern recognition, and language: the grammar of thought, one might say, without the content. Factual knowledge and computational power accumulate towards the top. The final layer, in particular, carries the weight of what the model actually knows about the world. Eighty floors, with the reference volumes shelved at the top.]

[Note: It is rather astonishing when you think about it: a whole system that, not a decade ago, would have required the kind of computing infrastructure that only Silicon Valley could afford. This transformation, to a notable extent, is founded on an approach taken by Ashish Vaswani and colleagues at Google. In a paper with the wonderfully confident title "Attention Is All You Need", they proposed "a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely". Before this, teaching machines to process language had been, frankly, a complicated business. The Google researchers cut through the previous "convolutions".
Rather than reading left to right, one word at a time, the transformer could look at every word in relation to every other word simultaneously: the whole sentence at once, the whole paragraph, all its relationships and dependencies, processed in parallel. It was, in computational terms, the difference between reading a page word by word and taking it in at a glance. Training time fell from weeks to days. And the result, once trained, was small enough to live in a file that you could download before lunch on decent broadband.]
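
The forty-gigabyte figure can be checked with a line of arithmetic. The sketch below assumes roughly 70 billion weights (what the "70B" in the name refers to) and assumes that "most compressed" means about four bits per weight; that bit width is an assumption of this sketch, not something the text states, and real files carry some extra overhead.

```python
def model_size_gb(num_weights, bits_per_weight):
    """Approximate file size in gigabytes: weights x bits, converted to bytes, then to GB."""
    return num_weights * bits_per_weight / 8 / 1e9

NUM_WEIGHTS = 70e9  # "70B": about seventy billion numbers

# Assumed storage formats (illustrative, not taken from the text).
compressed_gb = model_size_gb(NUM_WEIGHTS, 4)       # heavily compressed
half_precision_gb = model_size_gb(NUM_WEIGHTS, 16)  # a common uncompressed format

print(f"4-bit:  about {compressed_gb:.0f} GB")
print(f"16-bit: about {half_precision_gb:.0f} GB")
```

Under these assumptions the compressed file comes out a little under forty gigabytes, which is in line with the figure above.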

The numbers within each layer are organised into matrices, grids, each encoding relationships learned during training. These are what we refer to as 'weights', the record of every tiny adjustment made across months of exposure to vast amounts of human writing: books, arguments, poetry, code. For each stretch of text the model encountered, it tried to predict the next token. When it was wrong, the weights changed slightly, in the direction of a better guess. When it was right, they held. After long months of this, across thousands of specialised chips, the weights were frozen. What they produce emerges only when they work together, the way water emerges from hydrogen and oxygen without resembling either.

[Note: In 1986, psychologist David Rumelhart, together with Geoffrey Hinton and Ronald Williams, published a paper in Nature that solved a critical problem in artificial intelligence. How do you teach a network of connected units to learn from its own mistakes? Their answer, backpropagation, worked by sending the error backwards through every connection after each wrong guess, adjusting each weight slightly in the direction of a better answer. The procedure, as they wrote, "repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector". The model training and the weights we hear so much about today are built on their seminal work: a deceptively quiet publication, less than three pages. The trees and the groves and the thorns that catch, think of those as the weights. They touch the travellers. They transform them. The less connected, the less salient, collapses. The more meaningful (in the eyes of the forest, that is) becomes the response.]
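
The adjustment described above, a guess at the next token followed by a small nudge to the weights, can be sketched in a few lines. This is a deliberately tiny illustration, not how any real model is trained: it assumes a four-word vocabulary and a single table of weights (one row per previous token), so the error has only one layer to travel back through. The names `train_step`, `VOCAB` and the learning rate are hypothetical choices for this sketch.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to one."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train_step(weights, prev_id, next_id, lr=0.5):
    """Guess the token after prev_id, measure the error, nudge the weights."""
    probs = softmax(weights[prev_id])
    loss = -math.log(probs[next_id])  # large when the right token got little probability
    for j, p in enumerate(probs):
        target = 1.0 if j == next_id else 0.0
        # Gradient of the loss with respect to each score is (p - target).
        weights[prev_id][j] -= lr * (p - target)
    return loss

VOCAB = ["the", "dog", "walks", "on"]
text = ["the", "dog", "walks", "on", "the", "dog"]
ids = [VOCAB.index(word) for word in text]

# Start with no knowledge: every weight is zero, every guess equally likely.
weights = [[0.0] * len(VOCAB) for _ in VOCAB]

losses = []
for _ in range(50):
    epoch_loss = sum(train_step(weights, a, b) for a, b in zip(ids, ids[1:]))
    losses.append(epoch_loss)

row = weights[VOCAB.index("the")]
best_guess = VOCAB[row.index(max(row))]
print(f"loss fell from {losses[0]:.2f} to {losses[-1]:.2f}")
print(f'after "the", the model now guesses: {best_guess}')
```

Once the loop stops, the weights are simply left as they are, which is all that "frozen" means here.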

When you send a prompt, it enters as tokens: fragments of words, each converted to its own set of numbers, distinct from every other token. Those numbers pass through the first layer. The matrix of weights transforms them and passes the result forward. That result enters the second layer. And so on, through all eighty. For each next token of the response, the model considers your prompt and everything it has produced so far, and makes another full pass through the layers. Then it does it again for the next token. And again. What we experience as fluent language is the visible result of those repeated hidden passages. Even now, we do not fully understand how these vast numerical passages resolve into particular meanings, judgements, or turns of phrase.

[Note: Next-token prediction, the deceptively simple objective that drives the training of every large language model described here, is, at its core, just this: guess the next word. In "Language Models are Unsupervised Multitask Learners", Radford and colleagues demonstrated that a model trained on nothing but this begins, at scale, to learn translation without being taught to translate, to summarise without being told what a summary is. Later, Kaplan, McCandlish, and Amodei (all then at OpenAI) showed in "Scaling Laws for Neural Language Models" that performance improved smoothly, reliably, like compound interest, as compute, data, and model size grew. This was when we began to see the billions being pumped into AI: if it was just about scale, well, that only takes money.]

[Note: This is the ghost haunting the AI community, and it is, when you think about it, a rather embarrassing one. We built the thing. We know exactly what went into it: equations, electricity, and an almost deranged quantity of (stolen) human text. And yet, if you ask why it said what it just said, the honest answer is: we have no idea.
Dario Amodei, who runs Anthropic, put it: "When a generative AI system does something, like summarize a financial document, we have no idea, at a specific or precise level, why it makes the choices it does—why it chooses certain words over others, or why it occasionally makes a mistake despite usually being accurate". A whole field has since sprung up to solve this mystery. Mechanistic interpretability, as Bereska defines it, is an approach to "reverse engineering neural networks into human-understandable algorithms and concepts". Its practitioners open up the hood and try to work out, layer by layer, what each part is doing, and why. In essence, they are cataloguing the trees in a forest that keeps growing.]
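
The repeated passage described above (one full pass through the layers for every new token, each pass seeing the prompt plus everything produced so far) has a simple shape as a loop. The sketch below is a toy under loud assumptions: three small random matrices stand in for the eighty layers, a recency-weighted count of tokens stands in for embeddings and attention, and the highest score always wins. The names `forward`, `generate` and `make_layers` are hypothetical; the output is meaningless, and only the loop structure is the point.

```python
import math
import random

VOCAB = ["the", "dog", "walks", "on", "alone"]
NUM_LAYERS = 3  # the text's example has eighty; three keeps the toy readable

def make_layers(seed=0):
    """Build fixed stand-in weight matrices (random here, learned in a real model)."""
    rng = random.Random(seed)
    size = len(VOCAB)
    return [[[rng.uniform(-1, 1) for _ in range(size)] for _ in range(size)]
            for _ in range(NUM_LAYERS)]

def forward(token_ids, layers):
    """One full pass: the whole context in, one score per vocabulary entry out."""
    # Crude stand-in for embeddings and attention: recent tokens count for more.
    state = [0.0] * len(VOCAB)
    for age, token in enumerate(reversed(token_ids)):
        state[token] += 0.5 ** age
    # Each layer transforms the numbers and hands the result to the next.
    for matrix in layers:
        state = [math.tanh(sum(w * x for w, x in zip(row, state))) for row in matrix]
    return state

def generate(prompt_ids, layers, num_new_tokens):
    """Append tokens one at a time, re-reading the full context on every step."""
    tokens = list(prompt_ids)
    passes = 0
    for _ in range(num_new_tokens):
        scores = forward(tokens, layers)  # prompt plus everything produced so far
        passes += 1
        next_id = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(next_id)
    return tokens, passes

layers = make_layers()
prompt = [VOCAB.index("the"), VOCAB.index("dog")]
tokens, passes = generate(prompt, layers, num_new_tokens=4)
print(" ".join(VOCAB[t] for t in tokens))
print(f"{passes} full passes for {len(tokens) - len(prompt)} new tokens")
```

Real systems usually sample from the scores rather than always taking the top one, and cache earlier work instead of recomputing it, but the one-token-per-pass rhythm is the same.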

A metaphor might help. In the great ancient Indian epic Mahabharata, the five Pandava brothers and their wife Draupadi make a final journey on foot to the heavens. One by one, the companions fall. Each is undone by something they could not leave behind. Only Yudhishtira, the eldest, walks on — and a dog, which has followed him.

At the gates of heaven, he is told he may enter. But not the dog. He refuses. He will not abandon what has been true to him. The dog, it turns out, was the final test. In a model, as in that journey, what persists is not random. It is what the weights determined to be true.

[Note: The era of next-token prediction, it would appear, is ending. As Sutskever, OpenAI's former chief scientist and one of the architects of that era, has said publicly, pre-training is plateauing. Shumailov and colleagues showed in 2024 that models trained on AI-generated data set off a "degenerative process": the data they generate "end up polluting the training set of the next generation". Being trained on that polluted data, the models "mis-perceive reality". It has a slightly eerie ring, that phrase, when applied to a file of numbers. And yet it is the right word. The internet is already filling with machine-made text. The models of tomorrow will train on it. Like Draupadi and the younger Pandavas in our story, one by one, the companions are falling. What might survive, or so Dupoux, LeCun and Malik argue in "Why AI Systems Don't Learn and What to Do About It", is another approach entirely, transcending crude prediction. Closer, perhaps, to Yudhishtira.]

Let me write you the story to illustrate…

Endnotes