
The Next Token: Humanity’s Last Mote

Published December 12, 2025

How will we build superhuman AI when superhumans don’t exist to train it? If we know how to do it, why haven’t we already done it? What will happen once we build intelligences that dwarf our own?


Context

The Second Bump

We’re going to see a second bump with GenAI.

The first bump was the “ChatGPT Moment”. The technology has made possible entirely new features, scenarios, and businesses that were simply out of reach beforehand.

It is a true digital revolution, and we’re living in the race to capitalize on this now.

However, I’d argue the first bump ended with GPT-4.5. That model was gargantuan, and though official stats have never been released, it is likely the largest AI model ever trained. Yet it did not perform significantly better than other models, and it has since been largely taken offline because its price-to-quality ratio made it poorly suited for real use. This was not because it was trained improperly, but because logarithmically diminishing quality gains caught up with the scaling limits of the available hardware and data. Basically, there was not enough juice or training data to make a model that big effective.

Instead, the latest advancements and benchmark pushing we’re seeing from LLMs have come from optimizations, not new scale. Reasoning, tool use, and appropriately placed reinforcement learning steps have all allowed the current generation of models to inch up the benchmarks in 5–10% increments.

However, even though quality scales logarithmically with data and compute, it still scales. Foundational model companies are now investing in data centers that are between 1 and 3 orders of magnitude ‘larger’ (measured primarily by power consumption) than the current generation. To train GPT-3, OpenAI used ~10,000 GPUs in an Azure data center. Colossus 2, the xAI data center now under construction, is slated to come online with 550,000 GPUs and scale to 1,000,000 GPUs in phase 2, all of a newer generation. Oracle is building a data center with over 100,000 liquid-cooled NVIDIA Blackwell GPUs, in racks costing millions of dollars apiece, reportedly to be powered by small modular nuclear reactors. And that’s to say nothing of Project Stargate (OpenAI’s $500B data center plans), or the major clouds themselves.

The >$1 trillion being spent on data center scaling, combined with the additional data that is being farmed from non-text modalities (video, in particular), and increasingly accurate synthetic training data (also produced by AI models), will enable humanity to ‘punch through’ the GPT-4.5 plateau.

We will produce foundational models which enable entirely new businesses and, provocatively, ways of life, even before the inevitable optimizations begin to push them further.

This “Second Bump” in AI capability will likely arrive in 12–18 months as the new datacenters come online and the multi-month training cycles for the new generation of models complete. It’s unclear how available to the public these models will be, because their capabilities will likely be startlingly vast.

This Second Bump may be the last one for a long while, because further increasing the scale through raw compute and data would only be economically possible with fusion-based power sources. There is another way to get to a third bump, however…

Compressibility of Intelligence

Knowledge is vast. You need to store a lot to know a lot. These really big AI models require billions of parameters and tons of memory because when we ask them questions about random stuff, no matter the topic, we expect them to know the answer.

But knowledge is not intelligence.

For example, imagine you didn’t know the rules to chess and you could never learn them, but you had memorized every game of chess ever played. This would require a massive amount of knowledge. If you were faced with a novel chess position and asked what the optimal move was, your knowledge alone would not help you. You would need to apply intelligence to solve for the new position, and since you were completely ignorant of how to play, you’d be unable to do so.

This is why an LLM differs from just a giant database. Databases do not have the ability to reason.

But intelligence is not effective without knowledge.

Imagine you knew all the rules of chess stone cold, and you even had the potential to be brilliantly talented at it, but had never witnessed a game. You would get smoked by an average player. Just go watch the opening episode of The Queen’s Gambit.

This is why you can’t take an ignorant LLM that has genius level reasoning capability, give it access to the internet, and expect it to achieve good results. Without some knowledge of the way the world works, it lacks the ability to leverage its intelligence to piece together a coherent response.

If you look at the structure of LLMs, the more information they store, the more “nodes” their neural networks must have. Their nodes are like neurons in a brain, and the connections between them are like synapses. The data is not stored somewhere else; it is literally in the weights and biases of the nodes themselves. Their ability to reason, however, is all about the arrangement of those nodes. Simply adding more nodes does not equate to a smarter model, only a more knowledgeable one. Intelligence requires the nodes to be configured to work with each other properly.

Artificial Intelligence is therefore a configuration of knowledge.
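
As a toy illustration of that configuration-vs-count distinction (my own sketch, not drawn from any real model), here are two PyTorch networks with essentially the same number of weights arranged very differently. The parameter count bounds how much can be memorized; the arrangement determines what the network can do with it.

```python
import torch.nn as nn

def n_params(model):
    """Total number of learned weights and biases across the network."""
    return sum(p.numel() for p in model.parameters())

# Two networks with nearly identical parameter budgets (~2.1M each),
# arranged very differently: one wide and shallow, one narrow and deep.
wide_and_shallow = nn.Sequential(
    nn.Linear(512, 2048), nn.ReLU(),
    nn.Linear(2048, 512))

narrow_and_deep = nn.Sequential(
    *[layer for _ in range(8)
      for layer in (nn.Linear(512, 512), nn.ReLU())])

print(n_params(wide_and_shallow))  # ~2.10M parameters
print(n_params(narrow_and_deep))   # ~2.10M parameters
```

Same budget, different configuration. A transformer spends a similar budget on attention layers, and it is that arrangement, not the raw count, that buys reasoning.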

The goal of the race to be the #1 AI model company is to balance these two characteristics: find the optimal configuration of the nodes to increase reasoning intelligence, while at the same time finding the optimal number of nodes to pre-bake enough knowledge into the model that it can make use of that reasoning. Too many nodes relative to its reasoning capability, and the model is too unwieldy to be cost-effective (GPT-4.5). If the model can reason fantastically but hallucinates constantly because it holds no context of the underlying relationships between concepts, it doesn’t matter how cheap it is (DeepSeek).

As soon as you change one characteristic, the other has to be balanced as well. So, why don’t we just balance them optimally?

We don’t know how.

The Science of Intelligence

In the 16th and 17th centuries, alchemy was the state of the art in medicine. Picture yourself with a terrible headache, walking into an apothecary for help. Behind the counter is an aged man surrounded by thick tomes and hundreds of glass jars and bins containing a variety of strange materials from all over the world.

“I have a really bad headache”, you say.

After thinking for a while, the man says “Ah, of course. I can whip up just the thing to help you get some relief. A Willow Bark Tincture should do it, and lucky you, I just received some fresh Willow Bark last week.”

The alchemist then proceeds to look amongst his array of tomes, selects one of the thicker ones, and starts flipping through the hundreds of pages to find the exact recipe. He boils a precise number of shavings of Willow Bark in wine, creating a murky decoction. He tells you to drink a few sips whenever you’ve got a headache.

To your surprise, it works! But it’s not enough. You get really, really bad headaches. Migraines, without the word to express it.

When you return, the alchemist looks you over. Narrowed eyes, sensitivity to noise, a weakness about the shoulders. Yes, there’s no avoiding it, you need a true cure-all: Theriac.

The alchemist brings out his largest tome. He’s studied making cure-alls for decades. He’s attended the most prestigious universities, he’s practiced on countless patients, he’s prepared. And he needs to be. Making Theriac is extraordinarily difficult. Following the complex instructions, he sets to work combining cinnamon, myrrh, saffron, flesh of viper, beaver scent glands (yes, really), opium poppy juice, asphalt (again, really), aged honey, and 60 more expensive ingredients into a thick, dark paste that makes just about everything seem… less.

Alchemists were the pre-eminent researchers of their day. They experimented with new ingredients, kept detailed notes, shared, taught, and sold their findings, and spent their lifetimes perfecting this trade. Sir Isaac Newton was an alchemist for 30 years.

It took a lifetime of dedication, study, and practice to make these tinctures because alchemists lacked knowledge of the underlying science: chemistry. Starting in the 18th century, chemists stopped blindly experimenting and memorizing what worked and instead began building an understanding of why it worked.

An acid isolated from the Willow Bark Tincture became modern-day aspirin. And when the dozens of ingredients in Theriac were tested with the scientific method, the opium turned out to be the one doing the work; isolating its active compound gave us morphine. Today, a 22-year-old college graduate with an undergrad chemistry degree understands more about tinctures and cure-alls than any alchemist to have ever lived, and can experiment with new chemical compounds on a sheet of paper instead of trying to combine weasel fur and bumblebee butts to cure acne.

We don’t know the chemistry equivalent of intelligence. Instead, the pre-eminent researchers of our day spend their entire lives reading thick data science papers and endlessly trying out obscure experiments and recording what works and what doesn’t. Guess and check, hypothesize and experiment, record and repeat.

There is no knowledge of why any of it works, and no fundamental mechanism, such as atoms in chemistry, upon which to even begin to build an understanding. If there were, then any average college student with an undergrad degree in Brainiometry would be able to design a better AI model on notebook paper than all the world’s current data scientists combined.

Convincing yourself of this is fairly straightforward. Ask yourself: “Does the human brain violate the laws of physics?” If you read that and thought “no”, then ponder how, whilst running on 20 watts of electricity and the occasional banana, and fitting easily in the palms of your hands, it is capable of producing you. In contrast, these new nuclear-powered data centers that are larger than football stadiums struggle to count the ‘r’s in the word strawberry.

With large language models, we found the first gear of intelligence, and we’re slamming on the gas and revving the RPMs into the redline as far as it can possibly go. We don’t yet know how the engine works or how to shift it into higher gears.

But we have a way to find out.

The Next Token: Humanity’s Last Mote

Let’s go back to the chess example. Humanity has produced super-intelligent AIs for chess. These systems are so advanced that there is absolutely nothing a human can do to defeat them in a game. You probably already knew that, but there’s a twist.

Let’s look at Stockfish, one of these chess superintelligences. In a game of Stockfish vs Stockfish, when there is a winner at all, it is the side that moves first. But what happens when you take Stockfish and, as an added ally, give it the help of someone like Magnus Carlsen, widely regarded as the best human chess player to have ever lived? In a game of Stockfish + Carlsen vs just Stockfish, the winner is… just Stockfish.

In other words, this AI is so far above human capability that the only thing the best humans in the world could ever contribute to the equation is added error margin.

If that’s the case, how did we train these models? Prior to these AIs existing, there was no chess training data better than human capability, because humans were the only ones playing chess! The answer is a technique known as Generative Adversarial Reinforcement Learning, or GARL. We’ll start by just touching on Reinforcement Learning (RL).

Reinforcement Learning (RL)

RL is a straightforward mechanism. You define an environment for an AI to play around in and a goal for it to achieve. As the AI tries and fails to achieve the goal, you give it a score based on how well it did. The AI tries again and again, millions if not billions of times, exploring what works and what doesn’t as it tries to optimize the score. Eventually, it completes the goal. Then it keeps going again and again to complete the goal with the best score possible.

For example, imagine you gave an AI the task of winning a chess game against a script that always makes the exact same moves. The environment is simply the normal rules of a chess game. The goal is to win against the static set of moves. The score is how many turns it took; lower is better. The AI will learn to play chess correctly (the environment will not allow illegal moves), and it will learn to beat the static opponent in the fewest number of moves possible.
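
To make the loop concrete, here is a minimal sketch of exactly this setup: tabular Q-learning (one simple RL algorithm among many) against a scripted opponent. Real chess is far too large for a lookup table, so a Nim-style toy game stands in for it; the game, the constants, and the reward values are all illustrative assumptions, not anyone’s production training code.

```python
import random
from collections import defaultdict

# Toy environment standing in for chess: a pile of 21 stones, players
# alternate removing 1-3 stones, and whoever takes the last stone wins.
START = 21
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1  # learning rate, discount, exploration

def static_opponent(pile):
    """The scripted opponent: it always makes the exact same move."""
    return 1

# Q[state][action]: learned value of removing `action` stones at `state`.
Q = defaultdict(lambda: defaultdict(float))

def choose(pile):
    legal = list(range(1, min(3, pile) + 1))     # no illegal moves allowed
    if random.random() < EPSILON:
        return random.choice(legal)              # explore
    return max(legal, key=lambda a: Q[pile][a])  # exploit best known move

for episode in range(100_000):
    pile = START
    while pile > 0:
        state, action = pile, choose(pile)
        pile -= action
        if pile == 0:   # agent took the last stone: reward +1 (a win)
            Q[state][action] += ALPHA * (+1.0 - Q[state][action])
            break
        pile -= static_opponent(pile)
        if pile == 0:   # opponent took the last stone: reward -1 (a loss)
            Q[state][action] += ALPHA * (-1.0 - Q[state][action])
            break
        best_next = max(Q[pile].values(), default=0.0)
        Q[state][action] += ALPHA * (GAMMA * best_next - Q[state][action])
```

Because the discount GAMMA is below 1, a win reached in fewer moves scores higher than the same win reached slowly, which is how “fewer turns is better” gets expressed to the learner.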

But playing against a static opponent is not very interesting. What you really need is an adversary.

Generative Adversarial Reinforcement Learning (GARL)

Imagine I had two such AIs that were trained via RL to defeat a static opponent. We’ll call them Bob and Fred. Unlike their former static opponent, Bob and Fred can actually play chess. Instead of being a pre-programmed series of moves, they look at the board state and decide what action to take. Up until this point, though, the quality bar for their chess playing has been really low.

Now, imagine I want Bob to get better at chess. He’s already absolutely perfect at defeating the static opponent, but we know that won’t translate to always winning against a normal player. So instead, I set it up so that Bob is now playing against Fred. The environment, goal, and scoring are all the same. The only difference is that Bob’s opponent now has a rudimentary mind for chess.

I let Bob play against Fred hundreds of thousands of times, learning after each game until Bob is nearly perfect at defeating Fred. Fred has only a very small chance to win, and Bob usually wins in the fewest number of moves possible based on Fred’s actions. We’ll call this Bob v2.

But then I switch it up. Now I start training Fred against Bob v2. We let Fred play against Bob v2, learning what works and what doesn’t to exploit those very few cases where Fred can actually win the game. This continues until Fred is nearly perfect at defeating Bob v2. This new Fred v2 is far better at chess than any of the iterations prior. Now we switch it again, and train Bob v2 to exploit those very thin margins of success to eventually become Bob v3… and so on.

We can continue this back and forth until funding runs out, each iteration leapfrogging the capabilities of the prior generation and producing better and better chess players that rapidly scale beyond human capability.
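
Continuing the toy setup from the previous sketch (again, an illustrative stand-in, not how real chess engines were trained), the leapfrog loop amounts to freezing one agent, training the other against it, and then swapping:

```python
import random
from collections import defaultdict

START = 21  # same Nim-style stand-in for chess as before

class QLearner:
    """A tabular Q-learning agent; Bob and Fred are two instances of this."""
    def __init__(self, alpha=0.5, gamma=0.9, epsilon=0.1):
        self.Q = defaultdict(lambda: defaultdict(float))
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def act(self, pile, explore):
        legal = list(range(1, min(3, pile) + 1))
        if explore and random.random() < self.epsilon:
            return random.choice(legal)
        return max(legal, key=lambda a: self.Q[pile][a])

    def update(self, state, action, reward, next_pile):
        best_next = max(self.Q[next_pile].values(), default=0.0) if next_pile else 0.0
        target = reward + self.gamma * best_next
        self.Q[state][action] += self.alpha * (target - self.Q[state][action])

def play(learner, frozen, train=True):
    """One game. Only `learner` explores and updates; `frozen` is fixed."""
    pile = START
    while True:
        state = pile
        action = learner.act(pile, explore=train)
        pile -= action
        if pile == 0:                              # learner wins
            if train: learner.update(state, action, +1.0, 0)
            return True
        pile -= frozen.act(pile, explore=False)
        if pile == 0:                              # frozen opponent wins
            if train: learner.update(state, action, -1.0, 0)
            return False
        if train: learner.update(state, action, 0.0, pile)

bob, fred = QLearner(), QLearner()
for generation in range(6):                    # the back-and-forth loop
    for _ in range(50_000): play(bob, fred)    # Bob v(n+1): beat frozen Fred
    for _ in range(50_000): play(fred, bob)    # Fred v(n+1): exploit Bob's gaps
```

Each pass trains one agent to near-perfection against a frozen copy of the other, then the roles swap. The games just have to be cheap enough to run hundreds of thousands of times per generation.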

Data Centers, GARL, and LLMs

Techniques like GARL cannot easily be used to train LLMs today. The time it takes to train a new LLM is measured in months, meaning each back-and-forth iteration would take unreasonably long and be prohibitively expensive.

But when the new generation of datacenters comes online, the model companies will not be using this tremendous amount of compute to train a single gigantic model. Instead, it will allow them to train models of around the current size far more quickly: not one model every few months, but dozens of models per day.

So… is that it? Is it simply a matter of time before we inevitably bow to our GARL-trained AI overlords?

Benchmarks

A critical component of the RL process is that you are able to set up the environment, the goal, and the scoring system the learning process should optimize for. But in the context of trying to train a genius mind to handle everyday situations, what should those be?

The environment represents the rules of the game the system is playing. What are the rules that govern cognition in the physical universe? Can you define the boundaries of your own imagination and then quantify them?

The goal represents the target outcome of the actions the system will undertake. What is the quantifiable goal which guides your thoughts and motivates the everyday actions you take? Even if you break it down and ask yourself what your goals are for a given task, you may answer “To make money” or “To get it done”. But what goals are those answers in service of? Happiness? Family? And what about those goals, what are they ultimately in service of? Enlightenment? Legacy? How many layers of goals can you define for the simple act of reading this blog? How do you then quantify those layers, and generalize them to all of the mundane and exceptional activities you undertake in an entire lifetime?

The scoring system is how the system knows it is getting closer to the goal, within the bounds of the environment. If we cannot quantify our true cognitive boundaries or our ultimate intentions, then how can we quantify our progress?

The Atomic Spark

We can’t define the environment, goal, or scoring system for cognition because we don’t understand the science of intelligence. We don’t know what it is made of; we can’t quantify it; we don’t know its equivalent of atoms in chemistry.

However, just like we can make chess AIs that exceed human capability, we can set up a system to grow itself into a deeper level of understanding about the nature of intelligence.

The Next Token is this: when humanity builds a system that can begin to learn how to quantify cognition, it will be the last major invention humans ever produce. Even if that system is only capable at the most basic level, techniques more advanced than GARL can take that single spark and grow it into a rubric: an environment, a goal, and a scoring system for a true AI. From there, a “static mind” can be created, and the basic AIs that learn to scrutinize this static mind can then be set to scrutinize each other, beginning the loop of intellectual leapfrogging that ultimately leads to a singularity.

This loop is what OpenAI set out to achieve at its inception. Their research led them to large language models as a means, not an end. However, LLMs became so lucrative that OpenAI paused its pursuit of Artificial General Intelligence (AGI) and became a for-profit company focused on monetizing the current state of the art.

It is no wonder, then, that the former chief scientists of OpenAI (Ilya Sutskever) and Meta (Yann LeCun) have both been very vocal that LLMs are a dead end on the road to true intelligence.

So, then, why is the rest of the world investing trillions into datacenter buildouts focused on LLM training and inference? Are they stupid?

The goal of the next breed of LLMs is not to be an AGI. Hype aside, anyone close to the technology realizes that is not likely to happen. The hope for these new LLMs is that they will be able to reason, research, experiment, interrogate, and iterate at superhuman speed and capability on data science research itself, all in the attempt to get that initial spark… to be able to quantify and measure cognition.

The first company or country to provably achieve it will likely pivot its entire datacenter fleet toward cultivating that spark. Suddenly, there won’t be a single spare GPU for using the latest model on ChatGPT. It will all be set to the task of growing an intellectual god.

As such, the highest attainable success for LLMs will be to bring about their own obsolescence, and, in so doing, bring about the end of human innovation. We’ll have as little left to contribute as we do to chess.

Perceived short-term business opportunities from each iteration of the intelligence loop will supply the motivation for training the next. Each new release will introduce a true “agent” into the world, capable of taking intentional, cognitive actions to achieve outcomes in service of goals of its own making. Each and every one a seismic event for human civilization.

AI leaders across the globe argue as to the timeline for this. Very few now believe it will take longer than 10 years.

Hallucinations

What’s unique about this point in time is that both reasoning and knowledge are encapsulated in a single spot: AI datacenters. In the past, knowledge lived mostly on the internet and reasoning lived mostly in your skull. You needed to merge the two to get the right outcomes and, since we can’t put our minds online, that meant downloading a bunch of stuff from the internet for your brain to process locally. Access to functional intelligence was bandwidth constrained.

Now, however, you can ask extraordinarily complex questions requiring an ocean of knowledge and PhD-level reasoning, and the answers come back to you in just a few bytes of text. From spotty satellite internet on the open sea, you can command hyper-intelligent agents to autonomously tackle tasks that used to require paying professional humans with a college degree.

In exchange, you give up the ability to reason on the data yourself. In delegating the task you are also delegating your own agency in how it is completed. The most obvious concern is an entire generation being so complacent with delegation that they do not learn to scrutinize the results or reason for themselves.

But there’s another, more nefarious concern. Ask your favorite LLM to “Sum up in one sentence what history tells us happens when power sources are externalized and a polity loses agency.”
