An interactive primer

How LLMs Actually Work

Trace one message through the whole machine — then take every piece apart with your own hands. No math required; bring only curiosity.

18 min read · Interactive · Updated Jun 2026

Reader's contract. You are smart and curious but you are not an ML engineer, and you don't want to become one. You want to understand — well enough to look a founder or a researcher in the eye and know whether their claim holds water. This document leads with pictures and analogies, defines every piece of jargon the moment it appears, and never makes you read a wall of text when a diagram would do. Math stays in the basement; we'll only come upstairs for it when a number actually changes how you think.

On-ramp Trace one message through the whole stack

Watch the whole machine run once

Before we take anything apart, let's watch the whole machine run once, end to end, on a single real message. Everything in the rest of this document is just a zoom-in on one of these steps. Keep this picture in your head; we'll hang every later idea off it.

You type into a chat box:

"How many r's are in strawberry?"

You hit enter. To you it feels instant and obvious. To the model, your sentence is about to go through five transformations before a single word comes back, and a sixth step then loops the whole process for the next word. Here is the journey.

One message, six stages.

Everything later in this article is a zoom-in on exactly one of these boxes. The model never sees letters — only the chunks in stage 1.

STAGE 1

Tokenize

Split the sentence into chunks.

STAGE 2

Embed

Each chunk becomes a vector.

STAGE 3

Attention

Tokens read each other.

STAGE 4

Predict

Rank every possible next token.

STAGE 5

Sample

Pick one (temperature decides how boldly).

STAGE 6

Output, loop

Append it, run again — one token at a time.

Illustrative pipeline. Each stage gets its own section below.

Stage 1 — Tokenize. The model can't read letters. The first thing that happens is your sentence gets chopped into tokens — chunks of text, usually a word or a fragment of a word, that the model was taught to recognize as atomic units. "How", " many", " strawberry" might each be a token; longer or rarer words get split into pieces. Crucially, the model now sees chunks, not letters — remember this; it's the whole reason the strawberry question is hard for it.

Stage 2 — Embed. Each token is turned into a long list of numbers — a vector, which is just a coordinate that places the token somewhere in a vast "meaning-space." Tokens with similar meanings land near each other. The word is gone; a position in space has replaced it.

Stage 3 — Attention / Transformer. Now the model lets every token look at every other token and decide which ones matter for understanding it. This is attention, and it's the engine. "r's" looks back at "strawberry" and "how many" to figure out what's being counted. This happens in stacked layers, each one refining the picture.

Stage 4 — Predict. After all that looking-around, the model produces one thing and one thing only: a giant ranked list of every possible next token, each with a probability. It is, at heart, the world's most sophisticated autocomplete. For our prompt, the top candidates might be "There", "The", "Straw…", each with a score.

Stage 5 — Sample. From that ranked list, the model samples — picks one token. How adventurously it picks is controlled by a dial called temperature (we'll play with it later). Pick the safe top choice, or roll the dice on a lower-ranked one.

Stage 6 — Output, then loop. The chosen token is shown to you, then appended to the input, and the whole pipeline runs again to pick the next token. And again. One token at a time, looping, until it produces a special "I'm done" token. That streaming you see in a chat window? That's this loop, live.

The punchline you should already feel: the model never "counts the r's." It pattern-matches its way to an answer, token by token, having never seen the individual letters in "strawberry" at all. That's not a bug in one model — it's a direct consequence of Stage 1. By the end of Section A you'll understand exactly why, and you'll never be fooled by a "the AI can't spell" headline again.

Now let's earn that understanding. We'll walk the same pipeline again — slowly, properly — as the life story of a model: how one is born, and then how it's used.

Section A The lifecycle of a modern LLM

Born, shaped, then used

Here's the spine of this whole section, the story arc we're about to tell. A model isn't programmed. It's grown, then shaped, then used. Eight stages, one continuous story. Everything up to the final stage happens once, in a data center, over months — the BIRTH of the model. The last stage, inference, happens every single time you send a message — the model's LIFE. Let's go.

A1 Tokenization

Teaching the model an alphabet of its own

A model can't see text the way you do. Before anything else, we have to convert writing into numbers, and the very first decision is: what is the smallest unit the model is allowed to see?

The naive answer is "letters." It fails: spelling out everything letter by letter makes sequences impossibly long and throws away the obvious fact that "running" and "runner" share a root. The other naive answer is "whole words." That fails too: there are millions of words, names, and typos, and the model would be helpless the first time it met a word it had never seen.

The field's answer is a beautiful compromise called subword tokenization — most commonly a scheme named Byte-Pair Encoding (BPE).¹ The idea: start with small units, then repeatedly glue together the pairs that show up most often, until you've built a vocabulary of common chunks. Frequent words ("the", "strawberry") become single tokens; rare words get assembled from a few pieces ("tokenization" → "token" + "ization"). Modern models run BPE not on characters but on raw bytes (byte-level BPE, introduced with GPT-2²) — which is what guarantees that anything the model meets, even a word it's never seen, an emoji, or a stray symbol, can always be spelled out from smaller fragments as a last resort. Nothing is ever un-representable.
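To make the "glue the most frequent pair" idea concrete, here is a minimal sketch of the merge loop in Python. It is a toy under stated assumptions: it works on characters rather than bytes, the five-word corpus is made up, and the names train_bpe, merge_pair, and tokenize are hypothetical helpers, not any real tokenizer's API.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Glue every adjacent occurrence of `pair` into a single symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(words, num_merges):
    """Learn an ordered list of merges from a list of words (toy version)."""
    # Every word starts out as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count each adjacent pair, weighted by how often the word occurs.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges

def tokenize(word, merges):
    """Split a word into chunks by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

merges = train_bpe(["low", "low", "low", "lower", "lowest"], num_merges=2)
print(merges)                      # the two learned merges
print(tokenize("lowest", merges))  # a common chunk plus leftover letters
```

With only two merges learned, "lowest" comes out as low + e + s + t: one familiar chunk and a few leftover pieces. A word the toy never saw, like "slower", still gets spelled out from fragments.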

See it as the model sees it.

Type anything. Watch words land as a chunk or two — not as letters. That gap is why it miscounts r's.

Token splits are precomputed and illustrative; real tokenizers vary by model. IDs shown for "strawberry": straw = 15140, berry = 19772.

This one design choice has consequences that ripple through everything:

It's why models miscount letters. "Strawberry" arrives as a chunk or two, not as ten letters. Asking the model to count r's is like asking you to count the serifs in a word you read at a glance — the information was never in your conscious view. Founders who demo "our model can finally spell!" are usually just bolting a calculator-like tool onto the side; the core limitation is structural.
It's why tokens, not words, are the unit of pricing and context. When a lab says a model has a "200,000-token context window," that's tokens, not words — roughly 150,000 English words, but far fewer for code or other languages, where text fragments into more tokens.

Context window = what the model can see

A model can only answer from what's inside its window. Shrink it, and the earliest facts simply vanish.

CONVERSATION

Context window size: 6 turns

The fact never changed. The model just can't see it once it scrolls out of the window. This is why long chats 'forget' the start.

Illustrative. Real windows are 100k–1M+ tokens; the failure mode is identical, just further out.
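The "scrolling out of the window" effect can be sketched in a few lines of Python. This is a simplification under stated assumptions: it counts turns rather than tokens (real systems budget in tokens), and visible_turns and can_answer are hypothetical names invented for the example.

```python
def visible_turns(conversation, window_size):
    """Return only the most recent `window_size` turns; older ones are dropped."""
    if window_size <= 0:
        return []
    return conversation[-window_size:]

def can_answer(conversation, window_size, fact):
    """True if the fact still appears somewhere inside the visible window."""
    return any(fact in turn for turn in visible_turns(conversation, window_size))

chat = [
    "User: My dog is named Biscuit.",
    "Assistant: Lovely name!",
    "User: What should I cook tonight?",
    "Assistant: Maybe pasta.",
    "User: Any dessert ideas?",
    "Assistant: Try a lemon tart.",
    "User: Thanks.",
    "User: What is my dog's name?",
]

print(can_answer(chat, 8, "Biscuit"))  # whole chat fits in the window
print(can_answer(chat, 6, "Biscuit"))  # the first two turns have scrolled out
```

The fact is identical in both calls. Only the window changed.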

It's why some languages cost more. BPE vocabularies trained mostly on English text fragment other scripts into many small tokens, so the same sentence in, say, Thai or Hindi can cost several times more tokens (and therefore more money and more of the context window) than in English.¹¹

So: we've turned writing into a stream of token-IDs. But an ID is just a name tag — the number "5176" tells the model nothing about what "strawberry" means. That's the next problem.

A2 Embeddings

Giving every token a place in meaning-space

A token-ID is arbitrary. We need to convert each one into something that actually carries meaning. The trick: represent every token as a long list of numbers — a vector — that you can think of as coordinates in a high-dimensional space of meaning. (High-dimensional just means "lots of coordinates" — hundreds or thousands per token, instead of the three we live in. Don't try to picture it literally; picturing 3-D and trusting the math is enough.)

12,288

coordinates in a single GPT-3 token vector. We flatten it to 2 below — the intuition survives.

The magic property: the model learns these coordinates so that tokens with similar meaning land near each other. "King" sits near "queen." "Paris" sits near "France." This learned vector is called an embedding. Directions in the space can even encode relationships — the famous result that king − man + woman ≈ queen.³ (One honest caveat: that clean piece of arithmetic comes from an earlier, static kind of word embedding — word2vec, 2013 — where each word has one fixed vector. The token embeddings inside an LLM use the same near-means-similar idea, but they don't stay fixed: the very next stage adjusts each one based on context. So treat king−queen as the intuition pump it is, not a literal operation happening inside GPT.)

Words become coordinates.

Related words land together. Watch the arithmetic: take king, apply the same step that turns man into woman, and you arrive at queen. The offset itself carries the meaning.

Illustrative; real embeddings have thousands of dimensions, flattened here to two. The king−man+woman result comes from static word2vec embeddings, not from inside an LLM.

Why does this matter so much? Because once meaning is geometry, reasoning starts to look like arithmetic the machine can actually do. The model isn't shuffling words; it's moving points around in a space where "closer" means "more related." Every later stage operates on these vectors, never on the text.
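Here is what "meaning as geometry" looks like as a small Python sketch. The 2-D coordinates are hand-picked for illustration (real embeddings are learned and have thousands of dimensions), closeness is measured with plain straight-line distance for simplicity, and nearest and analogy are hypothetical helper names.

```python
import math

# Hand-picked 2-D coordinates, purely illustrative. Real embeddings are learned.
vectors = {
    "man":   (1.0, 1.0),
    "woman": (1.0, 3.0),
    "king":  (4.0, 1.0),
    "queen": (4.1, 2.9),
    "apple": (-2.0, 0.5),
}

def nearest(target, exclude=()):
    """Return the word whose vector lies closest to `target`."""
    candidates = [w for w in vectors if w not in exclude]
    return min(candidates, key=lambda w: math.dist(vectors[w], target))

def analogy(a, b, c):
    """Start at c, apply the step that turns a into b, return the landing point."""
    return tuple(vc + (vb - va)
                 for va, vb, vc in zip(vectors[a], vectors[b], vectors[c]))

landing = analogy("man", "woman", "king")  # king - man + woman
print(landing)
print(nearest(landing, exclude={"man", "woman", "king"}))
```

The landing point is not exactly any word's coordinate; the answer is whichever word sits nearest to it, which in this toy is "queen".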

But there's a gap. Right now each token's vector is fixed — "bank" has one location, whether you mean a riverbank or a savings bank. Meaning in real language depends on context. We need a mechanism that lets each token adjust itself based on its neighbors. That mechanism is the heart of the whole revolution.

A3 & A4 The Transformer and Attention

Letting words read the room

This is the engine. It's worth slowing down, because if you understand this one idea, you understand why the last decade happened.

The problem the field was stuck on. Before 2017, the leading approach read text the way you'd read through a straw — one word at a time, left to right, trying to cram everything it had seen so far into a single running "memory." (These were called RNNs, recurrent neural networks — recurrent meaning they looped over the sequence step by step.) Two fatal flaws: they forgot the beginning of long passages by the time they reached the end, and because each step depended on the one before it, they couldn't be sped up by doing the work in parallel. Training was slow, and long-range understanding was poor.

The 2017 breakthrough — a paper bluntly titled Attention Is All You Need⁴ — threw out the straw entirely. Its architecture, the Transformer, lets the model look at many words at once and, for each word, decide which other words it should pay attention to. That's it. That's the idea. It's called self-attention: a token gets to ask other tokens "how relevant are you to me, right now?" and weight them accordingly. One crucial detail for the chat models you actually use: they read left-to-right and look only backward — each token can attend to the words that came before it, never the ones still to come. (It has to be this way: when the model is predicting the next word, the future words don't exist yet.) This is called causal (or masked) attention.

Here's the analogy that makes it click. Picture a dinner-party conversation. When someone says "it," they mentally check back over what's already been said: what does "it" refer to? — and the word "it" effectively turns up the volume on the noun it points back to, and turns down the irrelevant chatter. Each word builds its understanding by selectively listening to the room. Where the metaphor stops: in a chat model, a guest can only hear the people who spoke before them — nobody hears the future. That's the causal rule above, and it's why the model can generate one word at a time at all.
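For readers who want to see the dinner party as arithmetic, here is a stripped-down sketch of causal self-attention in plain Python. It makes one loud simplification: each token's vector is used directly as its query, key, and value, whereas a real Transformer first passes the vectors through learned projection matrices and runs many attention heads side by side. The three 2-D token vectors are made up.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(vectors):
    """Each token mixes in information from itself and earlier tokens only."""
    d = len(vectors[0])
    outputs, all_weights = [], []
    for i, query in enumerate(vectors):
        # Causal rule: token i may look at positions 0..i, never at the future.
        visible = vectors[: i + 1]
        # Relevance score = scaled dot product between the query and each key.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in visible]
        weights = softmax(scores)
        # New vector = weighted average of the visible tokens' vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, visible)) for j in range(d)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(tokens)
for i, row in enumerate(weights):
    print(i, [round(w, 2) for w in row])
```

The first token can attend only to itself, so its weight list has a single entry. Each later token gets one more entry than the one before it, and every row sums to 1.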

Hover a word. Watch it look backward.

Earlier words light up by how hard the hovered word attends to them. Forward words grey out — the model never peeks at the future.

Tap a word to lock it; hover to peek. Each word can only look backward.

Attention weights are precomputed and illustrative. Causal rule enforced: a word can only attend to words before it.

Two design notes that pay off later:

It's done in parallel, which is why scale became possible. Because attention looks at all tokens at once instead of marching through them one by one, the math is mostly large matrix multiplications — exactly the operation that graphics chips (GPUs) do blisteringly fast in parallel. The Transformer didn't just understand better; it understood in a shape the hardware loved, and that unlocked training on a scale RNNs could never reach. Architecture and hardware clicked together, and the race was on.
It's stacked into layers. One attention step isn't enough. A Transformer stacks dozens of these "everyone listens to everyone" layers, each refining the representation. Early layers catch grammar and nearby relationships; deeper layers assemble meaning, then something we loosely call reasoning. A modern frontier model is just a very deep stack of this same move.

So now we have an architecture: a tall stack of attention layers that turn a sequence of token-vectors into a rich, context-aware understanding, and finally into a prediction of the next token. But an architecture is an empty engine. It knows nothing yet. We have to fill it with knowledge. That's training, and it comes in three escalating acts.

A5 Pre-training

Reading the internet to learn the world

This is where a model gets its raw intelligence, and it's astonishingly simple to state: predict the next token, over and over, across a huge slice of human writing.

That's the entire objective. Show the model "The capital of France is ___" and have it guess; if it guesses wrong, nudge all those billions of internal numbers (called parameters — the knobs the model learns) a hair in the direction that would've been right. Do this trillions of times, over books, code, websites, and forums, and something remarkable happens: to get good at predicting text, the model is forced to learn the patterns behind the text — grammar, facts, a little arithmetic, the structure of an argument, the rhythm of a story. Understanding is a side effect of relentless autocomplete.
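One "guess, check, nudge" cycle can be sketched in Python. This toy is nothing like a real model: its only parameters are three scores (one per word in a made-up three-word vocabulary) for a single fixed prompt, and the learning rate of 0.5 is an arbitrary choice. The loss used, the negative log of the probability given to the right word, is the standard next-token objective.

```python
import math

vocab = ["Paris", "London", "banana"]  # made-up three-word vocabulary

def softmax(logits):
    """Convert raw scores into probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def training_step(logits, target_index, learning_rate=0.5):
    """One guess-check-nudge cycle for 'The capital of France is ___'."""
    probs = softmax(logits)
    # Loss is small when the correct word got high probability.
    loss = -math.log(probs[target_index])
    # Gradient of that loss with respect to each score: prob minus 1 for the
    # correct word, prob minus 0 for every other word.
    grad = [p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)]
    new_logits = [x - learning_rate * g for x, g in zip(logits, grad)]
    return new_logits, loss

logits = [0.0, 0.0, 0.0]  # the toy "parameters": no preference at the start
target = vocab.index("Paris")
losses = []
for _ in range(50):
    logits, loss = training_step(logits, target)
    losses.append(loss)

final_probs = softmax(logits)
print(round(losses[0], 3), round(losses[-1], 3))
print([round(p, 3) for p in final_probs])
```

The loss starts at the value for a three-way coin flip and shrinks as the scores are nudged toward "Paris". A real run does the same thing across billions of parameters and trillions of tokens.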

The pre-training loop.

No human grades anything. The internet IS the answer key — the next word is always sitting right there in the text.

Read text

A snippet from the internet: books, code, web.

→

Predict next token

Guess the word that comes next.

→

Check the answer

The real next word is right there — the internet IS the answer key.

→

Nudge the weights

Adjust slightly so the guess gets better. Repeat.

↺ × 1,000,000,000,000

Schematic of the self-supervised next-token objective. Input funnel: books + code + web + conversations.

$100M+

estimated cost of a single frontier pre-training run — months on tens of thousands of GPUs.

This stage is why models are so expensive and why only a handful of players do it: it eats months of time on tens of thousands of GPUs at an estimated cost of tens to hundreds of millions of dollars for a frontier run.¹² And it raises the central economic question of the field: given a fixed pile of money and compute, should you build a bigger model or feed it more data?

For a while everyone chased size — bigger model, bigger headlines. Then in 2022 a paper nicknamed Chinchilla showed the field had been doing it wrong: most big models were undertrained — too many parameters, too little data — and you'd get a smarter model for the same cost by making it smaller but feeding it far more text.⁵ The takeaway, now lore: data and model size must scale together. A model isn't "better" because it's bigger; it's better when its size and its training data are balanced for the compute you spent.
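If you want to feel the trade-off in numbers, here is a back-of-envelope sketch in Python. It leans on two rules of thumb that are not from this article and should be treated as assumptions: training compute is roughly 6 × parameters × tokens, and the Chinchilla result is often summarized as about 20 training tokens per parameter. The check uses Chinchilla's own reported size, roughly 70 billion parameters trained on 1.4 trillion tokens.

```python
import math

# Both constants are widely quoted approximations, not exact laws.
FLOPS_PER_PARAM_PER_TOKEN = 6   # training compute ~ 6 * N * D
TOKENS_PER_PARAM = 20           # rough "balanced" ratio often cited from Chinchilla

def training_flops(params, tokens):
    """Approximate compute needed to train `params` parameters on `tokens` tokens."""
    return FLOPS_PER_PARAM_PER_TOKEN * params * tokens

def balanced_split(compute_flops):
    """Given a compute budget, return a (params, tokens) pair at the balanced ratio."""
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_PER_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens

budget = training_flops(70e9, 1.4e12)
params, tokens = balanced_split(budget)
print(f"budget ~ {budget:.2e} FLOPs")
print(f"balanced: {params:.2e} params, {tokens:.2e} tokens")
```

Feeding Chinchilla's budget back in returns Chinchilla's own split, which is just the rule of thumb agreeing with itself. The useful part is plugging in a different budget and seeing that parameters and tokens both grow with its square root.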

Bigger isn't smarter. Balanced is smarter.

For a fixed compute budget, error bottoms out where parameters and data are balanced — this chart quietly reset how every lab budgets a run.

After Hoffmann et al. 2022 (Chinchilla). Illustrative U-curve; no equations.

Why this matters for evaluation. When a startup brags about "a trillion-parameter model," the right question isn't "how big?" — it's "how much did you train it, and on what?" Parameter count alone is a vanity metric. Data quality and quantity are where models are actually won or lost.

At the end of pre-training you have a base model: a sprawling, knowledgeable, deeply weird text-predictor. It is not a helpful assistant. Ask it a question and it might continue with five more questions, because on the internet, questions are often followed by more questions. It has knowledge but no manners, no sense that it's supposed to help you. Fixing that is the next two acts.

A6 & A7 Fine-tuning and post-training

Turning a know-it-all into an assistant

A base model is a brilliant, feral library that talks like the average of the internet. Post-training is the finishing school that turns it into the polite, helpful "assistant" you actually chat with. It's where a model gets its personality and its alignment — and it happens in steps.

Step one: Supervised fine-tuning (SFT) — show, don't tell. We collect a pile of high-quality example conversations — a human writes an ideal answer to a prompt — and we fine-tune the base model on them. Fine-tuning just means more training, but now on a small, curated set instead of the raw internet. The model learns the format of being helpful: when you ask a question, you answer it; you don't ramble; you follow instructions. This is imitation — the model copies good examples.

Step two: learning from preferences — rank, don't script. Imitation has a ceiling: humans can't hand-write an ideal answer to every possible prompt, and "good" is often a matter of taste and degree. So we switch from showing to judging. We have the model produce two answers, and a human (or another model) says "this one's better." Do this across mountains of comparisons, and you can teach the model to produce answers humans prefer — more helpful, more honest, less likely to confidently make things up.

The landmark here is InstructGPT / RLHF — Reinforcement Learning from Human Feedback.⁶ The recipe: use all those human preference judgments to train a reward model (a model that scores how good an answer is), then use reinforcement learning to push the assistant toward higher-scoring answers. RLHF is the single biggest reason ChatGPT felt like a leap over raw GPT-3: same underlying knowledge, radically better behavior. (The full machinery of how RL actually works is the hard part — that's exactly what the RL section below is for.)
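The reward model's training signal can be sketched in one small Python function. This is the standard pairwise preference loss: it is low when the reward model scores the human-preferred answer above the rejected one. The scores below are made-up numbers, and preference_loss is a hypothetical name.

```python
import math

def preference_loss(score_chosen, score_rejected):
    """Pairwise loss: -log(sigmoid(score_chosen - score_rejected))."""
    margin = score_chosen - score_rejected
    # log(1 + exp(-margin)) is the same quantity written without a division.
    return math.log(1.0 + math.exp(-margin))

print(round(preference_loss(2.0, 0.0), 3))  # reward model agrees with the human
print(round(preference_loss(0.0, 0.0), 3))  # reward model has no opinion
print(round(preference_loss(0.0, 2.0), 3))  # reward model disagrees
```

Training pushes this loss down across all the comparisons, which is what teaches the reward model to score answers the way the human raters would.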

From feral library to assistant.

Same brain the whole time — we're not adding knowledge, we're shaping behavior. Helpfulness rises left to right.

PANEL 1

Base model

Spouts raw internet text — knowledgeable, but unhelpful and unsteerable.

→

PANEL 2 · + SFT

Imitation

Answers in a clean assistant format — learns how to respond by copying good examples.

→

PANEL 3 · + RLHF

Judgment

Learns which responses humans actually prefer — the step where it grows up.

Schematic escalation. Same model throughout — each stage shapes behavior, not knowledge.

A few things worth internalizing, because they're where evaluation gets sharp:

Post-training is why two models with similar raw intelligence can feel wildly different. Tone, refusal style, how it handles ambiguity, whether it pushes back — that's almost all post-training. When people say a model "has good vibes," they're describing post-training.
There's more than one recipe now. RLHF is powerful but fiddly. Newer methods like DPO (Direct Preference Optimization) skip the separate reward model and tune the model directly on the "this one's better" pairs — simpler and cheaper, often nearly as good.⁷ And Constitutional AI replaces some human feedback with the model critiquing itself against a written set of principles, so the labeling scales without armies of human raters.⁸ When a lab describes its "secret sauce," it's usually a particular blend of these post-training moves.
This is also where the limits live. Post-training can make a model sound aligned without making it be reliable. A model can be trained to give answers humans rate highly — which is not the same as answers that are true. (When the model produces a confident, fluent falsehood, that's a hallucination — and post-training can accidentally reward exactly the smooth confidence that produces them.) Hold that thought; it's the dark side the RL section explains.
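To show how DPO, mentioned above, gets by without a separate reward model, here is a sketch of its per-pair loss in Python. The inputs are log-probabilities that the model being trained and a frozen reference copy assign to the preferred and rejected answers. The numbers are made up, and beta = 0.1 is just an illustrative setting.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one 'this one's better' pair."""
    # How much more (or less) likely each answer became relative to the reference.
    chosen_gain = logp_chosen - ref_logp_chosen
    rejected_gain = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)), written without a division.
    return math.log(1.0 + math.exp(-margin))

unchanged = dpo_loss(-10.0, -10.0, -10.0, -10.0)  # model identical to reference
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)    # preferred answer now more likely
print(round(unchanged, 4), round(improved, 4))
```

The loss drops when the model makes the preferred answer relatively more likely than the reference did, so the preference pairs tune the model directly.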

The model is now born and raised: knowledgeable from pre-training, helpful from post-training. It sits frozen, finished, weighing in at billions of parameters. Now — finally — someone sends it a message. That's the last stage, and it's the only one that happens every single time you hit enter.

A8 Inference

The model, in use, one token at a time

Inference is the model running: taking your prompt and generating a reply. This is the loop from the on-ramp, and now you have the full picture of what's happening inside each step. Your message gets tokenized, embedded, and pushed up through the whole stack of attention layers, which produces a ranked list of likely next tokens. One is chosen. It's appended to the conversation. The whole thing runs again for the next token. And again. Token by token, which is exactly why replies stream onto your screen rather than appearing all at once.
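The loop itself is short enough to sketch in Python. Everything model-like here is a stand-in: TOY_MODEL is a made-up lookup table that looks only at the last token, whereas a real model conditions on the whole conversation, and the sketch always takes the top-ranked token instead of sampling.

```python
# Hypothetical stand-in for a real model: last token -> next-token probabilities.
TOY_MODEL = {
    "<start>": {"There": 0.6, "The": 0.4},
    "There":   {"are": 0.9, "is": 0.1},
    "The":     {"<end>": 1.0},
    "are":     {"three": 0.7, "two": 0.3},
    "is":      {"<end>": 1.0},
    "three":   {"<end>": 1.0},
    "two":     {"<end>": 1.0},
}

def next_token_probs(tokens):
    """Rank the possible next tokens (the toy only looks at the last one)."""
    return TOY_MODEL[tokens[-1]]

def generate(prompt_tokens, max_new_tokens=20):
    """Rank, pick, append, repeat, until the 'I'm done' token shows up."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        choice = max(probs, key=probs.get)  # greedy: always take the top choice
        if choice == "<end>":
            break
        tokens.append(choice)               # the output becomes part of the input
    return tokens

print(generate(["<start>"]))  # ['<start>', 'There', 'are', 'three']
```

Nothing in TOY_MODEL changes while this runs, which mirrors the frozen parameters: generation reads the model, it never updates it.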

Generation is one frozen loop.

No learning happens here — the parameters are locked. Rank the next token, pick one, append, repeat.

Prompt → tokens

Text is split into tokens.

→

Tokens → vectors

Each token becomes a list of numbers.

→

Up the attention stack

Layers let tokens read each other for context.

→

Rank next tokens

Out comes a ranked list of likely next tokens.

→

Pick one, append

Choose one, add it, run the whole thing again.

↺ next token · this happens once per token, ~dozens of times per sentence

🔒 Parameters are frozen — no learning happens here; it's pure read-out.

Birth vs. life. Everything before inference — months, once, in a data center. Inference itself — a fraction of a second, every message, for every user on Earth.

Two things make inference the part of the lifecycle that businesses obsess over:

It's the recurring cost. Pre-training is a giant one-time bill. Inference is the meter that runs forever — every message from every user re-runs that whole stack. This is why "tokens per dollar" and clever tricks to make inference cheaper (we'll meet the KV cache and others later) are where a huge amount of real engineering money goes. A startup's margins often live or die here.
It's where you, the user, get a dial. Remember Stage 5 from the on-ramp — sampling. The model hands back probabilities; how you pick from them is a choice. Pick the single most-likely token every time and you get safe, repetitive, slightly robotic text. Allow some randomness and you get creativity — and, past a point, nonsense. That dial is temperature, and it's worth feeling with your own hands.

Temperature = creativity dial

Same probabilities, different boldness. Temperature is the user's dial between safe and creative.

FIXED PROMPT

The weather today is ___

Temperature: 0.8

Illustrative distribution; real vocabularies are ~100k tokens. Softmax-with-temperature, precomputed.
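Under the dial is one small formula: divide every score by the temperature before turning scores into probabilities. Here is a Python sketch for the prompt "The weather today is ___". The three candidate words and their scores are invented for illustration, and treating temperature 0 as "always take the top choice" is a common convention assumed here.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Low temperature sharpens the distribution; high temperature flattens it."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample(tokens, logits, temperature, rng=random):
    """Pick one token; temperature 0 is treated as 'always take the top choice'."""
    if temperature == 0:
        return tokens[logits.index(max(logits))]
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

tokens = ["sunny", "rainy", "purple"]  # hypothetical candidates
logits = [2.0, 1.0, 0.0]               # hypothetical raw scores

for t in (0.2, 0.8, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
print(sample(tokens, logits, 0))
```

At a low temperature nearly all the probability piles onto "sunny". At a high one, "purple" gets a real chance, which is where creativity and nonsense both come from.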

That's the full lifecycle: an empty architecture, filled with world-knowledge by pre-training, taught manners by post-training, and finally run, token by token, every time you ask it something. One story, eight stages, start to finish.

But notice we left a hole. Twice now — in RLHF, and in that warning about models learning to sound good rather than be good — we leaned on "reinforcement learning" and waved at it. RL is the hardest idea in this whole field to feel in your gut, and it's increasingly the thing separating frontier models from the pack. So let's give it the slow, careful treatment it deserves.

Section E Reinforcement Learning, in plain English

Everything so far — pre-training, fine-tuning — was the model learning from examples that already existed. Someone wrote the text; the model copied the pattern. Reinforcement learning (RL) is fundamentally different, and the difference is the whole point: there are no examples to copy. The model has to learn from the consequences of its own actions.

That's a big shift, so let's not start with the model at all. Let's start with a dog.

E1 The whole idea, in one analogy: training a dog

You want to teach a dog to sit. You can't explain it. You can't show it a textbook. All you can do is: wait, watch what the dog does, and reward the behavior you like. Dog flops down? No treat. Dog sits? Treat. Over many tries, the dog does more of what gets treats and less of what doesn't. It never gets told the rule — it discovers the rule by chasing the reward.

That is reinforcement learning, entire. Now here's the same picture with the five pieces of jargon labeled — because once you've seen them on the dog, they'll never scare you again:

Same five pieces, on a dog you already understand.

Every term below is the whole field of RL. None of them are new — you just learned them as a kid, teaching a dog to sit.

Policy the plan

The dog's current strategy for getting treats.

Rollout one try

One attempt to sit, start to finish.

Reward R 9

Treat or no treat. Just a number — not yet good or bad.

Advantage ▲reinforce ▼suppress

Better or worse than its average try. This is the part that actually teaches.

Explore-Exploit the choice

Try something new vs. repeat what already worked.

This panel is the key. The same five colors and words come back later as the actual training loop — only then, the "dog" is a language model.

The Rosetta Stone for reinforcement learning. Five jargon words, one familiar scene.

Let's take the five pieces one at a time. We'll keep the dog around, and bring in a video game when it helps.

E2 Policy — the player's current strategy

The policy is just the model's current strategy for what to do next. For the dog, it's "given what I'm seeing and hearing, what should my body do?" For a language model, the policy is literally the model itself: given the conversation so far, what's its strategy for choosing the next token?

The entire goal of RL is to improve the policy — to make the strategy better over time. At the start it's bad (the dog flops, the model rambles). Each round of training nudges the strategy toward choices that earn more reward. When a lab says "we did RL on the model," they mean: we ran this loop to upgrade the model's strategy.

Think of it as a video game too: your policy is your current playing style — how good you are at the game right now. A beginner's policy mashes buttons; an expert's policy is refined. RL is the practice that turns one into the other.

E3 Reward — the treat, and the trouble with treats

The reward is the signal that tells the model how good an outcome was. Treat for the dog. Points for the game. For an LLM, the reward might come from that reward model we met in post-training (it scores "how much would a human like this answer?"), or — in the powerful newer setups — from something far more objective: did the code pass the tests? Did the math problem reach the correct final answer?

That last point is quietly enormous, and it's worth flagging now because it explains a lot of the 2025–2026 frontier:

Why math and code are the RL goldmine. In most of life, "was that a good answer?" is fuzzy and needs a human to judge. But for math and code there's an automatic, unarguable reward: the answer is right or wrong, the tests pass or fail. That means you can run RL at massive scale with no humans in the loop, generating millions of attempts and rewarding the ones that work. This is why the models that suddenly got dramatically better at math and coding got there through RL — those domains hand you a perfect treat-dispenser for free.
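An automatic reward of this kind can be as simple as the following Python sketch. It assumes, purely for illustration, that the model's final answer is the last whole number in its reply; real graders are more careful about formats, and math_reward is a hypothetical name. The sample problem is 17 × 24, whose correct answer is 408.

```python
import re

def math_reward(answer_text, correct_answer):
    """Return 1.0 if the last whole number in the reply is correct, else 0.0."""
    numbers = re.findall(r"-?\d+", answer_text)
    if not numbers:
        return 0.0  # no number at all: no treat
    return 1.0 if int(numbers[-1]) == correct_answer else 0.0

attempts = [
    "17 × 24 = 408",
    "(17×20)+(17×4) = 408",
    "17 × 24 = 388",
    "17 + 24 = 41",
    "I am not sure",
]
rewards = [math_reward(a, 408) for a in attempts]
print(rewards)  # [1.0, 1.0, 0.0, 0.0, 0.0]
```

No human looked at any of those attempts. That is the whole appeal, and also the risk: a checker this naive rewards anything that happens to end in the right number.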

But rewards are also where RL gets dangerous, and this is the limitation you must understand to evaluate any "we used RL" claim. Whatever you reward, you get — including the loopholes. This is reward hacking (a flavor of Goodhart's Law: when a measure becomes a target, it stops being a good measure). The dog version: if you accidentally treat the dog every time it barks while sitting, you'll train a dog that sits and barks its head off, because barking became part of "what gets treats." The model version: if your reward model slightly prefers longer, more confident-sounding answers, RL will gleefully produce a model that's longer-winded and more confidently wrong — it found the loophole. RL optimizes exactly what you measure, not what you meant.

This is why frontier labs spend so much effort designing rewards that can't be gamed, and why "we did RL and the benchmark went up" should make you ask: did the model get smarter, or did it just learn to please your specific reward? (We'll return to that as a litmus test.)

E4 Rollout — one full attempt at the level

A rollout is one complete attempt, start to finish. The dog's single try at sitting. One full playthrough of a game level. For a language model, a rollout is the model generating a whole answer to a prompt — the entire response, start to "done."

Why give this its own word? Because RL learns by comparing many rollouts. You don't learn much from one attempt. You let the model take a hard problem and try it, say, a hundred different ways (this is where temperature and randomness earn their keep — they make the attempts vary). Some rollouts nail it; some flop. The reward sorts them. And from that spread of "this attempt good, that one bad," the model figures out what to do more of. Rollouts are the raw experience RL learns from — the model's own attempts are its only textbook.

One problem, many attempts.

RL doesn't need a textbook answer — it just needs to know which of its OWN attempts worked, then do more of those.

PROMPT

Solve: 17 × 24 = ?

One prompt → many attempts.

17 × 24 = 408 R 9

▲ reinforce

(17×20)+(17×4) = 408 R 9

▲ reinforce

408 (rounded check) R 8

▲ reinforce

17 × 24 = 388 R 2

▼ suppress

17 + 24 = 41 R 1

▼ suppress

RL learns by comparing many rollouts — the spread is the lesson.

Illustrative rollouts for a math problem. Green checkmarks indicate correct answers, red X's indicate wrong answers.

E5 Advantage — "was that better than my usual?"

Here's the subtle one, and it's the key that makes RL actually work. Suppose the dog sits and gets a treat. Good — but how good? If the dog sits every time and always gets a treat, then this particular sit was nothing special; it's just average. But if the dog usually flops and this time it sat — that sit was a big positive surprise, and that's the moment worth reinforcing hard.

Advantage is exactly this: how much better (or worse) was this attempt compared to what I'd normally expect? Not the raw reward — the surprise in the reward. A rollout that scored above the model's average gets pushed harder ("do more of this!"); one that scored below gets pushed away ("less of that"); one that's exactly average barely moves anything.

Why not just use the raw reward? Because raw scores are noisy and uninformative on their own. A "7 out of 10" means nothing until you know whether 7 is great (you usually get 3s) or disappointing (you usually get 9s). Advantage is the baseline-subtracted signal: it strips out "how hard is this problem in general" and isolates "did this attempt beat my own expectation." That's the clean learning signal. It's why a beginner gamer improves fastest: their average is so low that any halfway-decent attempt clears it by a wide margin, so the advantage signal is strong and every small win teaches a lot.

Advantage = reward − baseline

Reward says "this scored 9." Advantage says "this beat your usual 6 — do more of it." The second one is what actually drives learning.

FIXED PROMPT & ROLLOUTS

Solve: What is the capital of France?

Model's current baseline (average score): 6.0

Drag the baseline to see how advantages flip. When your average goes up, the same rollout becomes less impressive.

Interactive advantage calculation. The relative comparison is what drives learning, not the absolute reward scores.
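The same baseline experiment can be sketched in a few lines of code. The four reward values below are made up for illustration, and `advantages` and `verdict` are hypothetical helper names. The only real rule in play is the one stated above: advantage = reward − baseline.

```python
def advantages(rewards, baseline):
    """Advantage = reward - baseline, computed per rollout."""
    return [r - baseline for r in rewards]


def verdict(advantage):
    """Translate the sign of the advantage into a training nudge."""
    if advantage > 0:
        return "reinforce"
    if advantage < 0:
        return "suppress"
    return "no change"


rewards = [9, 7, 6, 2]  # hypothetical scores for four rollouts

for baseline in (3.0, 6.0, 9.0):
    row = [
        f"{r}:{a:+.1f} {verdict(a)}"
        for r, a in zip(rewards, advantages(rewards, baseline))
    ]
    print(f"baseline {baseline}: " + ", ".join(row))
```

Notice that the rollout scoring 7 is reinforced when the baseline is 3 and suppressed when the baseline is 9. The attempt did not change; only the expectation it is measured against did.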

Under the hood, lightly. The famous RL algorithms you'll hear named, PPO (the workhorse of RLHF) and the leaner GRPO that powered DeepSeek's math breakthrough, are at heart careful machinery for computing this advantage and nudging the policy by it without lurching too far in one step.⁹ ¹⁰ That "don't lurch too far" guardrail matters. Push the model too hard toward the reward in one update and it can break, forgetting its language skills while chasing points (a failure labs informally call drift, related to catastrophic forgetting). Labs hold that back with a leash (a KL penalty) that ties the model to its sensible starting point. You don't need the equations. You need the shape: try many times, see which tries beat your average, lean that way, but gently.
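If you do want a peek at the machinery, the sketch below shows three ingredients in their simplest form: a GRPO-style advantage that compares each rollout to its own group, a PPO-style clipped objective that limits how much credit one update can claim, and a KL divergence that measures how far the policy has wandered from a reference. This is a simplification under stated assumptions, not either paper's full algorithm: the choice of population standard deviation, the `eps=0.2` default, the `beta` value, and all function names are illustrative.

```python
import math
import statistics


def group_advantages(rewards):
    """GRPO-style: score each rollout against its own group's mean and spread."""
    mean = statistics.fmean(rewards)
    spread = statistics.pstdev(rewards)  # assumption: population std
    if spread == 0:
        return [0.0 for _ in rewards]  # every attempt tied, nothing to learn
    return [(r - mean) / spread for r in rewards]


def clipped_objective(ratio, advantage, eps=0.2):
    """PPO-style clipped surrogate for one sample.

    ratio = new_policy_prob / old_policy_prob for the sampled action.
    """
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same outcomes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


rewards = [9, 9, 8, 2, 1]  # the R values from the rollouts figure
for r, a in zip(rewards, group_advantages(rewards)):
    print(f"reward {r}: advantage {a:+.2f}")

# A good rollout the new policy already favors 1.5x more: credit is capped.
print(clipped_objective(ratio=1.5, advantage=1.0))

reference = [0.5, 0.3, 0.2]  # hypothetical starting policy
updated = [0.7, 0.2, 0.1]    # hypothetical policy after some RL
beta = 0.1                   # hypothetical leash strength
print(beta * kl_divergence(updated, reference))  # penalty subtracted from the objective
```

The clip is the per-step guardrail: once the policy has already moved 20% toward a good rollout, pushing further earns no extra credit in that update. The KL term is the longer leash, growing as the updated policy drifts away from where it started.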

E6 Exploration vs. exploitation — the gambler's dilemma

The last piece is the tension that sits underneath all of RL, and it's deeply human. Imagine your favorite restaurant. Every night you face a choice: order the dish you know is great (exploit what works), or try something new on the menu that might be even better — or might be a disappointment (explore). Order the usual forever and you'll never discover the better dish. Gamble every night and you'll eat a lot of bad meals. The art is the balance.

That's exploration vs. exploitation, and every RL system lives or dies by it:

Too much exploitation: the model locks onto the first decent strategy it finds and stops improving. The dog learns one mediocre trick and never discovers it could do better. In RL terms, the policy collapses — every rollout looks the same, there's no variety to learn from, progress flatlines.
Too much exploration: the model thrashes around trying wild things, never consolidating what works, never getting reliably good.

Every step, a choice.

Learn nothing new, or risk everything — RL is the constant art of tuning this dial. Early on, explore boldly. As you get good, exploit what works.

The fundamental tension in reinforcement learning — between safety and discovery.
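The restaurant dilemma has a classic toy form, often called epsilon-greedy: on most nights order the dish that has gone best so far, and on a small fraction `epsilon` of nights pick at random. The sketch below is an illustration of that idea, not how language-model RL handles exploration in practice (there, sampling randomness does that job). The dish qualities and the name `run_diner` are made up.

```python
import random


def run_diner(epsilon, nights, rng):
    """Epsilon-greedy: explore a random dish with probability epsilon,
    otherwise order the dish with the best track record so far."""
    true_quality = [0.5, 0.6, 0.9]  # hypothetical chance each dish is a good meal
    estimates = [0.0] * len(true_quality)
    orders = [0] * len(true_quality)
    good_meals = 0
    for _ in range(nights):
        if rng.random() < epsilon:
            dish = rng.randrange(len(true_quality))  # explore
        else:
            dish = max(range(len(estimates)), key=estimates.__getitem__)  # exploit
        enjoyed = 1 if rng.random() < true_quality[dish] else 0
        orders[dish] += 1
        # Running average of how this dish has gone so far.
        estimates[dish] += (enjoyed - estimates[dish]) / orders[dish]
        good_meals += enjoyed
    return good_meals, orders


for epsilon in (0.0, 0.1, 1.0):
    good_meals, orders = run_diner(epsilon, 2000, random.Random(0))
    print(f"epsilon={epsilon}: good meals={good_meals}, orders per dish={orders}")
```

With `epsilon=0.0` the diner never leaves the first dish on the menu, which is the "policy collapse" failure in miniature. With `epsilon=1.0` the diner orders at random forever and never cashes in on what it has learned.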

This is also where you can feel why RL on language is so much harder than RL in a game. In chess, every move is legal-or-not and the board tells you the truth. In language, the space of possible "moves" (sentences) is effectively infinite, the reward is often a fuzzy human judgment, and a model can explore its way straight into eloquent nonsense that fools the reward model. RL gave us the leap in reasoning models — but it's a leap walked on a knife's edge between "discovered something genuinely new" and "found a clever way to cheat the score."

E7 Putting it together — and how to use it as a bullshit detector

Step back and you can now read RL as one clean loop, in five plain words: try, score, compare, lean, repeat. The model (policy) takes many full attempts (rollouts), each earns a reward, advantage measures which attempts beat the model's own average, the strategy leans toward those — gently, on a leash to prevent drift — while balancing exploration against exploitation. Run that loop at scale, with a reward you can trust, and you get the dramatic reasoning gains of the modern era.

Try, score, compare, lean, repeat.

Every modern reasoning model is this loop, run a staggering number of times.

POLICY tries

Current strategy generates one ROLLOUT — an attempt.

REWARD scores it

A number for the attempt — R 9

ADVANTAGE = reward − baseline

Above average ▲ reinforce, below ▼ suppress.

Update POLICY

Nudge the weights toward what beat the baseline. Repeat.

the same machine, again

Five words ran the dog; the same five words run the model — unchanged.

The complete reinforcement learning loop. Policy → Rollouts → Rewards → Advantage → Policy update → Repeat.
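The whole loop fits in a short program if the "model" is shrunk to a toy. The sketch below assumes a policy that only chooses among three made-up answers to 17 × 24, a reward of 1 for the correct answer, and a baseline equal to the batch average. It uses a plain policy-gradient update with no clipping or KL leash, so it is a simplified relative of PPO and GRPO rather than either one. All names and hyperparameters are illustrative.

```python
import math
import random

CANDIDATES = ["408", "388", "41"]  # hypothetical answers the toy policy can give


def softmax(logits):
    """Turn preference scores into probabilities."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward(answer):
    """Verifiable reward: 1 if the arithmetic is right, else 0."""
    return 1.0 if answer == str(17 * 24) else 0.0


def train(steps=200, rollouts_per_step=8, learning_rate=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0]  # the policy starts with no preference
    for _ in range(steps):  # REPEAT
        probs = softmax(logits)
        # TRY: sample several rollouts from the current policy.
        picks = rng.choices(range(len(CANDIDATES)), weights=probs, k=rollouts_per_step)
        # SCORE: reward each rollout.
        rewards = [reward(CANDIDATES[i]) for i in picks]
        # COMPARE: advantage = reward - baseline (this batch's average).
        baseline = sum(rewards) / len(rewards)
        # LEAN: nudge each logit along the policy-gradient direction.
        grads = [0.0] * len(logits)
        for i, r in zip(picks, rewards):
            advantage = r - baseline
            for j in range(len(logits)):
                indicator = 1.0 if j == i else 0.0
                grads[j] += advantage * (indicator - probs[j])
        logits = [
            l + learning_rate * g / rollouts_per_step for l, g in zip(logits, grads)
        ]
    return softmax(logits)


final_probs = train()
for answer, p in zip(CANDIDATES, final_probs):
    print(f"P({answer}) = {p:.3f}")
```

The policy begins by guessing each answer a third of the time and ends up strongly preferring 408, without ever being shown the right answer directly. It only ever learned which of its own attempts scored above its batch average.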

And here's the payoff — the reason a layperson should care about any of this. When someone tells you "our model got better because of reinforcement learning," you now own the questions that separate substance from spin:

"What was the reward — and could the model have hacked it?" A trustworthy reward (passing real tests, correct math) is worlds apart from a fuzzy one the model can game by sounding confident.
"Did it get smarter, or just better at your benchmark?" RL optimizes precisely what you measure. A benchmark jump can be real reasoning or a learned loophole. (This is exactly the reward-hacking trap from E3, now in the wild.)
"How did you keep it from drifting?" If they can't speak to keeping the model stable and general while pushing it toward the reward, they may have a model that's brittle outside their narrow test.
"Where did the variety in attempts come from?" No exploration, no genuine learning — just a model polishing what it already knew.

If they have crisp, technical answers, you're likely looking at real work. If they wave their hands and say "we did RL," you now know enough to keep your wallet closed.

Notes Notes & sources

The conceptual backbone above is evergreen. The boxed material below dates, and is fenced off deliberately.

STATE OF PLAY — June 2026
· No single "best" model: GPT-5-series, Claude Opus 4.6/4.7, Gemini 3.1 Pro,
  and DeepSeek (V3.2 / V4) each lead different slices — science reasoning,
  coding, agentic tasks, and price-performance respectively.
· RL on verifiable rewards (math, code) is the dominant frontier lever; open
  labs (DeepSeek, Qwen) reached the frontier largely via cheaper RL recipes
  (e.g. GRPO) rather than sheer scale.
· Reasoning models that "think" before answering (test-time compute) are now
  standard at the frontier, not a novelty.
Specific models/numbers will age fast; the mechanisms above will not.

Primary sources (canonical papers, verified via the Valency academic corpus)

Sennrich, Haddow & Birch, Neural Machine Translation of Rare Words with Subword Units (2015), arXiv:1508.07909 — the BPE subword-tokenization scheme.
Radford et al., Language Models are Unsupervised Multitask Learners (2019, the GPT-2 report) — byte-level BPE, which makes every input (including emoji and unseen symbols) representable.
Mikolov et al., Efficient Estimation of Word Representations in Vector Space (2013), arXiv:1301.3781 — static word embeddings (word2vec) and the king−man+woman≈queen geometry. (A static-embedding result; LLM token embeddings use the same near-means-similar idea but are contextual — see source 4.)
Vaswani et al., Attention Is All You Need (2017), arXiv:1706.03762 — the Transformer and self-attention.
Hoffmann et al., Training Compute-Optimal Large Language Models (2022), arXiv:2203.15556 — the "Chinchilla" compute-optimal scaling result.
Ouyang et al., Training Language Models to Follow Instructions with Human Feedback (2022), arXiv:2203.02155 — InstructGPT / RLHF.
Rafailov et al., Direct Preference Optimization: Your Language Model Is Secretly a Reward Model (2023), arXiv:2305.18290 — DPO.
Bai et al., Constitutional AI: Harmlessness from AI Feedback (2022), arXiv:2212.08073 — Constitutional AI / RLAIF.
Schulman et al., Proximal Policy Optimization Algorithms (2017), arXiv:1707.06347 — PPO.
Shao et al., DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (2024), arXiv:2402.03300 — GRPO.

Sources for the boxed and dramatic claims

Petrov et al., Language Model Tokenizers Introduce Unfairness Between Languages (2023), arXiv:2305.15425 — some languages fragment into several times more tokens than English.
Epoch AI, Tracking frontier training compute & cost — estimate that frontier pre-training runs cost tens to hundreds of millions of dollars. (Estimate; figure moves over time.)

Supporting: Brown et al., Language Models are Few-Shot Learners (GPT-3, 2020, arXiv:2005.14165); Wei et al., Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (2022, arXiv:2201.11903).