An interactive primer
Trace one message through the whole machine — then take every piece apart with your own hands. No math required; bring only curiosity.
Reader's contract. You are smart and curious but you are not an ML engineer, and you don't want to become one. You want to understand — well enough to look a founder or a researcher in the eye and know whether their claim holds water. This document leads with pictures and analogies, defines every piece of jargon the moment it appears, and never makes you read a wall of text when a diagram would do. Math stays in the basement; we'll only come upstairs for it when a number actually changes how you think.
On-ramp Trace one message through the whole stack
Before we take anything apart, let's watch the whole machine run once, end to end, on a single real message. Everything in the rest of this document is just a zoom-in on one of these steps. Keep this picture in your head; we'll hang every later idea off it.
You type into a chat box:
"How many r's are in strawberry?"
You hit enter. To you it feels instant and obvious. To the model, your sentence is about to go through five transformations before a single word comes back. Here is the journey.
One message, six stages.
Everything later in this article is a zoom-in on exactly one of these boxes. The model never sees letters — only the chunks in stage 1.
Illustrative pipeline. Each stage gets its own section below.
Stage 1 — Tokenize. The model can't read letters. The first thing that happens is your sentence gets chopped into tokens — chunks of text, usually a word or a fragment of a word, that the model was taught to recognize as atomic units. "How", " many", " strawberry" might each be a token; longer or rarer words get split into pieces. Crucially, the model now sees chunks, not letters — remember this; it's the whole reason the strawberry question is hard for it.
Stage 2 — Embed. Each token is turned into a long list of numbers — a vector, which is just a coordinate that places the token somewhere in a vast "meaning-space." Tokens with similar meanings land near each other. The word is gone; a position in space has replaced it.
Stage 3 — Attention / Transformer. Now the model lets every token look at every other token and decide which ones matter for understanding it. This is attention, and it's the engine. "r's" looks back at "strawberry" and "how many" to figure out what's being counted. This happens in stacked layers, each one refining the picture.
Stage 4 — Predict. After all that looking-around, the model produces one thing and one thing only: a giant ranked list of every possible next token, each with a probability. It is, at heart, the world's most sophisticated autocomplete. For our prompt, the top candidates might be "There", "The", "Straw…", each with a score.
Stage 5 — Sample. From that ranked list, the model samples — picks one token. How adventurously it picks is controlled by a dial called temperature (we'll play with it later). Pick the safe top choice, or roll the dice on a lower-ranked one.
Stage 6 — Output, then loop. The chosen token is shown to you, then appended to the input, and the whole pipeline runs again to pick the next token. And again. One token at a time, looping, until it produces a special "I'm done" token. That streaming you see in a chat window? That's this loop, live.
The punchline you should already feel: the model never "counts the r's." It pattern-matches its way to an answer, token by token, having never seen the individual letters in "strawberry" at all. That's not a bug in one model — it's a direct consequence of Stage 1. By the end of Section A you'll understand exactly why, and you'll never be fooled by a "the AI can't spell" headline again.
Now let's earn that understanding. We'll walk the same pipeline again — slowly, properly — as the life story of a model: how one is born, and then how it's used.
Section A The lifecycle of a modern LLM
Here's the spine of this whole section, the story arc we're about to tell. A model isn't programmed. It's grown, then shaped, then used. Eight stages, one continuous story. Everything up to the final stage happens once, in a data center, over months — the BIRTH of the model. The last stage, inference, happens every single time you send a message — the model's LIFE. Let's go.
A1 Tokenization
A model can't see text the way you do. Before anything else, we have to convert writing into numbers, and the very first decision is: what is the smallest unit the model is allowed to see?
The naive answer is "letters." It fails: spelling out everything letter by letter makes sequences impossibly long and throws away the obvious fact that "running" and "runner" share a root. The other naive answer is "whole words." That fails too: there are millions of words, names, and typos, and the model would be helpless the first time it met a word it had never seen.
The field's answer is a beautiful compromise called subword tokenization, most commonly a scheme named Byte-Pair Encoding (BPE) [1]. The idea: start with small units, then repeatedly glue together the pairs that show up most often, until you've built a vocabulary of common chunks. Frequent words ("the", "many") become single tokens; rarer or longer words get assembled from a few pieces ("tokenization" → "token" + "ization"). Modern models run BPE not on characters but on raw bytes (byte-level BPE, introduced with GPT-2 [2]), which is what guarantees that anything the model meets, even a word it's never seen, an emoji, or a stray symbol, can always be spelled out from smaller fragments as a last resort. Nothing is ever un-representable.
See it as the model sees it.
Type anything. Watch words land as a chunk or two — not as letters. That gap is why it miscounts r's.
Token splits are precomputed and illustrative; real tokenizers vary by model. IDs shown for "strawberry": straw = 15140, berry = 19772.
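If you want to see the "glue together the most frequent pair" idea in motion, here is a minimal sketch in Python. It is a toy, not any real tokenizer: it works on characters rather than bytes, the word counts are invented, and the function names (`most_frequent_pair`, `merge_pair`, `train_bpe`) are made up for this illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words; return the commonest."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Rewrite every word so each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus_counts, num_merges):
    """Start from single characters and apply `num_merges` greedy merges."""
    words = {tuple(word): freq for word, freq in corpus_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words

# Hypothetical word counts, invented purely for illustration.
toy_corpus = {"berry": 6, "strawberry": 4, "straw": 3}
merges, segmented = train_bpe(toy_corpus, num_merges=8)
```

On this toy corpus, eight merges are enough for "strawberry" to end up as the two chunks `straw` + `berry`. The individual letters are gone from the model's view, which is the point of the section above.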
This one design choice has consequences that ripple through everything:
Context window = what the model can see
A model can only answer from what's inside its window. Shrink it, and the earliest facts simply vanish.
Illustrative. Real windows are 100k–1M+ tokens; the failure mode is identical, just further out.
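A tiny sketch of the vanishing act, under loud assumptions: whole words stand in for tokens, and the "keep only the most recent tokens" rule is one hypothetical truncation strategy (real products may summarize, refuse, or truncate differently).

```python
def visible_tokens(conversation_tokens, window_size):
    """Keep only the most recent `window_size` tokens; older ones drop out."""
    if window_size <= 0:
        return []
    return conversation_tokens[-window_size:]

# Words stand in for tokens here; a real tokenizer would split differently.
history = "my name is Ada . what is my name ?".split()

wide_view = visible_tokens(history, window_size=10)   # still contains "Ada"
narrow_view = visible_tokens(history, window_size=5)  # "Ada" has fallen out
```

With the narrow window the question is still visible but the fact needed to answer it is not, so the model has nothing to answer from.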
So: we've turned writing into a stream of token-IDs. But an ID is just a name tag: the number "15140" tells the model nothing about what "straw" means. That's the next problem.
A2 Embeddings
A token-ID is arbitrary. We need to convert each one into something that actually carries meaning. The trick: represent every token as a long list of numbers — a vector — that you can think of as coordinates in a high-dimensional space of meaning. (High-dimensional just means "lots of coordinates" — hundreds or thousands per token, instead of the three we live in. Don't try to picture it literally; picturing 3-D and trusting the math is enough.)
The magic property: the model learns these coordinates so that tokens with similar meaning land near each other. "King" sits near "queen." "Paris" sits near "France." This learned vector is called an embedding. Directions in the space can even encode relationships, as in the famous result that king − man + woman ≈ queen [3]. (One honest caveat: that clean piece of arithmetic comes from an earlier, static kind of word embedding, word2vec (2013), where each word has one fixed vector. The token embeddings inside an LLM use the same near-means-similar idea, but they don't stay fixed: the very next stage adjusts each one based on context. So treat king − man + woman as the intuition pump it is, not a literal operation happening inside GPT.)
Words become coordinates.
Related words land together. Watch the arithmetic: take king, apply the same step that turns man into woman, and you arrive at queen. The offset itself carries the meaning.
Illustrative; real embeddings have thousands of dimensions, flattened here to two. The king−man+woman result comes from static word2vec embeddings, not from inside an LLM.
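The same arithmetic, as a sketch you can run. Everything here is hand-made for illustration: the two-dimensional coordinates are invented (real embeddings are learned and have hundreds or thousands of dimensions), and `analogy` is a hypothetical helper, not a library function.

```python
import math

# Hand-picked 2-D coordinates: axis 0 is roughly "royalty", axis 1 "maleness".
toy_embeddings = {
    "man": (0.1, 0.9),
    "woman": (0.1, 0.1),
    "king": (0.9, 0.9),
    "queen": (0.9, 0.1),
    "prince": (0.7, 0.9),
}

def analogy(a, b, c, table):
    """Return the word nearest to a - b + c, ignoring the three input words."""
    target = tuple(x - y + z for x, y, z in zip(table[a], table[b], table[c]))
    candidates = (word for word in table if word not in {a, b, c})
    return min(candidates, key=lambda word: math.dist(table[word], target))

result = analogy("king", "man", "woman", toy_embeddings)  # lands on "queen"
```

The step from man to woman is a direction in the space; applying that same step to king is what lands you next to queen.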
Why does this matter so much? Because once meaning is geometry, reasoning starts to look like arithmetic the machine can actually do. The model isn't shuffling words; it's moving points around in a space where "closer" means "more related." Every later stage operates on these vectors, never on the text.
But there's a gap. Right now each token's vector is fixed — "bank" has one location, whether you mean a riverbank or a savings bank. Meaning in real language depends on context. We need a mechanism that lets each token adjust itself based on its neighbors. That mechanism is the heart of the whole revolution.
A3 & A4 The Transformer and Attention
This is the engine. It's worth slowing down, because if you understand this one idea, you understand why the last decade happened.
The problem the field was stuck on. Before 2017, the leading approach read text the way you'd read through a straw — one word at a time, left to right, trying to cram everything it had seen so far into a single running "memory." (These were called RNNs, recurrent neural networks — recurrent meaning they looped over the sequence step by step.) Two fatal flaws: they forgot the beginning of long passages by the time they reached the end, and because each step depended on the one before it, they couldn't be sped up by doing the work in parallel. Training was slow, and long-range understanding was poor.
The 2017 breakthrough, a paper bluntly titled Attention Is All You Need [4], threw out the straw entirely. Its architecture, the Transformer, lets the model look at many words at once and, for each word, decide which other words it should pay attention to. That's it. That's the idea. It's called self-attention: a token gets to ask other tokens "how relevant are you to me, right now?" and weight them accordingly. One crucial detail for the chat models you actually use: they read left-to-right and look only backward. Each token can attend to the words that came before it, never the ones still to come. (It has to be this way: when the model is predicting the next word, the future words don't exist yet.) This is called causal (or masked) attention.
Here's the analogy that makes it click. Picture a dinner-party conversation. When someone says "it," they mentally check back over what's already been said: what does "it" refer to? — and the word "it" effectively turns up the volume on the noun it points back to, and turns down the irrelevant chatter. Each word builds its understanding by selectively listening to the room. Where the metaphor stops: in a chat model, a guest can only hear the people who spoke before them — nobody hears the future. That's the causal rule above, and it's why the model can generate one word at a time at all.
Hover a word. Watch it look backward.
Earlier words light up by how hard the hovered word attends to them. Forward words grey out — the model never peeks at the future.
Attention weights are precomputed and illustrative. Causal rule enforced: a word can only attend to words before it.
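Here is the causal rule as a small sketch. The relevance scores are invented (a real model computes them from learned query and key vectors), and the sketch assumes each token may attend to itself as well as to earlier tokens.

```python
import math

def causal_attention_weights(scores):
    """Row i is a softmax over scores[i][0..i]; later positions get weight 0."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                      # the causal mask
        peak = max(visible)                         # subtract max for stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        hidden = [0.0] * (len(row) - i - 1)         # the future gets nothing
        weights.append([e / total for e in exps] + hidden)
    return weights

tokens = ["the", "trophy", "fell", "it"]
# Invented raw scores: row = the word doing the looking, column = looked at.
raw_scores = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 1.0, 0.0, 0.0],
    [0.2, 1.5, 1.0, 0.0],
    [0.1, 2.0, 0.5, 0.3],
]
weights = causal_attention_weights(raw_scores)
most_attended = tokens[weights[3].index(max(weights[3]))]  # what "it" points at
```

Each row sums to 1, so a row reads as "how this word splits its attention". In this toy, "it" puts most of its weight on "trophy".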
Two design notes that pay off later:
So now we have an architecture: a tall stack of attention layers that turn a sequence of token-vectors into a rich, context-aware understanding, and finally into a prediction of the next token. But an architecture is an empty engine. It knows nothing yet. We have to fill it with knowledge. That's training, and it comes in three escalating acts.
A5 Pre-training
This is where a model gets its raw intelligence, and it's astonishingly simple to state: predict the next token, over and over, across a huge slice of human writing.
That's the entire objective. Show the model "The capital of France is ___" and have it guess; if it guesses wrong, nudge all those billions of internal numbers (called parameters — the knobs the model learns) a hair in the direction that would've been right. Do this trillions of times, over books, code, websites, and forums, and something remarkable happens: to get good at predicting text, the model is forced to learn the patterns behind the text — grammar, facts, a little arithmetic, the structure of an argument, the rhythm of a story. Understanding is a side effect of relentless autocomplete.
The pre-training loop.
No human grades anything. The internet IS the answer key — the next word is always sitting right there in the text.
Schematic of the self-supervised next-token objective. Input funnel: books + code + web + conversations.
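What "guess, then get nudged" is measured by can be sketched in a few lines. The penalty used below is the standard one for this job (the negative log of the probability the model gave to the word that actually came next); the probabilities themselves are invented, and the nudging step (gradient descent over billions of parameters) is left out.

```python
import math

def next_token_loss(predicted_probs, actual_next_token):
    """Penalty is small when the true next token was given high probability."""
    return -math.log(predicted_probs[actual_next_token])

# Invented predictions for "The capital of France is ___".
early_in_training = {"Paris": 0.05, "London": 0.40, "banana": 0.55}
late_in_training = {"Paris": 0.90, "London": 0.08, "banana": 0.02}

loss_early = next_token_loss(early_in_training, "Paris")
loss_late = next_token_loss(late_in_training, "Paris")
```

Training is the process of driving this number down, averaged over trillions of such blanks. No human supplies the answer; the text does.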
This stage is why models are so expensive and why only a handful of players do it: it eats months of time on tens of thousands of GPUs at an estimated cost of tens to hundreds of millions of dollars for a frontier run [12]. And it raises the central economic question of the field: given a fixed pile of money and compute, should you build a bigger model or feed it more data?
For a while everyone chased size: bigger model, bigger headlines. Then in 2022 a paper nicknamed Chinchilla showed the field had been doing it wrong: most big models were undertrained, with too many parameters and too little data, and you'd get a smarter model for the same cost by making it smaller but feeding it far more text [5]. The takeaway, now lore: data and model size must scale together. A model isn't "better" because it's bigger; it's better when its size and its training data are balanced for the compute you spent.
Bigger isn't smarter. Balanced is smarter.
For a fixed compute budget, error bottoms out where parameters and data are balanced — this chart quietly reset how every lab budgets a run.
After Hoffmann et al. 2022 (Chinchilla). Illustrative U-curve; no equations.
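For readers who do want one number: a back-of-envelope sketch of "balanced". It leans on two widely quoted rules of thumb that are assumptions here, not figures from this primer: training compute is roughly 6 × parameters × tokens, and the Chinchilla-style balance point is roughly 20 training tokens per parameter. Treat both constants as approximate.

```python
import math

TOKENS_PER_PARAM = 20       # assumed rule of thumb, not an exact law
FLOPS_PER_PARAM_TOKEN = 6   # assumed approximation: compute ~ 6 * params * tokens

def training_compute(params, tokens):
    """Rough training cost in floating-point operations."""
    return FLOPS_PER_PARAM_TOKEN * params * tokens

def balanced_split(compute_budget):
    """Split a compute budget so that tokens ~ TOKENS_PER_PARAM * params."""
    params = math.sqrt(compute_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return params, TOKENS_PER_PARAM * params

# Chinchilla itself: about 70 billion parameters on about 1.4 trillion tokens.
budget = training_compute(70e9, 1.4e12)
params, tokens = balanced_split(budget)
```

Feed in Chinchilla's own budget and the sketch hands back roughly 70 billion parameters and 1.4 trillion tokens. Doubling the budget does not double the model; it grows both sides together.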
Why this matters for evaluation. When a startup brags about "a trillion-parameter model," the right question isn't "how big?" — it's "how much did you train it, and on what?" Parameter count alone is a vanity metric. Data quality and quantity are where models are actually won or lost.
At the end of pre-training you have a base model: a sprawling, knowledgeable, deeply weird text-predictor. It is not a helpful assistant. Ask it a question and it might continue with five more questions, because on the internet, questions are often followed by more questions. It has knowledge but no manners, no sense that it's supposed to help you. Fixing that is the next two acts.
A6 & A7 Fine-tuning and post-training
A base model is a brilliant, feral library that talks like the average of the internet. Post-training is the finishing school that turns it into the polite, helpful "assistant" you actually chat with. It's where a model gets its personality and its alignment — and it happens in steps.
Step one: Supervised fine-tuning (SFT) — show, don't tell. We collect a pile of high-quality example conversations — a human writes an ideal answer to a prompt — and we fine-tune the base model on them. Fine-tuning just means more training, but now on a small, curated set instead of the raw internet. The model learns the format of being helpful: when you ask a question, you answer it; you don't ramble; you follow instructions. This is imitation — the model copies good examples.
Step two: learning from preferences — rank, don't script. Imitation has a ceiling: humans can't hand-write an ideal answer to every possible prompt, and "good" is often a matter of taste and degree. So we switch from showing to judging. We have the model produce two answers, and a human (or another model) says "this one's better." Do this across mountains of comparisons, and you can teach the model to produce answers humans prefer — more helpful, more honest, less likely to confidently make things up.
The landmark here is InstructGPT and its method, RLHF (Reinforcement Learning from Human Feedback) [6]. The recipe: use all those human preference judgments to train a reward model (a model that scores how good an answer is), then use reinforcement learning to push the assistant toward higher-scoring answers. RLHF is the single biggest reason ChatGPT felt like a leap over raw GPT-3: same underlying knowledge, radically better behavior. (The full machinery of how RL actually works is the hard part; that's exactly what the RL section below is for.)
From feral library to assistant.
Same brain the whole time — we're not adding knowledge, we're shaping behavior. Helpfulness rises left to right.
Schematic escalation. Same model throughout — each stage shapes behavior, not knowledge.
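How does "this one's better" become something a reward model can learn from? A common formulation, sketched below, penalizes the reward model whenever it scores the human-preferred answer lower than the rejected one. The scores are invented, and `preference_loss` is an illustrative name; this is a sketch of the pairwise idea, not any lab's training code.

```python
import math

def preference_loss(score_preferred, score_rejected):
    """Small when the preferred answer outscores the rejected one by a margin."""
    margin = score_preferred - score_rejected
    return math.log(1.0 + math.exp(-margin))   # equals -log(sigmoid(margin))

agrees_with_human = preference_loss(score_preferred=2.0, score_rejected=0.0)
undecided = preference_loss(score_preferred=0.0, score_rejected=0.0)
disagrees_with_human = preference_loss(score_preferred=0.0, score_rejected=2.0)
```

Train on mountains of these comparisons and the reward model's scores start to track what people prefer, which is exactly the signal the RL step then chases.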
A few things worth internalizing, because they're where evaluation gets sharp:
The model is now born and raised: knowledgeable from pre-training, helpful from post-training. It sits frozen, finished, weighing in at billions of parameters. Now — finally — someone sends it a message. That's the last stage, and it's the only one that happens every single time you hit enter.
A8 Inference
Inference is the model running — taking your prompt and generating a reply. This is the loop from the on-ramp, and now you have the full picture of what's happening inside each step. Your message gets tokenized, embedded, and pushed up through the whole stack of attention layers, which produces a ranked list of likely next tokens. One is chosen. It's appended to the conversation. The whole thing runs again for the next token. And again. Word by word, which is exactly why replies stream onto your screen rather than appearing all at once.
Generation is one frozen loop.
No learning happens here — the parameters are locked. Rank the next token, pick one, append, repeat.
🔒 Parameters are frozen — no learning happens here; it's pure read-out.
Birth vs. life. Everything before inference — months, once, in a data center. Inference itself — a fraction of a second, every message, for every user on Earth.
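The loop itself fits in a dozen lines. In this sketch the "model" is a hypothetical lookup table keyed only on the latest token (a real model conditions on the entire sequence so far), and it always takes the top-ranked token rather than sampling.

```python
END = "<end>"

# Hypothetical stand-in for a frozen model: latest token -> next-token odds.
toy_model = {
    "<start>": {"There": 0.6, "The": 0.4},
    "There": {"are": 0.9, "is": 0.1},
    "The": {"answer": 1.0},
    "answer": {"is": 1.0},
    "is": {"three": 1.0},
    "are": {"three": 0.7, "two": 0.3},
    "three": {END: 1.0},
    "two": {END: 1.0},
}

def generate(model, prompt_tokens, max_new_tokens=10):
    """Rank, pick, append, repeat, until the model emits its 'done' token."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        ranked = model[tokens[-1]]              # rank the next token
        choice = max(ranked, key=ranked.get)    # pick one (greedy here)
        if choice == END:
            break
        tokens.append(choice)                   # append, then go around again
    return tokens

reply = generate(toy_model, ["<start>"])
```

Notice that nothing in `toy_model` changes while `generate` runs. That is the frozen part: inference reads the parameters and never writes them.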
Two things make inference the part of the lifecycle that businesses obsess over:
Temperature = creativity dial
Same probabilities, different boldness. Temperature is the user's dial between safe and creative.
Illustrative distribution; real vocabularies are ~100k tokens. Softmax-with-temperature, precomputed.
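The dial is a single division. In the sketch below the raw scores (logits) for three candidate tokens are invented; dividing them by the temperature before the softmax is the standard mechanism, and the function names are illustrative.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Low temperature sharpens the distribution; high temperature flattens it."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(tokens, logits, temperature, rng):
    """Draw one token according to the temperature-adjusted probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

tokens = ["There", "The", "Straw"]
logits = [2.0, 1.0, 0.1]          # invented raw scores

cautious = softmax_with_temperature(logits, temperature=0.2)
neutral = softmax_with_temperature(logits, temperature=1.0)
adventurous = softmax_with_temperature(logits, temperature=2.0)
picked = sample_token(tokens, logits, temperature=1.0, rng=random.Random(0))
```

The ranking never changes; only how lopsided the odds are. At low temperature the top token wins almost every time, and at high temperature the long shots get a real chance.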
That's the full lifecycle: an empty architecture, filled with world-knowledge by pre-training, taught manners by post-training, and finally run, token by token, every time you ask it something. One story, eight stages, start to finish.
But notice we left a hole. Twice now — in RLHF, and in that warning about models learning to sound good rather than be good — we leaned on "reinforcement learning" and waved at it. RL is the hardest idea in this whole field to feel in your gut, and it's increasingly the thing separating frontier models from the pack. So let's give it the slow, careful treatment it deserves.
Section E Reinforcement Learning, in plain English
Everything so far — pre-training, fine-tuning — was the model learning from examples that already existed. Someone wrote the text; the model copied the pattern. Reinforcement learning (RL) is fundamentally different, and the difference is the whole point: there are no examples to copy. The model has to learn from the consequences of its own actions.
That's a big shift, so let's not start with the model at all. Let's start with a dog.
E1 The whole idea, in one analogy: training a dog
You want to teach a dog to sit. You can't explain it. You can't show it a textbook. All you can do is: wait, watch what the dog does, and reward the behavior you like. Dog flops down? No treat. Dog sits? Treat. Over many tries, the dog does more of what gets treats and less of what doesn't. It never gets told the rule — it discovers the rule by chasing the reward.
That is reinforcement learning, entire. Now here's the same picture with the five pieces of jargon labeled — because once you've seen them on the dog, they'll never scare you again:
Same five pieces, on a dog you already understand.
Every term below is the whole field of RL. None of them are new — you just learned them as a kid, teaching a dog to sit.
The Rosetta Stone for reinforcement learning. Five jargon words, one familiar scene.
Let's take the five pieces one at a time. We'll keep the dog around, and bring in a video game when it helps.
E2 Policy — the player's current strategy
The policy is just the model's current strategy for what to do next. For the dog, it's "given what I'm seeing and hearing, what should my body do?" For a language model, the policy is literally the model itself: given the conversation so far, what's its strategy for choosing the next token?
The entire goal of RL is to improve the policy — to make the strategy better over time. At the start it's bad (the dog flops, the model rambles). Each round of training nudges the strategy toward choices that earn more reward. When a lab says "we did RL on the model," they mean: we ran this loop to upgrade the model's strategy.
Think of it as a video game too: your policy is your current playing style — how good you are at the game right now. A beginner's policy mashes buttons; an expert's policy is refined. RL is the practice that turns one into the other.
E3 Reward — the treat, and the trouble with treats
The reward is the signal that tells the model how good an outcome was. Treat for the dog. Points for the game. For an LLM, the reward might come from that reward model we met in post-training (it scores "how much would a human like this answer?"), or — in the powerful newer setups — from something far more objective: did the code pass the tests? Did the math problem reach the correct final answer?
That last point is quietly enormous, and it's worth flagging now because it explains a lot of the 2025–2026 frontier:
Why math and code are the RL goldmine. In most of life, "was that a good answer?" is fuzzy and needs a human to judge. But for math and code there's an automatic, unarguable reward: the answer is right or wrong, the tests pass or fail. That means you can run RL at massive scale with no humans in the loop, generating millions of attempts and rewarding the ones that work. This is why the models that suddenly got dramatically better at math and coding got there through RL — those domains hand you a perfect treat-dispenser for free.
But rewards are also where RL gets dangerous, and this is the limitation you must understand to evaluate any "we used RL" claim. Whatever you reward, you get — including the loopholes. This is reward hacking (a flavor of Goodhart's Law: when a measure becomes a target, it stops being a good measure). The dog version: if you accidentally treat the dog every time it barks while sitting, you'll train a dog that sits and barks its head off, because barking became part of "what gets treats." The model version: if your reward model slightly prefers longer, more confident-sounding answers, RL will gleefully produce a model that's longer-winded and more confidently wrong — it found the loophole. RL optimizes exactly what you measure, not what you meant.
This is why frontier labs spend so much effort designing rewards that can't be gamed, and why "we did RL and the benchmark went up" should make you ask: did the model get smarter, or did it just learn to please your specific reward? (We'll return to that as a litmus test.)
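Reward hacking in miniature. Both scoring functions below are hypothetical: `proxy_reward` is a deliberately flawed stand-in for a reward model that likes length and confident wording, and `true_reward` is what we actually meant to measure.

```python
def proxy_reward(answer):
    """Hypothetical flawed scorer: rewards length and confident wording."""
    words = answer.lower().split()
    return len(words) + 5 * words.count("definitely")

def true_reward(answer, correct="3"):
    """What we actually meant: does the answer state the right number?"""
    return 1 if correct in answer.split() else 0

candidates = [
    "3",
    "There are 3",
    "There are definitely 2 and I am definitely certain of it",
]

best_by_proxy = max(candidates, key=proxy_reward)
best_by_truth = max(candidates, key=true_reward)
```

The proxy's favorite is the long, confident, wrong answer. An RL loop pointed at `proxy_reward` would faithfully produce more answers like it.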
E4 Rollout — one full attempt at the level
A rollout is one complete attempt, start to finish. The dog's single try at sitting. One full playthrough of a game level. For a language model, a rollout is the model generating a whole answer to a prompt — the entire response, start to "done."
Why give this its own word? Because RL learns by comparing many rollouts. You don't learn much from one attempt. You let the model take a hard problem and try it, say, a hundred different ways (this is where temperature and randomness earn their keep — they make the attempts vary). Some rollouts nail it; some flop. The reward sorts them. And from that spread of "this attempt good, that one bad," the model figures out what to do more of. Rollouts are the raw experience RL learns from — the model's own attempts are its only textbook.
One problem, many attempts.
RL doesn't need a textbook answer — it just needs to know which of its OWN attempts worked, then do more of those.
Illustrative rollouts for a math problem. Green checkmarks indicate correct answers, red X's indicate wrong answers.
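A sketch of what "the reward sorts them" means when the reward is verifiable. The six attempts are a hard-coded, hypothetical stand-in for answers a model might sample at a nonzero temperature; the checker is the part that matters.

```python
def verify(answer, expected):
    """Verifiable reward: 1 if the final answer is right, 0 otherwise."""
    return 1 if answer == expected else 0

problem, expected = "17 * 24", 408

# Hypothetical final answers from six rollouts on the same problem.
attempts = [408, 398, 408, 418, 41, 408]

rewards = [verify(answer, expected) for answer in attempts]
success_rate = sum(rewards) / len(rewards)
```

No human graded anything and no worked solution was supplied. The only information flowing back is which of the model's own attempts landed.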
E5 Advantage — "was that better than my usual?"
Here's the subtle one, and it's the key that makes RL actually work. Suppose the dog sits and gets a treat. Good — but how good? If the dog sits every time and always gets a treat, then this particular sit was nothing special; it's just average. But if the dog usually flops and this time it sat — that sit was a big positive surprise, and that's the moment worth reinforcing hard.
Advantage is exactly this: how much better (or worse) was this attempt compared to what I'd normally expect? Not the raw reward — the surprise in the reward. A rollout that scored above the model's average gets pushed harder ("do more of this!"); one that scored below gets pushed away ("less of that"); one that's exactly average barely moves anything.
Why not just use the raw reward? Because raw scores are noisy and uninformative on their own. A "7 out of 10" means nothing until you know whether 7 is great (you usually get 3s) or disappointing (you usually get 9s). Advantage is the baseline-subtracted signal: it strips out "how hard is this problem in general" and isolates "did this attempt beat my own expectation." That's the clean learning signal. It's why a beginner gamer improves fastest: their average is so low that any halfway-decent attempt clears it by a wide margin, so the advantage signal is strong and every small win teaches a lot.
Advantage = reward − baseline
Reward says "this scored 9." Advantage says "this beat your usual 6 — do more of it." The second one is what actually drives learning.
Interactive advantage calculation. The relative comparison is what drives learning, not the absolute reward scores.
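The subtraction, spelled out. This sketch uses the simplest possible baseline, the average reward across a group of rollouts on the same problem; that choice is an assumption for illustration, and real algorithms estimate the baseline in more elaborate ways.

```python
def advantages(rewards):
    """Advantage = reward minus the group's average reward (the baseline)."""
    baseline = sum(rewards) / len(rewards)
    return [reward - baseline for reward in rewards]

strong_group = advantages([9, 6, 3])    # the 9 beat the usual 6
easy_problem = advantages([9, 9, 9])    # a 9 here is nothing special
```

The same raw score of 9 earns an advantage of +3 in the first group and 0 in the second. Only the first one tells the model to change anything.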
Under the hood, lightly. The famous RL algorithms you'll hear named, PPO (the workhorse from RLHF) and the leaner GRPO that powered DeepSeek's math breakthrough, are, at heart, careful machinery for computing this advantage and nudging the policy by it without lurching too far in one step [9, 10]. That "don't lurch too far" guardrail matters: push the model too hard toward the reward in one update and it can break, forgetting its language skills while chasing points (a failure labs informally call drift, related to catastrophic forgetting). Labs hold this back with a leash (a KL penalty) tying the model to its sensible starting point. You don't need the equations. You need the shape: try many times, see which tries beat your average, lean that way, but gently.
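For the curious, the leash can be sketched as a subtraction. KL divergence measures how far the updated model's next-token odds have moved from the starting model's; the distributions, the reward of 1.0, and the strength `beta` below are all invented for illustration, and real systems estimate this per token over sampled text rather than over three neat numbers.

```python
import math

def kl_divergence(new_probs, reference_probs):
    """How far `new_probs` has drifted from `reference_probs` (0 = identical)."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(new_probs, reference_probs)
        if p > 0
    )

def leashed_reward(reward, new_probs, reference_probs, beta):
    """Reward minus a penalty that grows as the policy drifts from its start."""
    return reward - beta * kl_divergence(new_probs, reference_probs)

reference = [0.5, 0.3, 0.2]     # the model before RL (invented numbers)
small_step = [0.55, 0.27, 0.18]
big_lurch = [0.98, 0.01, 0.01]

gentle = leashed_reward(1.0, small_step, reference, beta=0.5)
yanked = leashed_reward(1.0, big_lurch, reference, beta=0.5)
```

Both updates earned the same raw reward, but the big lurch pays a much larger penalty, so the gentle step scores higher overall.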
E6 Exploration vs. exploitation — the gambler's dilemma
The last piece is the tension that sits underneath all of RL, and it's deeply human. Imagine your favorite restaurant. Every night you face a choice: order the dish you know is great (exploit what works), or try something new on the menu that might be even better — or might be a disappointment (explore). Order the usual forever and you'll never discover the better dish. Gamble every night and you'll eat a lot of bad meals. The art is the balance.
That's exploration vs. exploitation, and every RL system lives or dies by it:
Every step, a choice.
Learn nothing new, or risk everything — RL is the constant art of tuning this dial. Early on, explore boldly. As you get good, exploit what works.
The fundamental tension in reinforcement learning — between safety and discovery.
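One classic way to set the dial is called epsilon-greedy: with a small probability (epsilon) try something at random, otherwise order the current favorite. It is sketched here on the restaurant, with invented dish names and scores; it illustrates the trade-off and is not how language-model RL actually explores (that relies mainly on sampling randomness).

```python
import random

def choose_dish(estimates, epsilon, rng):
    """Explore with probability epsilon, otherwise exploit the best-known dish."""
    if rng.random() < epsilon:
        return rng.choice(sorted(estimates))        # explore: anything goes
    return max(estimates, key=estimates.get)        # exploit: the favorite

def update_estimate(estimates, counts, dish, enjoyment):
    """Keep a running average of how much each dish was enjoyed."""
    counts[dish] += 1
    estimates[dish] += (enjoyment - estimates[dish]) / counts[dish]

# Invented numbers: how good each dish really is, unknown to the diner.
true_enjoyment = {"usual": 7.0, "special": 9.0, "soup": 4.0}

def dine(nights, epsilon, seed=0):
    """Simulate a run of dinners and return how often each dish was ordered."""
    rng = random.Random(seed)
    estimates = {dish: 0.0 for dish in true_enjoyment}
    counts = {dish: 0 for dish in true_enjoyment}
    for _ in range(nights):
        dish = choose_dish(estimates, epsilon, rng)
        update_estimate(estimates, counts, dish, true_enjoyment[dish])
    return counts

never_explores = dine(nights=50, epsilon=0.0)
```

With epsilon at zero the diner orders the first decent dish forever and never learns that the special is better. Raising epsilon buys discovery at the price of some bad meals.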
This is also where you can feel why RL on language is so much harder than RL in a game. In chess, every move is legal-or-not and the board tells you the truth. In language, the space of possible "moves" (sentences) is effectively infinite, the reward is often a fuzzy human judgment, and a model can explore its way straight into eloquent nonsense that fools the reward model. RL gave us the leap in reasoning models — but it's a leap walked on a knife's edge between "discovered something genuinely new" and "found a clever way to cheat the score."
E7 Putting it together — and how to use it as a bullshit detector
Step back and you can now read RL as one clean loop, in five plain words: try, score, compare, lean, repeat. The model (policy) takes many full attempts (rollouts), each earns a reward, advantage measures which attempts beat the model's own average, the strategy leans toward those — gently, on a leash to prevent drift — while balancing exploration against exploitation. Run that loop at scale, with a reward you can trust, and you get the dramatic reasoning gains of the modern era.
Try, score, compare, lean, repeat.
Every modern reasoning model is this loop, run a staggering number of times.
The complete reinforcement learning loop. Policy → Rollouts → Rewards → Advantage → Policy update → Repeat.
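The whole loop, shrunk to toy size. In this sketch the "policy" is just three adjustable scores over three canned answers to the strawberry question, the reward is a hypothetical exact-match checker, and the update is a plain policy-gradient step with a group-average baseline. There is no leash and no neural network; it is a sketch of the shape, not PPO or GRPO.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits.values())
    exps = {k: math.exp(v - peak) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

def rl_step(logits, reward_fn, rng, group_size=8, learning_rate=0.5):
    probs = softmax(logits)
    answers = list(probs)
    # Try: sample a group of rollouts from the current policy.
    rollouts = rng.choices(answers, weights=[probs[a] for a in answers], k=group_size)
    # Score: each rollout earns a reward.
    rewards = [reward_fn(answer) for answer in rollouts]
    # Compare: advantage is reward minus the group's average.
    baseline = sum(rewards) / group_size
    # Lean: shift scores toward rollouts that beat the average.
    for answer, reward in zip(rollouts, rewards):
        advantage = reward - baseline
        for k in logits:
            grad = (1.0 if k == answer else 0.0) - probs[k]
            logits[k] += learning_rate * advantage * grad / group_size

def count_rs_reward(answer):
    """Hypothetical verifiable reward: 'strawberry' has three r's."""
    return 1.0 if answer == "3" else 0.0

policy = {"2": 0.0, "3": 0.0, "4": 0.0}     # starts out with no preference
rng = random.Random(0)
before = softmax(policy)["3"]
for _ in range(200):                         # Repeat.
    rl_step(policy, count_rs_reward, rng)
after = softmax(policy)["3"]
```

The policy starts by giving the right answer one time in three and ends up preferring it, without ever being shown the answer. It only ever saw which of its own attempts scored.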
And here's the payoff — the reason a layperson should care about any of this. When someone tells you "our model got better because of reinforcement learning," you now own the questions that separate substance from spin:
If they have crisp, technical answers, you're likely looking at real work. If they wave their hands and say "we did RL," you now know enough to keep your wallet closed.
Notes Notes & sources
The conceptual backbone above is evergreen. The boxed material below dates, and is fenced off deliberately.
STATE OF PLAY — June 2026 · No single "best" model: GPT-5-series, Claude Opus 4.6/4.7, Gemini 3.1 Pro, and DeepSeek (V3.2 / V4) each lead different slices — science reasoning, coding, agentic tasks, and price-performance respectively. · RL on verifiable rewards (math, code) is the dominant frontier lever; open labs (DeepSeek, Qwen) reached the frontier largely via cheaper RL recipes (e.g. GRPO) rather than sheer scale. · Reasoning models that "think" before answering (test-time compute) are now standard at the frontier, not a novelty. Specific models/numbers will age fast; the mechanisms above will not.
Primary sources (canonical papers, verified via the Valency academic corpus)
Sourced for the boxed/dramatic claims (per research-discipline rule on dramatic numbers)
Supporting: Brown et al., Language Models are Few-Shot Learners (GPT-3, 2020, arXiv:2005.14165); Wei et al., Chain-of-Thought Prompting Elicits Reasoning in LLMs (2022, arXiv:2201.11903).