Skip to content
Chaoran Huang
LLMs

LLMs: From Representation to Behavior (Part 2 outline)

Part 2 of a 2-part series. The roadmap from a pre-trained transformer to a useful, aligned, reasoning assistant — Transformers, pre-training, SFT, RLHF, prompting, retrieval, and reasoning. Currently published as a planning outline; each section is a hub for a future deep-dive page.

This is a planning outline, not the finished post.

Part 2 ships in stages. This page lays out the full roadmap so the Part 1 → Part 2 bridge has a concrete destination, and so each future deep-dive has a home to grow into. Sections currently sit at 2–4 sentences of intent. They will expand into full mini-essays — and then into their own spoke pages — over time.

The bridge from Part 1

Part 1 — From One-Hot to BERT was built around one question: how do we make meaning measurable in vector space? It walked from one-hot identity vectors to contextual sentence embeddings, ending on attention.

Part 2 picks up exactly there. Once you can compute a rich vector for every token, the question rotates: what do you do with that vector? The answer turned out to be much bigger than anyone expected. You stack a lot of attention layers, train them at internet scale on a single objective (predict the next token), and you get an LLM. Then you spend the rest of the lifecycle teaching that LLM to behave.

Part 1 turned meaning into geometry. Part 2 turns geometry into behavior.

The seven sections below follow that lifecycle in order: architecture → pre-train → fine-tune → align → prompt → augment → reason.

1. The Transformer Decoder

Part 1 introduced attention through BERT, which is an encoder: every token attends to every other token in both directions. Generative LLMs use the decoder side — the same attention mechanism, but causally masked so each token can only attend to tokens before it. That single change is what turns understanding into generation: the model can be unrolled token-by-token to produce a sequence.

This section covers the decoder block (masked self-attention + feed-forward + residuals + layer norm), positional encoding, the cost of stacking many layers, and why "decoder-only" became the dominant architecture for everything from GPT to Llama.

2. Pre-training at Scale

Take a transformer decoder, point it at a few trillion tokens of text, and ask it to do one thing at every position: predict the next token. That is the entire pre-training objective. The remarkable empirical fact of the last decade is that this single, almost embarrassingly simple objective — done at sufficient scale — produces a model with broad world knowledge, latent skills, and surprising emergent abilities.

This section covers the next-token-prediction loss, dataset composition (web crawls, code, books), tokenizers and context length, scaling laws (Kaplan, Chinchilla), compute economics, and what "emergence" actually means and doesn't mean.

3. Supervised Fine-Tuning (SFT)

A pre-trained model is a brilliant autocomplete. It will continue any text you give it — but it has no concept of "answer the user's question." SFT is the first nudge toward usefulness: collect a curated dataset of (prompt, ideal response) pairs and continue training the same model on that data. The model learns the format of being a helpful assistant.

This section covers instruction-tuning datasets (FLAN, ShareGPT, custom), chat templates, why even a small SFT dataset (~10k examples) dramatically reshapes behavior, and the techniques that make it cheap: LoRA, QLoRA, and other parameter-efficient methods.

4. RLHF and Alignment

SFT teaches the model the shape of a good answer. It does not yet teach it which of several plausible answers is best. Alignment closes that gap. The classic recipe — RLHF — trains a separate reward model on human preferences (annotators rank pairs of model outputs), then fine-tunes the LLM with reinforcement learning to maximize the reward model's score, typically with PPO. More recent work (DPO, KTO, ORPO) skips the explicit reward model and optimizes preference directly.

This section covers the reward-modeling step, PPO at a high level, DPO as the modern simpler alternative, what "alignment" actually optimizes for (helpful, harmless, honest), and the open problems: reward hacking, sycophancy, and the gap between preference data and ground truth.

5. Prompting and In-Context Learning

Once a model is pre-trained, fine-tuned, and aligned, you do not need to retrain it to get new behavior — you can just ask. In-context learning (ICL) is the surprising property that LLMs can pick up new patterns from examples shown in the prompt itself, with no weight updates at all. Add Q: / A: pairs and the model will follow the format. Add a few worked examples and it will solve the new instance.

This section covers zero-shot vs few-shot prompting, system prompts and roles, chain-of-thought as a prompting trick that improves reasoning, prompt sensitivity and brittleness, and why ICL works at all (a still-open theoretical question).

6. Retrieval and Tool Use (RAG and Agents)

LLMs hallucinate, can't cite sources, and forget anything outside their context window. The fix is to give them external memory and external actions. Retrieval-augmented generation (RAG) is the now-canonical pattern: encode your private corpus into sentence embeddings (back to SBERT from Part 1, §10), look up the most relevant chunks for a query, paste them into the prompt, and let the model answer with citations. Tool use generalizes the same idea: instead of pasting documents, paste the result of an API call, a database query, or a code execution.

This section covers the RAG pipeline end-to-end (chunking → embedding → vector store → reranking → prompt injection), tool-calling APIs, agent loops (ReAct, plan-and-execute), and the operational headaches that show up in production.

7. Reasoning Models

The most recent shift is to spend more compute at inference time rather than only at training time. Models like o1, R1, and their successors are trained to produce long internal "thought" sequences before answering, and to revise their own reasoning. The big idea: if a problem is hard, the model should be allowed to think for a while — sometimes for thousands of tokens — before committing to an answer. This trades latency for accuracy, and on hard problems (math, code, multi-step planning) it works.

This section covers test-time compute as a new scaling axis, RL on reasoning traces, process supervision vs outcome supervision, the relationship to chain-of-thought, and where this goes next.

Closing the loop

Together, Part 1 and Part 2 trace one continuous arc:

One-hot vectors → dense embeddings → contextual embeddings → attention → decoder-only transformers → pre-training at scale → instruction following → alignment → prompting → retrieval → reasoning.

Each step is a model patching the previous step's blind spot. The same instinct that drove the field from one-hot to BERT is still driving it now — only the thing being measured has shifted from word similarity to behavioral usefulness.

If you have not read Part 1, it lives at From One-Hot to BERT. If you have, the rest of this hub is the work I am queueing up next.