Chaoran Huang
NLP

From One-Hot to BERT: How NLP Learned to Represent Meaning

Part 1 of a 2-part series. A historical, geometry-first walk through NLP representations: from one-hot vectors to contextual embeddings — the ideas that made LLMs possible. Each section is a hub for a future deep-dive page.

This is a planning outline, not the finished post.

1. The Core Problem

Why do cat and dog feel close to us, but not to a computer? To us they share a shape: small, four-legged, makes noise, lives in a house. To a raw text file they are just two different strings, no closer to each other than to Tuesday or quark.

Every interesting NLP task — spam classification, semantic search, machine translation, retrieval-augmented generation — is built on top of one quiet, unglamorous step: turning words into numbers. The job is not just to give each word a number, but to give it the right number, so that mathematical operations on those numbers behave like operations on meaning.

The thread running through this entire post is one sentence:

NLP is the search for a vector space where geometry equals meaning.

Every model we will look at — one-hot, Word2Vec, ELMo, BERT, SBERT — is a different attempt to build that space. Models are milestones; the protagonist is the geometry.

This is Part 1 of a 2-part series. Part 1 covers representation — how we turned language into vectors. Part 2 covers generation — how those vectors became LLMs.

What changed in vector space? — Nothing yet; we are about to build it.

2. One-Hot Encoding: Words as IDs

Before anything clever, we need a way to point at words at all. The earliest representation is the simplest one possible: pick a vocabulary $V$, give each word a unique index, and represent that word as a vector that is all zeros except for a single 1 at its index.

$$x \in \{0,1\}^{|V|}, \qquad \sum_i x_i = 1$$

That gives every word a unique fingerprint — perfect identity. The trouble shows up the moment we try to compare two of them. Take the cosine similarity of any two distinct one-hot vectors:

$$\cos(x_{\text{cat}},\, x_{\text{dog}}) = 0$$

Every pair of distinct words is orthogonal. cat is no closer to dog than it is to Tuesday or quark. The space has points but no neighborhoods.
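A minimal sketch of that orthogonality, using only the Python standard library. The four-word vocabulary and the helper names `one_hot` and `cosine` are illustrative assumptions, not part of any particular library.

```python
import math

def one_hot(word, vocab):
    """Return a list of zeros with a single 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical four-word vocabulary, chosen only for illustration.
vocab = ["cat", "dog", "Tuesday", "quark"]
vectors = {w: one_hot(w, vocab) for w in vocab}

cat_dog = cosine(vectors["cat"], vectors["dog"])
cat_quark = cosine(vectors["cat"], vectors["quark"])
cat_cat = cosine(vectors["cat"], vectors["cat"])
print(cat_dog, cat_quark, cat_cat)  # 0.0 0.0 1.0
```

Whatever the vocabulary size, the only pair with non-zero similarity is a word with itself.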

A small caveat we will not dwell on: real systems do not split text on spaces. They use subword tokenizers like BPE or WordPiece, so a "word" in this section might really be a token. The geometry argument is identical regardless of what we put on the axes.

What changed in vector space? — We have points, but no neighborhoods.

3. Meaning from Co-occurrence

One-hot fails because the vectors carry zero information about what a word is like. We need a way to make cat and dog end up close. The classical answer comes from linguistics, in a quote attributed to J.R. Firth:

You shall know a word by the company it keeps.

You have probably seen bag-of-words: count how many times each word appears in a document, then compare documents by those counts. Co-occurrence flips the same trick — instead of counting words per document, count which words appear next to which other words within a small window. Same arithmetic, smaller window, and now we are describing meaning instead of topics.

Concretely, build a word-context matrix: rows are words, columns are context words within a window of, say, $\pm 5$, and each cell holds a co-occurrence count (or a smoothed version like PPMI). Each row of this matrix is now a vector for that word, and we can compare two of them with cosine similarity:

$$\cos(u, v) = \dfrac{u \cdot v}{\|u\|\,\|v\|}$$

For the first time, cat and dog come out genuinely similar — not because anyone told the system so, but because they appear next to the same neighbors: feed, pet, tail, vet. Meaning emerged from statistics.
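The counting step can be sketched in a few lines. The three-sentence corpus and the function name `cooccurrence_counts` are assumptions made for illustration; a real corpus would be far larger and would usually be reweighted (for example with PPMI) before comparing rows.

```python
from collections import Counter, defaultdict

def cooccurrence_counts(sentences, window=5):
    """For each word, count how often every other word appears within +/- window tokens."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a word is not its own context
                    counts[word][tokens[j]] += 1
    return counts

# Hypothetical toy corpus, already tokenized on whitespace.
corpus = [
    "i feed my cat".split(),
    "i feed my dog".split(),
    "the bank holds my money".split(),
]
counts = cooccurrence_counts(corpus, window=5)

print(counts["cat"])    # context counts for "cat": i, feed, my
print(counts["money"])  # context counts for "money": the, bank, holds, my
```

In this toy corpus cat and dog end up with identical rows because they occur in identical contexts, which is the distributional idea in its most extreme form.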

The catch: these vectors are huge ($|V|$-dimensional) and mostly zero. The space works, but it is wasteful.

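| word  | feed | pet | tail | bank |
|-------|------|-----|------|------|
| cat   | 12   | 18  | 9    | 0    |
| dog   | 14   | 20  | 11   | 0    |
| money | 0    | 0   | 0    | 17   |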
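As a worked check, here is the cosine similarity computed on the rows of the table, reading the counts as cat = (12, 18, 9, 0), dog = (14, 20, 11, 0), and money = (0, 0, 0, 17) over the columns feed, pet, tail, bank. The helper name `cosine` is an assumption.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Rows of the word-context table; columns are feed, pet, tail, bank.
rows = {
    "cat": [12, 18, 9, 0],
    "dog": [14, 20, 11, 0],
    "money": [0, 0, 0, 17],
}

cat_dog = cosine(rows["cat"], rows["dog"])
cat_money = cosine(rows["cat"], rows["money"])
print(round(cat_dog, 3), cat_money)  # 0.999 0.0
```

cat and dog point in almost the same direction, while cat and money share no context at all and stay orthogonal.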

What changed in vector space? — Similar meaning now produces nearby vectors, but the space is sparse and huge.

4. Word2Vec: Dense Embeddings

Sparse co-occurrence vectors get the geometry right but the dimensionality is brutal — one axis per vocabulary word, almost all of them zero. Word2Vec (Mikolov et al., 2013) keeps the geometry and throws out the waste by learning a dense vector of typically 100–300 dimensions per word.

It does this by training a tiny neural network on one of two prediction tasks:

  • CBOW: predict the center word from its surrounding context.
  • Skip-gram: predict the surrounding context from the center word.

Either objective forces the model to push together the vectors of words that play similar roles, because their gradients pull in similar directions. To make training fast on huge corpora, a follow-up paper by the same authors (also 2013) introduces negative sampling: instead of normalizing over the whole vocabulary at every step, contrast the true context word against a handful of randomly sampled "negative" words.
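The two ingredients, skip-gram pair extraction and the negative-sampling loss for a single pair, can be sketched as follows. This is an illustration under simplifying assumptions, not the reference implementation: the function names are made up, the 2-d vectors are hand-picked rather than learned, and a real model keeps separate input and output embedding tables and updates them by gradient descent.

```python
import math

def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token and its neighbors within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgns_loss(center_vec, context_vec, negative_vecs):
    """Loss for one pair: reward a high score for the true context, a low score for each negative."""
    loss = -math.log(sigmoid(dot(center_vec, context_vec)))
    for neg in negative_vecs:
        loss -= math.log(sigmoid(-dot(center_vec, neg)))
    return loss

tokens = "the cat sat on the mat".split()
pairs = skipgram_pairs(tokens, window=1)
print(pairs[:3])  # [('the', 'cat'), ('cat', 'the'), ('cat', 'sat')]

# Toy 2-d vectors, hand-picked for illustration rather than learned.
center = [1.0, 0.5]
aligned_context = [1.0, 0.5]
opposed_context = [-1.0, -0.5]
negatives = [[-0.8, -0.2], [-0.5, -1.0]]

good = sgns_loss(center, aligned_context, negatives)
bad = sgns_loss(center, opposed_context, negatives)
print(good < bad)  # True
```

The loss is lower when the center and true-context vectors already point the same way, so minimizing it pulls co-occurring words together without ever touching the full vocabulary.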

The dense vectors that fall out of this have a property no one designed in deliberately, and which still feels a little magical when you see it for the first time:

$$\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}$$

Meaning is no longer just a matter of being close. It has direction. There is a "gender" axis, a "tense" axis, a "country–capital" axis, all discovered by predicting blanks in a corpus.
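The analogy arithmetic can be sketched with a toy embedding table. The 3-d vectors below are hand-built so that one axis loosely tracks royalty and another gender; they are assumptions for illustration, not trained Word2Vec vectors. Excluding the three query words from the candidates is the usual convention when evaluating analogies.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the three query words."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hypothetical hand-built vectors: (royalty, maleness, fruitiness).
vectors = {
    "king":  [0.9, 0.9, 0.0],
    "queen": [0.9, 0.1, 0.0],
    "man":   [0.1, 0.9, 0.0],
    "woman": [0.1, 0.1, 0.0],
    "apple": [0.0, 0.1, 0.9],
}

result = analogy("king", "man", "woman", vectors)
print(result)  # queen
```

Subtracting man from king removes the gender component and keeps royalty; adding woman puts the opposite gender back in.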

What changed in vector space? — Dense, low-dimensional, and meaning has direction (analogies work).

5. Why Word2Vec Is Not Enough

Dense vectors are a huge step forward. But notice what Word2Vec quietly assumes: one word, one vector, forever. Every appearance of bank in every sentence ever written gets the exact same embedding.

Consider:

  • "I sat on the river bank."
  • "I deposited cash at the bank."

Two clearly different meanings, one identical vector. The same is true for play, bat, charge, cell, light, and a long tail of polysemous words. Static embeddings are blind to context.
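A tiny sketch makes the blindness concrete. The lookup table, the fallback vector for unknown words, and the function name `embed_tokens` are all hypothetical.

```python
# Hypothetical static embedding table with a single entry, plus a fallback for unknown words.
table = {"bank": [0.3, 0.8]}
unknown = [0.0, 0.0]

def embed_tokens(tokens):
    """Static lookup: each token maps to one fixed vector, whatever sentence it appears in."""
    return [table.get(tok, unknown) for tok in tokens]

river = "i sat on the river bank".split()
money = "i deposited cash at the bank".split()

bank_in_river = embed_tokens(river)[river.index("bank")]
bank_in_money = embed_tokens(money)[money.index("bank")]
print(bank_in_river == bank_in_money)  # True
```

Nothing about the surrounding words ever reaches the lookup, so the two senses cannot be told apart.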

What changed in vector space? — Nothing changed; that is the problem.

6. Language Modeling: Why Order Matters

There is a second, deeper limitation that Word2Vec also has: it sees a window of words but not their order. To a true bag-of-words view, "dog bites man" and "man bites dog" are the same sentence. They are not.

The classical NLP framing for taking order seriously is language modeling: assign a probability to a sequence of words, factored as the product of conditional next-token probabilities.

$$P(w_1, w_2, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_{<t})$$

Read it left to right: the probability of the whole sentence is the product, at each position, of how likely the next word is given everything before it. That single objective is the conceptual seed that will eventually grow into GPT — but we are getting ahead of ourselves. For now, language modeling is just the first time our system has to care about position, not just identity.
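A minimal sketch of the product in action. It makes one loud simplification: each word is conditioned only on the single previous word (a bigram model) rather than on the full history $w_{<t}$, and there is no smoothing, so unseen bigrams get probability zero. The corpus and function names are assumptions.

```python
from collections import Counter

def train_bigram(sentences):
    """Count bigrams and their left-hand contexts, with <s> marking the sentence start."""
    context_counts = Counter()
    bigram_counts = Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens
        for prev, word in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return context_counts, bigram_counts

def sequence_probability(tokens, context_counts, bigram_counts):
    """Multiply P(w_t | w_{t-1}) along the sequence; unseen contexts or bigrams give 0."""
    prob = 1.0
    padded = ["<s>"] + tokens
    for prev, word in zip(padded, padded[1:]):
        if context_counts[prev] == 0:
            return 0.0
        prob *= bigram_counts[(prev, word)] / context_counts[prev]
    return prob

# Hypothetical toy corpus.
corpus = [
    "dog bites man".split(),
    "dog bites bone".split(),
    "man walks dog".split(),
]
context_counts, bigram_counts = train_bigram(corpus)

p_forward = sequence_probability("dog bites man".split(), context_counts, bigram_counts)
p_reverse = sequence_probability("man bites dog".split(), context_counts, bigram_counts)
print(p_forward, p_reverse)
```

Here "dog bites man" scores 2/3 × 1 × 1/2 = 1/3, while "man bites dog" scores zero because "bites" never follows "man" in the corpus: the same three words, in a different order, get a different probability.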

What changed in vector space? — Representations now need to depend on position, not just identity.

7. RNN and LSTM: Adding Memory

We now have an objective (predict the next word) that demands sequence-awareness, but Word2Vec's static lookup table cannot deliver it. The recurrent neural network (RNN) was the first widely used architecture that could.

An RNN reads tokens one at a time and carries a hidden state $h_t$ forward as it goes. At each step it folds the new input into what it already remembers:

$$h_t = g(U h_{t-1} + W x_t)$$

where $g$ is a non-linearity, $W$ projects the new input, and $U$ propagates the previous state. Stack these steps and you have a network that, in principle, can remember anything from arbitrarily far back.
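The recurrence can be sketched directly from the update rule, taking the non-linearity to be tanh. The 2-d token vectors and the weight matrices below are small hand-picked values, not trained parameters, and the function names are assumptions.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_forward(inputs, U, W, h0):
    """Apply h_t = tanh(U h_{t-1} + W x_t) across the sequence; return every hidden state."""
    h = h0
    states = []
    for x in inputs:
        pre = [a + b for a, b in zip(matvec(U, h), matvec(W, x))]
        h = [math.tanh(p) for p in pre]
        states.append(h)
    return states

# Hypothetical 2-d input vectors for three tokens, and hand-picked weights.
dog, bites, man = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
U = [[0.5, -0.3], [0.2, 0.4]]   # propagates the previous state
W = [[1.0, 0.0], [0.0, 1.0]]    # projects the new input
h0 = [0.0, 0.0]

states = rnn_forward([dog, bites, man], U, W, h0)
forward = states[-1]
reverse = rnn_forward([man, bites, dog], U, W, h0)[-1]
print(forward != reverse)  # True
```

The same three inputs in a different order leave the network in a different final state, which is exactly the order sensitivity a bag-of-words view lacks.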

In practice, vanilla RNNs struggle with long-range dependencies: gradients vanish or explode as they backpropagate through many time steps, so the model effectively forgets things from more than a handful of tokens ago. The LSTM (Hochreiter & Schmidhuber, 1997) mitigates the vanishing-gradient side of this problem with learnable gates. In its now-standard form there are three (forget, input, and output), which decide, at each step, what to keep, what to overwrite, and what to expose; the forget gate was a later addition to the 1997 design.
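For reference, the gate computations are commonly written as below. This is the standard textbook formulation rather than something specific to this post: the bias terms $b$, the cell state $c_t$, the candidate $\tilde{c}_t$, the sigmoid $\sigma$, the element-wise product $\odot$, and the per-gate subscripts are notation assumed here, with $W$ and $U$ playing the same roles as in the RNN update.

```latex
\begin{aligned}
f_t &= \sigma(U_f h_{t-1} + W_f x_t + b_f) && \text{forget gate: what to keep} \\
i_t &= \sigma(U_i h_{t-1} + W_i x_t + b_i) && \text{input gate: what to overwrite} \\
o_t &= \sigma(U_o h_{t-1} + W_o x_t + b_o) && \text{output gate: what to expose} \\
\tilde{c}_t &= \tanh(U_c h_{t-1} + W_c x_t + b_c) && \text{candidate cell content} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{updated cell state} \\
h_t &= o_t \odot \tanh(c_t) && \text{new hidden state}
\end{aligned}
```

The additive update of $c_t$ is the key design choice: when $f_t$ is close to 1, information and gradients pass through the cell state largely unchanged across many steps.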

Inside each hₜ, an LSTM cell is running its forget/input/output gates over the previous state and the new input. The chain itself is the new idea: representation is now a function of everything the model has seen so far.

What changed in vector space? — A word's representation now depends on everything that came before it (left context).

8. ELMo: Contextual Word Embeddings

The RNN gives us a way to roll context up over time, but if we feed it Word2Vec embeddings as inputs, the starting point for every occurrence of bank is still the same vector. The contextual mixing only happens inside the hidden states. ELMo (Peters et al., 2018) moves the contextualization into the embedding itself.

The trick: train a bidirectional LSTM as a language model — one direction reading left-to-right, the other right-to-left — and then, for any token in any sentence, take the LSTM's hidden states at that position (across all layers) and combine them into a single vector. That vector is the contextual embedding.
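The combining step can be sketched as a learned weighted sum over layers, which is how the ELMo paper describes it: softmax-normalized per-layer weights and a single scale factor. The variable names, the toy 2-d layer vectors, and the function `mix_layers` are assumptions; in the real model each layer vector concatenates the forward and backward hidden states.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mix_layers(layer_vectors, layer_scores, gamma=1.0):
    """Collapse one token's per-layer vectors into a single vector: gamma * sum_j w_j * h_j."""
    weights = softmax(layer_scores)
    dim = len(layer_vectors[0])
    return [
        gamma * sum(w * vec[d] for w, vec in zip(weights, layer_vectors))
        for d in range(dim)
    ]

# Hypothetical hidden states for one token at three layers (2-d for readability).
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

uniform = mix_layers(layers, [0.0, 0.0, 0.0])    # equal weight on every layer
top_heavy = mix_layers(layers, [0.0, 0.0, 5.0])  # almost all weight on the last layer
print(uniform)
print(top_heavy)
```

Because the layer scores are trained per downstream task, each task can decide how much to draw from lower layers versus higher ones.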

The same word now produces different vectors in different sentences:

  • "I sat on the river bank." → bank lands near shore, stream, boat.
  • "I deposited cash at the bank." → bank lands near loan, teller, interest.

What changed in vector space? — A word is no longer a point; it is a function of its sentence.

9. BERT: Full Context with Attention

ELMo proved that contextual embeddings work, but reading a sentence left-to-right and then right-to-left through an LSTM is slow and indirect. Information about the last token has to crawl through every hidden state to influence the first. BERT (Devlin et al., 2018) and the transformer architecture it is built on (Vaswani et al., 2017) replace recurrence with self-attention: every token can look at every other token in a single step.

The single equation behind it all is the scaled dot-product attention:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

The intuition for $Q$, $K$, $V$ in one breath: each token emits a query describing what it is looking for, a key describing what it offers, and a value that gets passed along when something matches. (This $V$ is the value matrix, not the vocabulary $V$ from Section 2.) The softmax over the scaled scores $Q K^\top / \sqrt{d_k}$ produces, for every token, a distribution over all tokens in the sentence, itself included: a soft pointer that says here is who I am paying attention to right now. The output for each token is a weighted blend of the value vectors of everyone it attended to.
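A minimal single-head sketch of the equation, written with plain lists so each step is visible. The three toy tokens and their 2-d queries, keys, and values are assumptions; a real transformer derives $Q$, $K$, and $V$ from learned projections of the token representations and runs many heads in parallel.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V, with one row per token."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Score this token's query against every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Blend the value vectors using the attention weights.
        out = [sum(w * v[d] for w, v in zip(weights, V)) for d in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three hypothetical tokens with 2-d queries, keys, and values.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

outputs, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` is one token's distribution over all three tokens, and each row of `outputs` is the corresponding blend of value vectors. No step depends on the previous token's result, which is why the whole computation parallelizes.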

BERT pre-trains this stack on two self-supervised tasks: Masked Language Modeling (predict randomly hidden tokens) and Next Sentence Prediction (decide whether two sentences are adjacent). Out of that comes a network in which every token's representation is built from the entire surrounding sentence in parallel.
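The masking step of Masked Language Modeling can be sketched as follows. The 15% selection rate and the 80/10/10 split (replace with [MASK], replace with a random token, leave unchanged) follow the BERT paper; the function name, the toy vocabulary, and the use of `None` to mark positions that do not contribute to the loss are assumptions for illustration.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (inputs, labels): corrupted tokens plus the originals at the selected positions."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover the original token here
            roll = rng.random()
            if roll < 0.8:
                inputs.append("[MASK]")           # 80%: hide the token
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: swap in a random token
            else:
                inputs.append(tok)                # 10%: leave it unchanged
        else:
            inputs.append(tok)
            labels.append(None)  # position is ignored by the loss
    return inputs, labels

tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens))

# A higher mask_prob than BERT's 0.15, only so that a six-token example shows some masks.
inputs, labels = mask_tokens(tokens, vocab, mask_prob=0.5, seed=0)
print(inputs)
print(labels)
```

Because the model sees tokens on both sides of every hidden position, predicting the labels forces it to use left and right context together.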

The thick edge between cat and sat is doing what an attention head does: spending most of its weight on the relationship that matters here.

What changed in vector space? — Each token's representation now mixes information from the entire sentence at once.

10. Sentence Embeddings and SBERT

BERT gives spectacular per-token vectors, but sentence-level work (semantic search, deduplication, clustering, retrieval for RAG) exposes a cost: comparing two sentences with vanilla BERT requires running them through the network together, once per pair. Comparing every pair in a corpus of a million sentences means $n(n-1)/2 \approx 5 \times 10^{11}$ forward passes, which is unworkable.

SBERT (Reimers & Gurevych, 2019) fixes this by fine-tuning BERT in a siamese setup: two identical BERT towers process two sentences independently, and a contrastive objective pulls the resulting sentence vectors together when the sentences mean the same thing and pushes them apart when they do not. After training, you can encode every document in your corpus once, store the vectors, and answer queries with a fast nearest-neighbor lookup.
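The encode-once, look-up-later pattern can be sketched as below. The sentence vectors are hand-written stand-ins for the output of a trained sentence encoder, and the query vector is likewise made up; only the retrieval logic is the point. A production system would replace the linear scan with an approximate nearest-neighbor index.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(query_vec, index, k=1):
    """Rank the stored sentences by cosine similarity to the query vector; return the top k."""
    ranked = sorted(index.items(), key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [sentence for sentence, _ in ranked[:k]]

# Hypothetical precomputed sentence vectors: each document is encoded once and stored.
index = {
    "How do I reset my password?": [0.9, 0.1, 0.0],
    "What is the capital of France?": [0.0, 0.2, 0.9],
    "The cat sat on the mat.": [0.1, 0.9, 0.1],
}

# Stand-in for the encoded query "I forgot my login password".
query_vec = [0.8, 0.2, 0.1]

top = nearest(query_vec, index, k=1)
print(top)  # ['How do I reset my password?']
```

Answering a query now costs one encoder pass for the query plus a vector comparison per stored sentence, instead of one full BERT pass per sentence pair.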

This is the foundation that semantic search, vector databases, and the retrieval half of RAG are built on. The geometry trick we learned for words now works for whole sentences.

What changed in vector space? — Geometry now works at the sentence level, not just the word level.

11. Conclusion and Bridge to LLMs

If you stand back from the individual models, the through-line is short:

The history of NLP representation is the history of making meaning measurable in vector space.

Every section was one move along that thread:

  • Identity (one-hot) — every word has a name, but no neighbors.
  • Similarity (co-occurrence, sparse VSMs) — words that share company end up in similar vectors.
  • Density and direction (Word2Vec) — meaning compresses into a low-dimensional space where analogies become arithmetic.
  • Sequence (language modeling, RNN, LSTM) — order starts to matter; representations carry history.
  • Context (ELMo) — the same word becomes different vectors in different sentences.
  • Global context (BERT, attention) — every token mixes the whole sentence in one step.
  • Sentence-level geometry (SBERT) — the same trick scales up from words to whole sentences.

Once embeddings, sequence modeling, and attention came together at scale, models stopped just understanding language and started generating it. That story — Transformers as decoders, GPT, instruction tuning, RLHF, prompting, reasoning — is Part 2.