
Embeddings & Transformers

TF-IDF gives every document a sparse vector (mostly zeros) with one dimension per vocabulary word, weighted by how often that word appears. It does not know that “car” and “automobile” are the same idea: they sit in two unrelated dimensions of the vocabulary.
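A minimal sketch of that blind spot, using raw word counts instead of full TF-IDF weights to keep it short. The three toy documents and the helper names `count_vector` and `dot` are made up for illustration:

```python
from collections import Counter

def count_vector(text, vocabulary):
    """Bag-of-words vector: one dimension per vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def dot(u, v):
    """Dot product: non-zero only if the vectors share a non-zero dimension."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical toy documents, chosen for illustration.
docs = ["cheap car", "inexpensive automobile", "cheap car rental"]
vocabulary = sorted({word for doc in docs for word in doc.split()})
vectors = [count_vector(doc, vocabulary) for doc in docs]

print(vocabulary)                   # ['automobile', 'car', 'cheap', 'inexpensive', 'rental']
print(dot(vectors[0], vectors[1]))  # 0: same meaning, but no shared dimension
print(dot(vectors[0], vectors[2]))  # 2: the shared words "cheap" and "car"
```

The first two documents mean nearly the same thing, yet their vectors have nothing in common, because overlap is only ever measured word by word.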

Embeddings flip the script: give every word a dense, short vector (300 numbers instead of 50,000) that was learned so that similar meanings end up near each other.

flowchart LR
  BOW["Bag-of-Words / TF-IDF<br/>(sparse, ~50,000 dim)<br/>counts, no meaning"] --> EMB["Word embeddings<br/>(dense, ~300 dim)<br/>similar words are close"]
  EMB --> CTX["Contextual embeddings<br/>(BERT, GPT)<br/>same word = different vector<br/>depending on sentence"]
  classDef classical fill:#fee2e2,stroke:#dc2626
  classDef neural fill:#dbeafe,stroke:#2563eb
  classDef modern fill:#d1fae5,stroke:#047857
  BOW:::classical
  EMB:::neural
  CTX:::modern
Three generations of text vectors — sparse counts, learned word vectors, contextual vectors.

In 2013, Tomas Mikolov and colleagues at Google trained a tiny neural network on billions of words of Google News text (the vectors they later published were trained on roughly 100 billion words). The task was trivial: given a word, predict the words around it. The result was a representation that captures semantic structure.

After training, every word in the vocabulary had a 300-number vector — and the geometry encoded meaning:

king − man + woman ≈ queen
Paris − France + Germany ≈ Berlin
walking − walked + swam ≈ swimming

You can literally do arithmetic on words.
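To make the arithmetic concrete, here is a minimal sketch on hand-made 2-D vectors. Everything in it is an assumption for illustration: the numbers are invented (real vectors have about 300 learned dimensions), `analogy` is a hypothetical helper, and it picks the nearest word by Euclidean distance, whereas libraries such as gensim rank by cosine similarity.

```python
import math

# Hypothetical 2-D vectors, hand-picked for illustration.
# Dimension 0 ~ "royalty", dimension 1 ~ "gender".
toy_vectors = {
    "king":   [0.9,  0.8],
    "queen":  [0.9, -0.8],
    "man":    [0.1,  0.8],
    "woman":  [0.1, -0.8],
    "prince": [0.8,  0.7],
    "apple":  [-0.5, 0.0],
}

def analogy(a, b, c, vectors):
    """Return the word nearest to vectors[a] - vectors[b] + vectors[c], excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return min(candidates, key=lambda w: math.dist(vectors[w], target))

print(analogy("king", "man", "woman", toy_vectors))  # queen
```

Subtracting man removes the "male" component, adding woman puts the "female" one in, and the royalty component is untouched, so the result lands on queen.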

word2vec’s hidden assumption: “You shall know a word by the company it keeps” (Firth, 1957). If king and queen appear in the same kinds of sentences (“the ___ ruled the kingdom”), the model is forced to give them similar vectors. Same for Paris and Berlin. Same for walking and swimming.

We get semantic similarity for free.

You almost never train word vectors yourself. You download pre-trained ones; the example below loads GloVe vectors (a close cousin of word2vec) through gensim:

import gensim.downloader as api
model = api.load("glove-wiki-gigaword-300") # ~400MB
print(model.most_similar("king", topn=3))
# e.g. [('prince', 0.82), ('queen', 0.78), ('emperor', 0.77)]  (illustrative; exact neighbours and scores depend on the vector set)
print(model.most_similar(positive=["king", "woman"], negative=["man"]))
# e.g. [('queen', 0.71), ...]

Limitation: bank (river) and bank (money) share the same vector. Static embeddings such as word2vec and GloVe have no notion of context.

What “similar meanings are close” actually looks like — cosine similarity table

Cosine similarity ranges from −1 (opposite directions) to +1 (same direction). With pre-trained 300-dim GloVe vectors, scores for a few word pairs look roughly like this (exact values depend on the vector set):

| Word A | Word B | Cosine similarity | Comment |
| --- | --- | --- | --- |
| king | queen | 0.78 | Same semantic field (royalty), different gender |
| king | prince | 0.82 | Same field, different rank: even closer |
| Paris | Berlin | 0.74 | Both European capitals |
| Paris | Tokyo | 0.51 | Capitals, but different cultural context |
| car | automobile | 0.71 | Synonyms |
| car | truck | 0.66 | Same field (vehicles) |
| car | cat | 0.18 | Distant fields |
| cat | photosynthesis | 0.04 | Almost unrelated |
| happy | joyful | 0.65 | Near-synonyms |
| happy | sad | 0.48 | Counter-intuitive: both are emotions and appear in similar grammatical positions, so static embeddings put them close. Opposites are not far apart in embedding space. |

import gensim.downloader as api
model = api.load("glove-wiki-gigaword-300")
for a, b in [("king", "queen"), ("car", "automobile"), ("happy", "sad"), ("cat", "photosynthesis")]:
    print(f"{a:14s} vs {b:14s}: {model.similarity(a, b):.2f}")
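For intuition about the −1 to +1 range, here is a minimal from-scratch sketch of cosine similarity. The function name and the 2-D test vectors are illustrative assumptions, not part of gensim:

```python
import math

def cosine_similarity(u, v):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

v = [1.0, 2.0]
same_direction = cosine_similarity(v, [2.0, 4.0])   # twice as long, same direction
orthogonal = cosine_similarity(v, [-2.0, 1.0])      # at a right angle
opposite = cosine_similarity(v, [-1.0, -2.0])       # pointing the other way

print(f"{same_direction:.2f} {orthogonal:.2f} {opposite:.2f}")  # 1.00 0.00 -1.00
```

Only the direction matters, not the length: doubling a vector leaves its cosine similarity to everything else unchanged.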

Two takeaways:

  1. Semantic similarity is captured for free — synonyms, same-field words, geographic peers all land close together.
  2. “Close” does not mean “same polarity”. happy and sad are close because they appear in the same kinds of sentences. Sentiment requires a model that has been trained to discriminate them — static embeddings alone cannot. This is one of the reasons contextual models (BERT, GPT) replaced word2vec for most production tasks.

Sentence embeddings — vectors for whole documents

If you average the word vectors of a sentence you get a sentence vector. Crude, but works:

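import numpy as np

def sentence_vec(sentence):
    # Average the vectors of the words the model knows; skip the rest.
    words = sentence.lower().split()
    vecs = [model[w] for w in words if w in model]
    if not vecs:
        # No known word: return a zero vector instead of NaN.
        return np.zeros(model.vector_size)
    return np.mean(vecs, axis=0)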
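One reason averaging is crude: it throws away word order. A minimal sketch with made-up 2-D vectors (the `toy_vectors` table and the `mean_vector` helper are hypothetical stand-ins for a real model):

```python
# Hypothetical 2-D word vectors, invented for illustration.
toy_vectors = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [1.0, 1.0]}

def mean_vector(sentence, vectors):
    """Average the vectors of the known words; None if no word is known."""
    known = [vectors[w] for w in sentence.lower().split() if w in vectors]
    if not known:
        return None
    # zip(*known) groups the values dimension by dimension.
    return [sum(column) / len(known) for column in zip(*known)]

a = mean_vector("dog bites man", toy_vectors)
b = mean_vector("man bites dog", toy_vectors)
print(a == b)  # True: two very different events, one identical vector
```

Any model that needs to tell these two sentences apart has to look at more than the bag of words.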

Modern way: sentence-transformers, which gives you strong off-the-shelf sentence vectors in a couple of lines:

from sentence_transformers import SentenceTransformer
m = SentenceTransformer("all-MiniLM-L6-v2") # ~80MB
v = m.encode("Where can I buy a cheap red wine?")
# 384-dim numpy array

Those vectors power modern search (the dense half of “hybrid search”) and RAG (Retrieval-Augmented Generation — Course 2).
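The dense-search step itself is small. Below is a minimal sketch, assuming the document and query vectors have already been computed (in practice with something like `m.encode(...)`); the 3-D vectors, the document titles, and the `search` helper are invented for illustration:

```python
import math

def normalise(v):
    """Scale a vector to length 1 so that a dot product equals cosine similarity."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def search(query_vec, doc_vecs, top_k=2):
    """Rank documents by cosine similarity to the query, best first."""
    q = normalise(query_vec)
    scored = []
    for doc_id, vec in doc_vecs.items():
        d = normalise(vec)
        scored.append((sum(a * b for a, b in zip(q, d)), doc_id))
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:top_k]]

# Hypothetical 3-D vectors standing in for real 384-dim sentence embeddings.
doc_vecs = {
    "wine shop opening hours": [0.9, 0.1, 0.0],
    "how to reset your password": [0.0, 0.2, 0.9],
    "discount bottles of merlot": [0.8, 0.3, 0.1],
}
query_vec = [1.0, 0.2, 0.0]  # stand-in for the vector of "Where can I buy a cheap red wine?"

for doc_id, score in search(query_vec, doc_vecs):
    print(f"{score:.2f}  {doc_id}")
```

A real system would normalise the document vectors once at indexing time and use a vector index rather than a Python loop, but the ranking logic is the same.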

Enter Transformers — context-aware vectors

word2vec gives bank one vector. BERT (2018) gives bank a different vector depending on the sentence:

| Sentence | Vector for “bank” |
| --- | --- |
| “I deposited cash at the bank.” | money-flavoured |
| “We sat by the river bank.” | river-flavoured |

How? Two ideas glued together: attention + stacked layers.

The attention mechanism — one sentence to rule them all

For each word in a sentence, attention asks: “which other words in this sentence should I pay attention to in order to understand me?”

flowchart TB
  S["The cat sat on the mat because it was tired"]
  S --> Q["For each word,<br/>compute attention weights<br/>over every word in the sentence"]
  Q --> A["'it' &rarr; mostly looks at 'cat'<br/>'tired' &rarr; mostly looks at 'cat'<br/>'sat' &rarr; mostly looks at 'cat', 'mat'"]
  A --> C["Each word's new vector<br/>= weighted sum over all words<br/>(itself included)"]
  classDef src fill:#fde68a,stroke:#c2410c
  classDef step fill:#dbeafe,stroke:#2563eb
  classDef out fill:#d1fae5,stroke:#047857
  S:::src
  Q:::step
  A:::step
  C:::out
Attention lets each word build its meaning from the relevant words around it.
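A minimal sketch of the idea, under loud simplifications: real Transformers apply learned query, key, and value projections and use several attention heads, while this version uses the input vectors directly. The 2-D embeddings are invented for illustration, and each word also attends to itself, as in standard self-attention.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Simplified self-attention: queries, keys and values are the inputs themselves."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Scaled dot product between this word and every word in the sentence.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in vectors]
        weights = softmax(scores)
        # New vector = attention-weighted sum of all the word vectors.
        outputs.append([sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)])
    return outputs

# Hypothetical embeddings: dimension 0 ~ "finance", dimension 1 ~ "nature".
embedding = {"cash": [3.0, 0.0], "river": [0.0, 3.0], "bank": [1.0, 1.0]}

bank_money = self_attention([embedding["cash"], embedding["bank"]])[1]
bank_river = self_attention([embedding["river"], embedding["bank"]])[1]

print("bank next to cash: ", [round(x, 2) for x in bank_money])
print("bank next to river:", [round(x, 2) for x in bank_river])
```

The input vector for bank is identical in both calls, yet the output leans towards finance in one context and towards nature in the other. That is the whole trick behind context-dependent vectors.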

Stack 12, 24, or 96 of these attention layers (each one paired with a small feed-forward network) and you get BERT, GPT, Llama, Claude… At their core, all modern LLMs are stacks of attention. That’s it.

Every Transformer starts with:

  1. Tokenisation (sub-word, Lesson 2).
  2. An embedding layer that maps each token to a vector — learned during training.

So embeddings haven’t disappeared in the LLM era — they are now inside the model, trained jointly with everything else.
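Mechanically, an embedding layer is a lookup table with one row per token. A minimal sketch, assuming a made-up four-token vocabulary and random numbers in place of trained weights (`vocab`, `embedding_table`, and `embed` are illustrative names):

```python
import random

random.seed(0)

# Hypothetical tiny vocabulary; real models have tens of thousands of sub-word tokens.
vocab = {"[UNK]": 0, "the": 1, "cat": 2, "sat": 3}
embedding_dim = 4

# The embedding layer is a (vocab_size x embedding_dim) table of numbers.
# Here it is random; in a real model, training adjusts every entry.
embedding_table = [[random.uniform(-1, 1) for _ in range(embedding_dim)] for _ in vocab]

def embed(tokens):
    """Map each token to its id, then to the matching row of the table."""
    ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]
    return [embedding_table[i] for i in ids]

vectors = embed(["the", "cat", "sat", "the"])
print(len(vectors), len(vectors[0]))  # 4 4
print(vectors[0] == vectors[3])       # True: same token, same row, before any attention
```

At this stage the two occurrences of "the" are still identical. It is the attention layers stacked on top that make the vectors context-dependent.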

| Situation | Best tool |
| --- | --- |
| 1,000 emails to classify, want it now | TF-IDF + LogisticRegression (Lesson 5) |
| Semantic search over a knowledge base | Sentence embeddings + cosine similarity |
| Chatbot, RAG, code generation | LLM (GPT, Llama, Claude) |
| Fixed-domain NER (medical, legal) | Fine-tuned BERT |
| Translation, summarisation | Pre-trained Transformer (mBART, T5) |
| Offline, low-resource | TF-IDF or word2vec; larger embedding models and LLMs are expensive |

Rule of thumb: always run the TF-IDF baseline first. If you reach the target metric, ship. Only escalate to embeddings or LLMs if you need to.

The classical → modern bridge, in one paragraph

You’ve now climbed the full ladder of NLP representations:

  1. Token — the raw unit (Lesson 2).
  2. Normalisation — make different writings look identical (Lesson 2).
  3. Linguistic cleaning — stop words, stems, lemmas (Lesson 3).
  4. Count vectors — Bag-of-Words (Lesson 4).
  5. Weighted vectors — TF-IDF (Lesson 5).
  6. Dense vectors — word2vec, sentence embeddings (this lesson).
  7. Contextual dense vectors — BERT, GPT — built from layers of attention (this lesson).

The vocabulary stays the same. The maths gets fancier. Modern LLMs are a very fancy version of step 7.

That’s the whole bridge from a .txt file to an LLM. From here, Course 2 takes over — you’ll run a real LLM on your machine, plug it into tools, and build an agent.

  • Embeddings = dense, short vectors where similar meanings are geometrically close.
  • word2vec (2013) popularised the idea; today we use sentence-transformers for off-the-shelf sentence vectors.
  • Transformers (BERT, GPT) give context-dependent vectors thanks to attention.
  • All modern LLMs are stacks of attention layers fed by a learned embedding.
  • Practical rule: TF-IDF baseline first, embeddings if it’s not enough, LLMs if even that isn’t.

You’ve finished Course 1. Next stop: Course 2 — Coding with a local LLM — running models on your own machine, plugging them into tools, building an agent that writes Java code.