Embeddings & Transformers
The leap: from counts to meaning
TF-IDF gives every document a sparse vector (mostly zeros) with one slot per vocabulary word, weighted by how often that word appears. It does not know that “car” and “automobile” are the same idea: the two words occupy unrelated slots, so they look exactly as different as “car” and “banana”.
Embeddings flip the script: give every word a dense, short vector (300 numbers instead of 50,000) that was learned so that similar meanings end up near each other.
```mermaid
flowchart LR
    BOW["Bag-of-Words / TF-IDF<br/>(sparse, ~50,000 dim)<br/>counts, no meaning"] --> EMB["Word embeddings<br/>(dense, ~300 dim)<br/>similar words are close"]
    EMB --> CTX["Contextual embeddings<br/>(BERT, GPT)<br/>same word = different vector<br/>depending on sentence"]
    classDef classical fill:#fee2e2,stroke:#dc2626
    classDef neural fill:#dbeafe,stroke:#2563eb
    classDef modern fill:#d1fae5,stroke:#047857
    BOW:::classical
    EMB:::neural
    CTX:::modern
```
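To make the contrast concrete, here is a minimal sketch in plain Python. The five-word vocabulary and the 3-number dense vectors are invented for illustration (real embeddings are learned, not written by hand); only the geometry matters.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vocabulary (hypothetical); a real one has tens of thousands of entries.
vocab = ["car", "automobile", "truck", "cat", "banana"]

def one_hot(word):
    """Count-style vector for a single word: a 1 in the word's own slot."""
    return [1.0 if w == word else 0.0 for w in vocab]

# Hand-written dense vectors (hypothetical values, not learned).
dense = {
    "car":        [0.9, 0.8, 0.1],
    "automobile": [0.8, 0.9, 0.1],
    "cat":        [0.1, 0.2, 0.9],
}

sparse_sim = cosine(one_hot("car"), one_hot("automobile"))
dense_sim = cosine(dense["car"], dense["automobile"])
dense_far = cosine(dense["car"], dense["cat"])

print(f"one-hot car vs automobile: {sparse_sim:.2f}")
print(f"dense   car vs automobile: {dense_sim:.2f}")
print(f"dense   car vs cat:        {dense_far:.2f}")
```

In the count-based view the two synonyms share no slot, so their similarity is zero. In the dense view, similarity is a matter of degree.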
word2vec — the 2013 breakthrough
In 2013, Mikolov and colleagues at Google trained a small neural network on a Google News corpus of several billion words. The task was simple: given a word, predict the words around it. The result was a representation that captures semantic structure.
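The "predict the words around it" task (the skip-gram variant of word2vec) boils down to generating (center, context) training pairs from raw text. The sketch below shows only that pair generation; the function name `skipgram_pairs` and the window size are illustrative choices, and the neural network that learns from these pairs is left out.

```python
def skipgram_pairs(tokens, window=2):
    """Pair each word with the neighbours at most `window` positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the king ruled the kingdom".split()
pairs = skipgram_pairs(tokens, window=1)
for center, context in pairs:
    print(center, "->", context)
```

Each pair becomes one training example: the network sees the center word and is nudged towards predicting the context word.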
After training, every word in the vocabulary had a 300-number vector — and the geometry encoded meaning:
```
king − man + woman ≈ queen
Paris − France + Germany ≈ Berlin
walking − walked + swam ≈ swimming
```

You can literally do arithmetic on words.
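The arithmetic can be reproduced on a toy scale. The vectors below are hand-written for illustration (three made-up axes: royalty, male, female), not learned ones, and `analogy` is a hypothetical helper. The mechanics are the usual ones: add and subtract the vectors, then pick the nearest remaining word by cosine similarity.

```python
import math

# Hand-written toy vectors (hypothetical): [royalty, male, female].
vectors = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

result = analogy("king", "man", "woman")
print(result)  # queen
```

Excluding the input words from the candidates matters: otherwise the nearest neighbour of the result is often one of the inputs themselves.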
Why does this work?
word2vec’s hidden assumption is the distributional hypothesis: “you shall know a word by the company it keeps” (Firth, 1957). If king and queen appear in the same kinds of sentences (“the ___ ruled the kingdom”), training pushes their vectors close together. The same holds for Paris and Berlin, and for walking and swimming.
We get semantic similarity for free.
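The "company it keeps" idea can be seen without any neural network. The sketch below is a count-based illustration, not word2vec itself: it represents each word by the counts of the words that share a sentence with it, over a three-sentence corpus made up for this example.

```python
import math
from collections import Counter, defaultdict

# Tiny made-up corpus for illustration.
corpus = [
    "the king ruled the kingdom",
    "the queen ruled the kingdom",
    "the dog chased the ball",
]

# For every word, count the other words that share a sentence with it.
contexts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        for j, neighbour in enumerate(tokens):
            if i != j:
                contexts[word][neighbour] += 1

def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(c1[w] * c2[w] for w in c1)
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2)

king_queen = cosine(contexts["king"], contexts["queen"])
king_dog = cosine(contexts["king"], contexts["dog"])
print(f"king vs queen: {king_queen:.2f}")
print(f"king vs dog:   {king_dog:.2f}")
```

king and queen keep exactly the same company here, so their context vectors coincide. king and dog only share the word "the", which is why real systems down-weight such frequent words.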
Using pre-trained embeddings
You almost never train word2vec yourself. Instead, you download pre-trained vectors (GloVe vectors here, loaded through gensim):

```python
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-300")  # ~400MB download on first use

# Nearest neighbours of a word (scores are illustrative; yours may differ)
print(model.most_similar("king", topn=3))
# [('prince', 0.82), ('queen', 0.78), ('emperor', 0.77)]

# Analogy: king - man + woman
print(model.most_similar(positive=["king", "woman"], negative=["man"]))
# [('queen', 0.71), ...]
```

Limitation: bank (river) and bank (money) share the same vector. Static embeddings such as word2vec and GloVe have no notion of context.
What “similar meanings are close” actually looks like — cosine similarity table
Cosine similarity ranges from −1 (opposite direction) to +1 (same direction). With pre-trained 300-dim GloVe vectors, here are indicative scores for a few word pairs (exact values depend on the model and its training corpus):
| Word A | Word B | Cosine similarity | Comment |
|---|---|---|---|
| king | queen | 0.78 | Same semantic field (royalty), different gender |
| king | prince | 0.82 | Same field, different rank; even closer |
| Paris | Berlin | 0.74 | Both European capitals |
| Paris | Tokyo | 0.51 | Capitals, but different cultural context |
| car | automobile | 0.71 | Synonyms |
| car | truck | 0.66 | Same field (vehicles) |
| car | cat | 0.18 | Distant fields |
| cat | photosynthesis | 0.04 | Almost unrelated |
| happy | joyful | 0.65 | Near-synonyms |
| happy | sad | 0.48 | Counter-intuitive: both are emotions and appear in similar grammatical positions, so the embedding places them close. Opposites are not far apart in embedding space. |
```python
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-300")

pairs = [("king", "queen"), ("car", "automobile"), ("happy", "sad"), ("cat", "photosynthesis")]
for a, b in pairs:
    print(f"{a:14s} vs {b:14s}: {model.similarity(a, b):.2f}")
```

Two takeaways:
- Semantic similarity is captured for free — synonyms, same-field words, geographic peers all land close together.
- “Close” does not mean “same polarity”. happy and sad are close because they appear in the same kinds of sentences. Telling them apart by sentiment requires a model that has been trained to discriminate them; static embeddings alone cannot. This is one of the reasons contextual models (BERT, GPT) replaced word2vec for most production tasks.
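When reading such scores, it helps to see the three reference points of cosine similarity on vectors small enough to check by hand. The sketch below uses made-up 2-number vectors; note that only the direction matters, not the length.

```python
import math

def cosine(u, v):
    """Cosine of the angle between u and v: dot product over the product of lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = [1.0, 2.0]
same_direction = cosine(a, [2.0, 4.0])    # twice as long, same direction
opposite = cosine(a, [-1.0, -2.0])        # flipped
perpendicular = cosine(a, [2.0, -1.0])    # at a right angle

print(f"same direction: {same_direction:+.2f}")
print(f"opposite:       {opposite:+.2f}")
print(f"perpendicular:  {perpendicular:+.2f}")
```

A score such as 0.48 for happy and sad therefore means "pointing in a broadly similar direction", which says nothing about sentiment.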
Sentence embeddings — vectors for whole documents
If you average the word vectors of a sentence, you get a sentence vector. Crude, but it works:

```python
import numpy as np

def sentence_vec(sentence):
    """Average the vectors of the words the model knows (model = the gensim vectors loaded earlier)."""
    words = sentence.lower().split()
    vecs = [model[w] for w in words if w in model]
    if not vecs:  # no known words: avoid averaging an empty list
        return np.zeros(model.vector_size)
    return np.mean(vecs, axis=0)
```

The modern way is sentence-transformers, which gives you strong general-purpose sentence vectors in a couple of lines:

```python
from sentence_transformers import SentenceTransformer

m = SentenceTransformer("all-MiniLM-L6-v2")  # ~80MB download on first use

v = m.encode("Where can I buy a cheap red wine?")
# v is a 384-dim numpy array
```

Those vectors power modern search (the dense half of “hybrid search”) and RAG (Retrieval-Augmented Generation, covered in Course 2).
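The search use case reduces to "embed everything once, then rank by cosine similarity to the query vector". The sketch below shows that ranking step only. The document vectors and the query vector are hypothetical 3-number stand-ins for what an encoder would return, and `search` is an illustrative helper, not part of any library.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical pre-computed sentence vectors (3 numbers for readability;
# a real encoder such as all-MiniLM-L6-v2 returns 384).
documents = {
    "Our shop sells affordable red wines.": [0.9, 0.1, 0.2],
    "How to reset your router password.":   [0.1, 0.9, 0.1],
    "Best hiking trails near the lake.":    [0.2, 0.1, 0.9],
}

# Stand-in for the encoded query "Where can I buy a cheap red wine?"
query_vec = [0.8, 0.2, 0.1]

def search(query_vec, documents, top_k=2):
    """Rank documents by cosine similarity to the query vector, best first."""
    scored = [(cosine(query_vec, vec), text) for text, vec in documents.items()]
    scored.sort(reverse=True)
    return scored[:top_k]

results = search(query_vec, documents)
for score, text in results:
    print(f"{score:.2f}  {text}")
```

Note that the query and the best match share almost no exact words ("cheap" vs "affordable", "buy" vs "sells"); with real sentence vectors, that is the gap dense retrieval closes compared with keyword matching.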
Enter Transformers — context-aware vectors
word2vec gives bank one vector. BERT (2018) gives bank a different vector depending on the sentence:
| Sentence | Vector for “bank” |
|---|---|
| “I deposited cash at the bank.” | money-flavoured |
| “We sat by the river bank.” | river-flavoured |
How? Two ideas glued together: attention + stacked layers.
The attention mechanism — one sentence to rule them all
For each word in a sentence, attention asks: “which other words in this sentence should I pay attention to in order to understand me?”
```mermaid
flowchart TB
    S["The cat sat on the mat because it was tired"]
    S --> Q["For each word,<br/>compute attention weights<br/>over every other word"]
    Q --> A["'it' → mostly looks at 'cat'<br/>'tired' → mostly looks at 'cat'<br/>'sat' → mostly looks at 'cat', 'mat'"]
    A --> C["Each word's final vector<br/>= weighted sum of the others"]
    classDef src fill:#fde68a,stroke:#c2410c
    classDef step fill:#dbeafe,stroke:#2563eb
    classDef out fill:#d1fae5,stroke:#047857
    S:::src
    Q:::step
    A:::step
    C:::out
```
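The mechanism can be sketched in a few lines of plain Python. This is a simplified version under stated assumptions: queries, keys and values are the input vectors themselves (a real Transformer layer applies learned projections first), each word also attends to itself, and the 2-number embeddings in `static` are hand-written for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Simplified self-attention: queries, keys and values are all the
    input vectors themselves (no learned projections)."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Scaled dot product of this word against every word (itself included).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # New vector = weighted sum of all input vectors.
        mixed = [sum(w * vec[d] for w, vec in zip(weights, vectors))
                 for d in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical static embeddings: [money-ness, nature-ness].
static = {"cash": [2.0, 0.0], "river": [0.0, 2.0], "bank": [1.0, 1.0]}

money_out, money_w = self_attention([static["cash"], static["bank"]])
river_out, river_w = self_attention([static["river"], static["bank"]])

print("bank next to 'cash': ", [round(x, 2) for x in money_out[1]])
print("bank next to 'river':", [round(x, 2) for x in river_out[1]])
```

The input vector for bank is identical in both calls, yet its output leans towards money in one sentence and towards nature in the other. That is the "same word, different vector" effect from the table above, in miniature.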
Stack 12, 24, or 96 such layers on top of each other (in practice each layer pairs attention with a small feed-forward network) and you get BERT, GPT, Llama, Claude… All modern LLMs are, at their core, stacks of attention layers.
Where the words come in
Every Transformer starts with:
- Tokenisation (sub-word, Lesson 2).
- An embedding layer that maps each token to a vector — learned during training.
So embeddings haven’t disappeared in the LLM era — they are now inside the model, trained jointly with everything else.
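An embedding layer is, mechanically, a lookup table: one row of numbers per token id. The sketch below uses a made-up five-entry sub-word vocabulary and random rows; in a real model the rows start random and are adjusted during training, and the names `vocab`, `embedding_table` and `embed` are illustrative.

```python
import random

random.seed(0)  # reproducible toy example

# Hypothetical sub-word vocabulary; real tokenisers learn tens of thousands of pieces.
vocab = {"[UNK]": 0, "un": 1, "believ": 2, "able": 3, "cat": 4}
embedding_dim = 4

# The embedding layer: one row per token id.
embedding_table = [
    [random.uniform(-1, 1) for _ in range(embedding_dim)]
    for _ in vocab
]

def embed(tokens):
    """Map tokens to ids, then ids to rows of the embedding table."""
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    return ids, [embedding_table[i] for i in ids]

ids, vectors = embed(["un", "believ", "able", "dog"])
print(ids)  # [1, 2, 3, 0]
print(len(vectors), "vectors of length", len(vectors[0]))
```

The unknown token "dog" falls back to the `[UNK]` row. The resulting list of vectors is what the first attention layer receives as input.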
When to use what — a practical guide
| Situation | Best tool |
|---|---|
| 1,000 emails to classify, want it now | TF-IDF + LogisticRegression (Lesson 5) |
| Semantic search over a knowledge base | Sentence embeddings + cosine similarity |
| Chatbot, RAG, code generation | LLM (GPT, Llama, Claude) |
| Fixed-domain NER (medical, legal) | Fine-tuned BERT |
| Translation, summarisation | Pre-trained Transformer (mBART, T5) |
| Offline, low-resource | TF-IDF or word2vec; large embedding models and LLMs are expensive to run |
Rule of thumb: always run the TF-IDF baseline first. If you reach the target metric, ship. Only escalate to embeddings or LLMs if you need to.
The classical → modern bridge, in one paragraph
You’ve now climbed the full ladder of NLP representations:
1. Token: the raw unit (Lesson 2).
2. Normalisation: make different writings look identical (Lesson 2).
3. Linguistic cleaning: stop words, stems, lemmas (Lesson 3).
4. Count vectors: Bag-of-Words (Lesson 4).
5. Weighted vectors: TF-IDF (Lesson 5).
6. Dense vectors: word2vec, sentence embeddings (this lesson).
7. Contextual dense vectors: BERT, GPT, built from layers of attention (this lesson).
The vocabulary stays the same. The maths gets fancier. Modern LLMs are a very fancy version of step 7.
That’s the whole bridge from a .txt file to an LLM. From here, Course 2 takes over — you’ll run a real LLM on your machine, plug it into tools, and build an agent.
Key takeaways
- Embeddings = dense, short vectors where similar meanings are geometrically close.
- word2vec (2013) popularised the idea; today we use sentence-transformers for off-the-shelf sentence vectors.
- Transformers (BERT, GPT) give context-dependent vectors thanks to attention.
- All modern LLMs are stacks of attention layers fed by a learned embedding.
- Practical rule: TF-IDF baseline first, embeddings if it’s not enough, LLMs if even that isn’t.
You’ve finished Course 1. Next stop: Course 2 — Coding with a local LLM — running models on your own machine, plugging them into tools, building an agent that writes Java code.