Embeddings & Transformers

The leap: from counts to meaning

TF-IDF gives every document a sparse vector (mostly zeros) with one dimension per vocabulary word, weighted by how often that word appears. It does not know that “car” and “automobile” are the same idea: the two words occupy completely unrelated dimensions of the vocabulary.

Embeddings flip the script: give every word a dense, short vector (300 numbers instead of 50,000) that was learned so that similar meanings end up near each other.
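To make the contrast concrete, here is a minimal sketch in plain Python. It uses raw counts rather than TF-IDF weights (the weighting changes the values, not which dimensions are non-zero), and the three dense vectors are hand-made, hypothetical numbers rather than learned embeddings.

```python
# Toy vocabulary: in a bag-of-words model, each word owns one dimension.
vocab = ["automobile", "car", "cat", "drives", "fast", "the"]

def count_vector(text):
    """Return a count vector with one slot per vocabulary word."""
    words = text.lower().split()
    return [words.count(term) for term in vocab]

def dot(u, v):
    """Dot product: stays 0 when two vectors share no non-zero dimension."""
    return sum(a * b for a, b in zip(u, v))

car_doc = count_vector("car")
auto_doc = count_vector("automobile")
print(car_doc, auto_doc, dot(car_doc, auto_doc))

# Hand-made 3-number "embeddings" (hypothetical values, not learned).
dense = {
    "car":        [0.9, 0.1, 0.0],
    "automobile": [0.8, 0.2, 0.0],
    "cat":        [0.0, 0.1, 0.9],
}
print(dot(dense["car"], dense["automobile"]))  # large overlap
print(dot(dense["car"], dense["cat"]))         # small overlap
```

With sparse vectors the two synonyms have nothing in common; with dense vectors, similarity becomes a matter of degree.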

flowchart LR
  BOW["Bag-of-Words / TF-IDF<br/>(sparse, ~50,000 dim)<br/>counts, no meaning"] --> EMB["Word embeddings<br/>(dense, ~300 dim)<br/>similar words are close"]
  EMB --> CTX["Contextual embeddings<br/>(BERT, GPT)<br/>same word = different vector<br/>depending on sentence"]
  classDef classical fill:#fee2e2,stroke:#dc2626
  classDef neural fill:#dbeafe,stroke:#2563eb
  classDef modern fill:#d1fae5,stroke:#047857
  BOW:::classical
  EMB:::neural
  CTX:::modern

Three generations of text vectors — sparse counts, learned word vectors, contextual vectors.

word2vec — the 2013 breakthrough

In 2013, Tomas Mikolov and colleagues at Google trained a small neural network on billions of words of news text. The task was simple: given a word, predict the words around it (the skip-gram variant of word2vec). The result was a representation that captures semantic structure.
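The "predict the words around it" task boils down to turning raw text into (word, neighbour) training pairs. The sketch below shows only that data-preparation step, not the network itself; the function name and the `window` parameter are illustrative assumptions.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs: each word paired with its neighbours."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the king ruled the kingdom".split()
pairs = skipgram_pairs(tokens, window=1)
for center, context in pairs:
    print(center, "->", context)
```

The network is then trained to assign high probability to each context word given its center word.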

After training, every word in the vocabulary had a 300-number vector — and the geometry encoded meaning:

king − man + woman ≈ queen
Paris − France + Germany ≈ Berlin
walking − walked + swam ≈ swimming

You can literally do arithmetic on words.
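A minimal sketch of that arithmetic, using hand-made 2-D vectors (hypothetical values chosen so that one axis loosely tracks "royalty" and the other "gender"). Real embeddings have hundreds of dimensions and are learned, so the fit is never this clean.

```python
import math

# Hypothetical toy vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "king":  [0.90, 0.90],
    "queen": [0.85, -0.95],
    "man":   [0.10, 0.90],
    "woman": [0.10, -0.90],
    "apple": [-0.80, 0.00],
}

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), skipping the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("king", "man", "woman"))
```

Skipping the input words mirrors what embedding libraries usually do; otherwise the nearest neighbour is often one of the query words themselves.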

Why does this work?

word2vec’s underlying assumption is the distributional hypothesis: “You shall know a word by the company it keeps” (Firth, 1957). If king and queen appear in the same kinds of sentences (“the ___ ruled the kingdom”), the model is forced to give them similar vectors. Same for Paris and Berlin. Same for walking and swimming.

We get semantic similarity for free.
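The idea can be checked without any neural network. The sketch below is a count-based stand-in, not word2vec itself: it describes each word by the words found near it in a tiny made-up corpus, then compares those descriptions.

```python
import math
from collections import Counter

# Tiny made-up corpus for illustration.
corpus = [
    "the king ruled the kingdom",
    "the queen ruled the kingdom",
    "the king wore a crown",
    "the queen wore a crown",
    "the cat chased a mouse",
]

def context_counts(word, window=2):
    """Count the words that appear within `window` positions of `word`."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        for i, tok in enumerate(tokens):
            if tok != word:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts

def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[k] * c2[k] for k in c1.keys() & c2.keys())
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2)

king, queen, cat = (context_counts(w) for w in ("king", "queen", "cat"))
print("king vs queen:", round(cosine(king, queen), 2))
print("king vs cat:  ", round(cosine(king, cat), 2))
```

king and queen keep identical company in this corpus, so their context vectors match; cat only shares filler words such as "the" and "a".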

Using pre-trained embeddings

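You almost never train word2vec yourself. You download pre-trained vectors instead. The snippet below uses gensim to load GloVe vectors, a closely related static-embedding method that is queried through the same API: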

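import gensim.downloader as api

# Downloads ~400MB on first use, then loads from the local cache.
model = api.load("glove-wiki-gigaword-300")

# Nearest neighbours by cosine similarity.
print(model.most_similar("king", topn=3))
# Illustrative output (exact neighbours and scores depend on the vectors):
# [('prince', 0.82), ('queen', 0.78), ('emperor', 0.77)]

# king - man + woman: add the positive words, subtract the negative ones.
print(model.most_similar(positive=["king", "woman"], negative=["man"]))
# [('queen', 0.71), ...]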

Limitation: bank (river) and bank (money) share the same vector. Static embeddings such as word2vec and GloVe have no notion of context.

What “similar meanings are close” actually looks like — cosine similarity table

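Cosine similarity ranges from −1 (opposite direction) to +1 (same direction). With pre-trained 300-dim GloVe vectors, scores for a few word pairs look like this. Treat the values as illustrative: exact numbers vary with the model, and this model's vocabulary is lowercase, so query `paris` rather than `Paris`.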

Word A	Word B	Cosine similarity	Comment
`king`	`queen`	0.78	Same semantic field (royalty), different gender
`king`	`prince`	0.82	Same field, different rank — even closer
`Paris`	`Berlin`	0.74	Both European capitals
`Paris`	`Tokyo`	0.51	Capitals — but different cultural context
`car`	`automobile`	0.71	Synonyms
`car`	`truck`	0.66	Same field (vehicles)
`car`	`cat`	0.18	Distant fields
`cat`	`photosynthesis`	0.04	Almost unrelated
`happy`	`joyful`	0.65	Near-synonyms
`happy`	`sad`	0.48	Counter-intuitive: both are emotions and appear in similar grammatical positions, so static embeddings place them close. Opposites are not far apart in embedding space.

import gensim.downloader as api
model = api.load("glove-wiki-gigaword-300")

for a, b in [("king","queen"), ("car","automobile"), ("happy","sad"), ("cat","photosynthesis")]:
    print(f"{a:14s} vs {b:14s}: {model.similarity(a, b):.2f}")
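For reference, `model.similarity` computes cosine similarity: the dot product of the two vectors divided by the product of their lengths. A minimal plain-Python sketch on toy 2-D vectors, showing the three landmark values of the range:

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| * |v|); depends on direction, not length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same = cosine_similarity([1, 2], [2, 4])        # same direction, different length
orthogonal = cosine_similarity([1, 0], [0, 1])  # nothing in common
opposite = cosine_similarity([1, 2], [-1, -2])  # opposite direction
print(round(same, 2), round(orthogonal, 2), round(opposite, 2))
```

Because only direction matters, a vector and a scaled copy of it count as identical.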

Two takeaways:

Semantic similarity is captured for free — synonyms, same-field words, geographic peers all land close together.
“Close” does not mean “same polarity”. happy and sad are close because they appear in the same kinds of sentences. Sentiment requires a model that has been trained to discriminate them — static embeddings alone cannot. This is one of the reasons contextual models (BERT, GPT) replaced word2vec for most production tasks.

Sentence embeddings — vectors for whole documents

If you average the word vectors of a sentence you get a sentence vector. Crude, but works:

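import numpy as np

def sentence_vec(sentence):
    """Average the vectors of the words that `model` (loaded above) knows."""
    words = sentence.lower().split()
    vecs = [model[w] for w in words if w in model]
    if not vecs:
        # No known word: return zeros instead of NaN from an empty mean.
        return np.zeros(model.vector_size)
    return np.mean(vecs, axis=0)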
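One reason averaging is crude: it ignores word order. The sketch below swaps in a hypothetical three-word toy model (hand-made 2-D vectors) so it runs without downloading anything; the averaging logic is the same idea as above.

```python
# Hand-made 2-D word vectors (hypothetical values) standing in for `model`.
toy_model = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [1.0, 1.0]}

def average_vec(sentence):
    """Average the vectors of the known words, component by component."""
    vecs = [toy_model[w] for w in sentence.lower().split() if w in toy_model]
    return [sum(column) / len(vecs) for column in zip(*vecs)]

first = average_vec("dog bites man")
second = average_vec("man bites dog")
print(first)
print(second)
print("identical:", first == second)
```

Two sentences with opposite meanings get exactly the same vector, which is the kind of information a model trained on whole sentences can preserve.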

Modern way: sentence-transformers, a library that gives you strong general-purpose sentence vectors in a couple of lines:

from sentence_transformers import SentenceTransformer
m = SentenceTransformer("all-MiniLM-L6-v2")   # ~80MB

v = m.encode("Where can I buy a cheap red wine?")
# 384-dim numpy array

Those vectors power modern search (the dense half of “hybrid search”) and RAG (Retrieval-Augmented Generation — Course 2).
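The retrieval step of dense search can be sketched as "embed everything, then rank by cosine similarity". In the example below the document ids and 3-D vectors are hypothetical placeholders; in practice each vector would come from a call such as `m.encode(text)`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def search(query_vec, doc_vecs, top_k=2):
    """Return the top_k (doc_id, score) pairs, best match first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Hypothetical pre-computed document vectors.
doc_vecs = {
    "wine-shop-faq": [0.9, 0.1, 0.0],
    "beer-guide":    [0.6, 0.5, 0.1],
    "java-tutorial": [0.0, 0.1, 0.9],
}
query_vec = [0.8, 0.2, 0.0]  # stands in for the encoded question

results = search(query_vec, doc_vecs)
for doc_id, score in results:
    print(f"{doc_id:15s} {score:.2f}")
```

A linear scan like this is fine for a few thousand documents; larger collections typically rely on an approximate nearest-neighbour index.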

Enter Transformers — context-aware vectors

word2vec gives bank one vector. BERT (2018) gives bank a different vector depending on the sentence:

Sentence	Vector for “bank”
“I deposited cash at the bank.”	money-flavoured
“We sat by the river bank.”	river-flavoured

How? Two ideas glued together: attention + stacked layers.

The attention mechanism — one sentence to rule them all

For each word in a sentence, attention asks: “which other words in this sentence should I pay attention to in order to understand me?”

flowchart TB
  S["The cat sat on the mat because it was tired"]
  S --> Q["For each word,<br/>compute attention weights<br/>over every word in the sentence"]
  Q --> A["'it' &rarr; mostly looks at 'cat'<br/>'tired' &rarr; mostly looks at 'cat'<br/>'sat' &rarr; mostly looks at 'cat', 'mat'"]
  A --> C["Each word's final vector<br/>= weighted sum over all words<br/>(itself included)"]
  classDef src fill:#fde68a,stroke:#c2410c
  classDef step fill:#dbeafe,stroke:#2563eb
  classDef out fill:#d1fae5,stroke:#047857
  S:::src
  Q:::step
  A:::step
  C:::out

Attention lets each word build its meaning from the relevant words around it.
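A minimal sketch of that computation in plain Python. It is deliberately simplified: queries, keys and values are the input vectors themselves, whereas real Transformers first apply three learned projections and run several attention heads in parallel. The 2-D token vectors are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    biggest = max(scores)
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Simplified self-attention: each output is a weighted sum of all inputs.

    Assumption: no learned projections, so query = key = value = input vector.
    """
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Scaled dot product between this word and every word in the sentence.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * vec[i] for w, vec in zip(weights, vectors)) for i in range(dim)]
        )
    return outputs

# Hypothetical token vectors: axis 0 ~ "finance", axis 1 ~ "nature".
toy = {"cash": [2.0, 0.0], "river": [0.0, 2.0], "bank": [1.0, 1.0]}

money_ctx = self_attention([toy["cash"], toy["bank"]])
river_ctx = self_attention([toy["river"], toy["bank"]])
print("bank next to 'cash': ", money_ctx[1])
print("bank next to 'river':", river_ctx[1])
```

The input vector for "bank" is the same in both runs, yet its output leans towards finance in one context and towards nature in the other. That is the mechanism behind context-dependent vectors.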

Stack 12, 24, or 96 of these layers (each one pairs attention with a small feed-forward network) and you get BERT, GPT, Llama, Claude… At their core, all modern LLMs are stacks of attention layers.

Where the words come in

Every Transformer starts with:

Tokenisation (sub-word, Lesson 2).
An embedding layer that maps each token to a vector — learned during training.

So embeddings haven’t disappeared in the LLM era — they are now inside the model, trained jointly with everything else.
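Mechanically, an embedding layer is a lookup table with one row per token id. The sketch below uses a hypothetical five-entry sub-word vocabulary and random numbers in place of trained weights.

```python
import random

random.seed(0)  # reproducible toy values

# Hypothetical sub-word vocabulary and embedding size.
token_to_id = {"[UNK]": 0, "em": 1, "##bed": 2, "##ding": 3, "s": 4}
embedding_dim = 4

# The embedding layer is a table: one row of numbers per token id.
# Random here; in a real model the rows are adjusted during training.
embedding_table = [
    [random.uniform(-1, 1) for _ in range(embedding_dim)]
    for _ in token_to_id
]

def embed(tokens):
    """Map tokens to ids, then ids to rows of the embedding table."""
    ids = [token_to_id.get(tok, token_to_id["[UNK]"]) for tok in tokens]
    return [embedding_table[i] for i in ids]

vectors = embed(["em", "##bed", "##ding", "s"])
print(len(vectors), "tokens ->", len(vectors[0]), "numbers each")
```

These rows are what the first attention layer receives as input, and they are updated by the same training process as the rest of the model.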

When to use what — a practical guide

Situation	Best tool
1,000 emails to classify, want it now	TF-IDF + LogisticRegression (Lesson 5)
Semantic search over a knowledge base	Sentence embeddings + cosine similarity
Chatbot, RAG, code generation	LLM (GPT, Llama, Claude)
Fixed-domain NER (medical, legal)	Fine-tuned BERT
Translation, summarisation	Pre-trained Transformer (mBART, T5)
Offline, low-resource	TF-IDF or word2vec: large Transformer models and LLMs are expensive to run

Rule of thumb: always run the TF-IDF baseline first. If you reach the target metric, ship. Only escalate to embeddings or LLMs if you need to.
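One way to read this rule is as an escalation ladder: evaluate the cheapest option first and stop as soon as the target is met. In the sketch below the function name is an assumption and the scores are placeholders, not measurements; each lambda would be replaced by a real evaluation on your validation set.

```python
def pick_model(candidates, target):
    """Return the first (cheapest) candidate whose score meets the target.

    `candidates` is ordered from cheapest to most expensive; each entry is
    (name, evaluate), where evaluate() returns the metric on a validation set.
    """
    for name, evaluate in candidates:
        score = evaluate()
        if score >= target:
            return name, score  # good enough: more expensive options never run
    return None, None

# Placeholder scores (hypothetical): replace each lambda with a real evaluation.
candidates = [
    ("tfidf + logistic regression", lambda: 0.86),
    ("sentence embeddings + classifier", lambda: 0.91),
    ("LLM", lambda: 0.94),
]
print(pick_model(candidates, target=0.90))
```

Because evaluation is lazy, the expensive models are only ever run when the cheaper ones fall short.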

The classical → modern bridge, in one paragraph

You’ve now climbed the full ladder of NLP representations:

1. Token: the raw unit (Lesson 2).
2. Normalisation: make different writings look identical (Lesson 2).
3. Linguistic cleaning: stop words, stems, lemmas (Lesson 3).
4. Count vectors: Bag-of-Words (Lesson 4).
5. Weighted vectors: TF-IDF (Lesson 5).
6. Dense vectors: word2vec, sentence embeddings (this lesson).
7. Contextual dense vectors (BERT, GPT), built from layers of attention (this lesson).

The vocabulary stays the same. The maths gets fancier. Modern LLMs are a very fancy version of step 7.

That’s the whole bridge from a .txt file to an LLM. From here, Course 2 takes over — you’ll run a real LLM on your machine, plug it into tools, and build an agent.

Key takeaways

Embeddings = dense, short vectors where similar meanings are geometrically close.
word2vec (2013) popularised the idea; today we use sentence-transformers for off-the-shelf sentence vectors.
Transformers (BERT, GPT) give context-dependent vectors thanks to attention.
All modern LLMs are stacks of attention layers fed by a learned embedding.
Practical rule: TF-IDF baseline first, embeddings if it’s not enough, LLMs if even that isn’t.

You’ve finished Course 1. Next stop: Course 2 — Coding with a local LLM — running models on your own machine, plugging them into tools, building an agent that writes Java code.