Skip to content

What is an LLM

Duration: 8 min Prerequisites: none

An LLM (Large Language Model) is a huge mathematical function that takes a chunk of text as input and predicts the most likely next word. Everything else follows from that.


An LLM does not generate a full answer in one go. It predicts a single token (a piece of a word, sometimes a whole word), appends it to the input, starts again, and so on until it predicts a special “end of answer” token.

Input: "Write a Java function that computes the"
[LLM, ~8 billion parameters]
Probability distribution for the next token:
" factorial" → 47%
" sum" → 19%
" power" → 8%
...
We pick "factorial", put everything back in,
and repeat for the next word.

That’s all. There is no little brain that “understands” the question. There is a probability distribution over ~150 000 possible tokens, and we draw the next one.

Concrete examples: what a "token" actually looks like

A token is not a word. It can be shorter or longer, and the same word may even be split differently depending on context.

| Text | Tokens (separated by |) | Count | | --- | --- | --- | | Hello, world! | Hello , world ! | 4 | | ChatGPT is cool | Chat G PT is cool | 5 | | def factorial(n): | def factorial ( n ): | 5 | | Méta-cognition (FR) | M é ta - cog nition | 6 | | 🚀 | <0xF0> <0x9F> <0x9A> <0x80> | 4 |

Practical consequences:

  • 1 common English word is roughly 1 token.
  • 1 accented French word is often 2 to 3 tokens.
  • 1 emoji is 3 to 4 tokens (UTF-8 byte encoding).
  • A block of code uses many tokens: every punctuation mark, indentation and operator counts.

This is why num_ctx = 20480 tokens does not mean 20480 words; it corresponds to roughly 12000 to 15000 words of English prose, or less for French and code.


When we say qwen2.5-coder:7b or llama3.1:8b, the 7b or 8b means 7 or 8 billion parameters. These are the weights of the neural network, learned during training. The more there are:

  • the better the quality (in general);
  • the more RAM or VRAM is needed;
  • the slower it is to run.

Orders of magnitude on your machine:

ModelParametersMin RAM/VRAMCPU speed
qwen2.5-coder:0.5b0.5 billion1 GBvery fast
qwen2.5-coder:3b3 billion4 GBfast
llama3.1:8b8 billion8 GBacceptable
qwen2.5-coder:14b14 billion12 GBslow without GPU

The model only reads the last N tokens of the conversation. Beyond that, it gets truncated and “forgets”. This limit is called the context window. In the repo’s demos we set it to 20 480 tokens (about 50 pages of text):

response = client.chat(
model=MODEL_NAME,
messages=messages,
tools=tools,
options={"num_ctx": 20480},
)

If you set num_ctx too small, the model loses track in the middle of a long code generation. It’s a classic gotcha for Qwen2.5-Coder with tool calling: at num_ctx=2048, it “hallucinates” tool calls without really executing them.

Reference values to pick from:

num_ctxApproximate capacityTypical use
2 0484-5 pages of proseShort Q&A, toy demos. Breaks tool calling on Qwen-coder.
8 19215-20 pagesNormal user conversation, moderate prompts.
20 480 (this course’s default)~50 pagesAgent loop with several round-trips and code reading.
32 768~80 pagesReading a full source file, or a long conversation history.
128 000 (Llama 3.1 maximum)~300 pagesWhole-project analysis. Very slow and RAM-hungry.

The cost grows roughly with num_ctx: doubling the context window roughly doubles the memory consumption and slows down each turn.

A parameter that controls creativity in the sampling: 0 = always the most likely token (deterministic), 1 = more variation allowed. For code generation you want a low temperature (0 to 0.3). For writing a poem, you turn it up (0.7 to 1.2).

In our demos we keep Ollama’s default value (0.8 for most models), but this is the first knob to turn if the model starts “inventing” things.

Concrete examples: the same prompt at four temperatures

Prompt used: “Continue this sentence in one line: The sun was rising over…”

TemperatureExpected behaviourExample output
0.0Deterministic. Always picks the most likely token. Same output on every run.”The sun was rising over the sleeping city.”
0.3Slight variability, safe and natural phrasing. Good for code generation.”The sun was rising over the grey rooftops of the old town.”
0.8 (Ollama default)Varied but coherent outputs. The default for chat-style usage.”The sun was rising over the bay, painting the sails in soft pink light.”
1.2High creativity, unexpected wording, risk of incoherence and run-on sentences.”The sun was rising over a brass ocean of crooked roofs, drunk on copper mist.”

To set the temperature explicitly in the Ollama Python SDK:

response = client.chat(
model=MODEL_NAME,
messages=messages,
options={"temperature": 0.2, "num_ctx": 20480},
)

For Java code generation in demos 3 and 4, keeping the temperature low (between 0 and 0.3) prevents the model from inventing classes or method signatures that do not exist. For prompt engineering experiments in chapter 11, a slightly higher temperature (0.5 to 0.8) makes the differences between system prompts more visible.

4. Top-p and top-k — the two other sampling knobs

Section titled “4. Top-p and top-k — the two other sampling knobs”

temperature reshapes the probability distribution over candidate tokens. top_k and top_p cut that distribution before the model samples from it. They control which tokens are even allowed to be picked.

  • top_k: keep only the k most likely tokens at each step. top_k = 1 is equivalent to greedy decoding (always picks the most likely). top_k = 50 is permissive.
  • top_p (nucleus sampling): keep the smallest set of tokens whose cumulative probability adds up to p. top_p = 1.0 disables the filter; top_p = 0.9 keeps roughly the top of the distribution and discards the long tail.

In practice, top_p and top_k are used together with temperature, not instead of it.

Concrete examples: the same prompt under different sampling configurations

Prompt used: “In one short sentence, describe a forest at dawn.”

ConfigurationFilter appliedTypical outputComment
temperature=0, top_k=1Greedy — only the single most likely token”A forest at dawn is quiet and full of light.”Reproducible. Always the same answer.
temperature=0.8, top_k=10, top_p=0.9 (Ollama default)Top 10 tokens, capped at 90% cumulative probability”Mist drifts between the pines as the first sunlight catches the leaves.”Balanced — varied but stays on topic.
temperature=0.8, top_k=200, top_p=1.0Long tail of unlikely tokens allowed”The crisp dawn whispers cobalt threads through the bramble cathedral.”More creative — but starts producing unusual word combinations.
temperature=1.5, top_k=200, top_p=1.0High temperature + open filter”Forest dawn — pomegranate sky bleeding into the moss, an osprey’s vowel cracks.”Erratic. Often incoherent on long generations.
response = client.chat(
model=MODEL_NAME,
messages=messages,
options={
"temperature": 0.8,
"top_k": 40,
"top_p": 0.9,
},
)

Two practical rules:

  1. For deterministic outputs (code generation, tool calling, structured JSON), use temperature=0 and let top_k / top_p default values be irrelevant — temperature 0 already forces greedy choice.
  2. For controlled creativity, the standard recipe is temperature around 0.7–0.9 with top_p=0.9 and top_k between 20 and 50. Pushing top_p to 1.0 or top_k above 100 opens the door to unusual tokens — useful for brainstorming, dangerous for production output.

MythReality
”The LLM understands my request.”It predicts the most likely continuation of tokens. The understanding is an illusion produced by the large training corpus.
”Bigger is always better.”Not always. A 3B model well-tuned for code sometimes beats a generalist 14B. Specialisation matters.
”It knows that 2 + 2 = 4.”It has seen 2 + 2 = 4 many times in the corpus. For 12345 × 67, it can get it wrong: that’s a rare text.
”It has internet access.”No. Unless a tool is explicitly wired up, it only has what it learned during training.
”It remembers yesterday’s conversation.”No. At every API call, we resend the full history in the messages parameter. No history = no memory.

A model like llama3.1:8b you download via Ollama is actually a model that has been fine-tuned to follow instructions. The base Llama 3.1 family was trained to predict the continuation of Wikipedia, GitHub code, etc. Then Meta re-trained it on (human instruction, good answer) pairs so it behaves like an assistant.

Part of this fine-tuning is the tool-calling capability (chapter 03): the model is taught that when a developer gives it a list of tools, it should reply with a structured JSON describing the call to make, rather than generating free text. Not all models can do this: that’s what makes the choice of model critical (chapter 05).


  • An LLM = next-token predictor, nothing more, nothing less.
  • Three important parameters: size (in billions of weights), context window (num_ctx), temperature.
  • No memory between two calls: the Python code resends the history at every turn.
  • No system access: it generates text, period. That’s what we fix in the next chapter.
  • Not all models are fine-tuned for tool calling. Keep that in mind for chapter 05.