Train/test split & training

A model that has memorised the answers will look brilliant — until it meets data it has never seen. To know if your model is genuinely smart or just overfitting, you must hold back a slice of the data the model never touches during training.
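To make the memorisation problem concrete, here is a minimal sketch on synthetic data. The dataset, the noise level (`flip_y`) and the choice of an unrestricted decision tree are all assumptions made for illustration; they are not part of the original walkthrough.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with noisy labels (flip_y), so memorising cannot generalise perfectly.
X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# An unrestricted tree keeps splitting until it classifies every training row correctly.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)  # scored on rows the tree has already seen
test_acc = tree.score(X_test, y_test)     # scored on rows it has never seen
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

The gap between the two numbers is the part of the training score that came from memorising rather than learning.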

flowchart LR
  D["Full dataset<br/>(100%)"]
  D --> Tr["Train<br/>(70–80%)"]
  D --> Va["Validation<br/>(10–15%)<br/>(optional)"]
  D --> Te["Test<br/>(10–20%)"]
  Tr -.->|"fit the model"| M["Model"]
  Va -.->|"tune hyperparameters"| M
  Te -.->|"final score (only ONCE)"| S["Honest performance"]
  classDef train fill:#dbeafe,stroke:#2563eb
  classDef val fill:#fef3c7,stroke:#c2410c
  classDef test fill:#fee2e2,stroke:#dc2626
  Tr:::train
  Va:::val
  Te:::test
The three slices of your data. The test set is sacred — touch it only at the end.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

Use a plain random split when rows are independent and identically distributed, which covers most tabular problems.

If your y is “fraud / not fraud” with 99% / 1%, a random split might give a test set with zero fraud rows. Stratify keeps the same ratio in train and test:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

Always stratify on classification problems.
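A quick way to check that stratification did its job is to compare class ratios after the split. The sketch below uses a made-up label vector with 1% positives and a placeholder feature column; both are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 990 "not fraud" (0) and 10 "fraud" (1).
y = np.array([0] * 990 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature, one column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# With stratify=y both parts keep the 1% fraud rate of the full dataset.
print("fraud rate in train:", y_train.mean())
print("fraud rate in test: ", y_test.mean())
```

Dropping `stratify=y` and rerunning with a few different `random_state` values shows how much the test-set fraud count can drift when the split is left to chance.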

If the data has a time axis (sales, sensors, user activity), a random split leaks information: the model would train on future rows and be tested on past ones. Split by date instead:

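# Assumes df['date'] holds datetimes (or ISO-formatted date strings).
# These are full rows; separate the feature columns from the target afterwards.
train_df = df[df['date'] < '2025-01-01']
test_df = df[df['date'] >= '2025-01-01']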
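For a self-contained version of the date split, the sketch below builds a tiny hypothetical sales table. The column names (`price`, `units_sold`) and the values are invented for illustration; only the cutoff date comes from the text above.

```python
import pandas as pd

# Hypothetical daily sales data; column names and values are illustrative only.
df = pd.DataFrame({
    "date": pd.date_range("2024-12-25", periods=14, freq="D"),
    "price": [9.5, 9.7, 9.6, 9.9, 10.1, 10.0, 10.2, 10.4, 10.3, 10.5, 10.8, 10.7, 10.9, 11.0],
    "units_sold": [20, 22, 21, 25, 27, 26, 28, 30, 29, 31, 34, 33, 35, 36],
})

cutoff = pd.Timestamp("2025-01-01")
train_df = df[df["date"] < cutoff]
test_df = df[df["date"] >= cutoff]

# Separate features from the target only after splitting by date.
feature_cols = ["price"]
X_train, y_train = train_df[feature_cols], train_df["units_sold"]
X_test, y_test = test_df[feature_cols], test_df["units_sold"]

# Sanity check: every training row predates every test row.
assert train_df["date"].max() < test_df["date"].min()
print(len(X_train), "train rows,", len(X_test), "test rows")
```

The final assertion is a cheap guard worth keeping in real pipelines, since a mis-parsed date column is an easy way to reintroduce leakage.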

Why 80/20 and not 50/50 or 95/5? — a concrete look

The choice of split ratio is not arbitrary. Too much in test = not enough to learn from. Too little in test = the score becomes unstable (depends on which 50 rows ended up there). The table below shows the same LogisticRegression trained on a synthetic 1,000-row classification dataset, varying only the test_size.

Same dataset, four different split ratios
| test_size | Train rows | Test rows | Test accuracy (5 reruns, mean ± std) | Comment |
| --- | --- | --- | --- | --- |
| 0.50 (50/50) | 500 | 500 | 0.842 ± 0.011 | Score very stable, but the model has only half the data to learn from; underfits on small datasets |
| 0.30 (70/30) | 700 | 300 | 0.861 ± 0.014 | Solid choice for small datasets |
| 0.20 (80/20) | 800 | 200 | 0.873 ± 0.016 | Default sweet spot for most projects |
| 0.05 (95/5) | 950 | 50 | 0.880 ± 0.052 | Higher mean, but the ± 0.052 std means a single rerun can show 0.83 or 0.93. Score becomes unreliable |

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

for test_size in [0.50, 0.30, 0.20, 0.05]:
    scores = []
    for seed in range(5):  # five reruns, each with a different random split
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size, random_state=seed)
        m = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        scores.append(m.score(X_te, y_te))
    # Report the mean and standard deviation across the reruns.
    print(f"test_size={test_size}: {np.mean(scores):.3f} ± {np.std(scores):.3f}")

Two takeaways:

  1. 80/20 is a compromise, not a universal law — it gives enough data to learn from and enough test points to trust the score.
  2. A “great” score on a tiny test set is suspect. With 50 rows in test, getting one extra prediction right changes accuracy by 2 percentage points. Always look at variance, not just the mean.
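The arithmetic behind the second takeaway can be sketched with the binomial standard error of an accuracy, sqrt(p(1 - p) / n). This is a rule of thumb that assumes independent test rows and only captures noise from the size of the test set; it is not how the table above was produced, and the helper name is hypothetical.

```python
import math

def accuracy_standard_error(accuracy: float, n_test: int) -> float:
    """Binomial standard error of an accuracy measured on n_test independent rows."""
    return math.sqrt(accuracy * (1 - accuracy) / n_test)

# Same nominal accuracy, different test-set sizes.
for n_test in [50, 200, 500]:
    se = accuracy_standard_error(0.88, n_test)
    print(f"n_test={n_test}: one row = {1 / n_test:.3f}, standard error ≈ {se:.3f}")
```

Because the error shrinks with the square root of the test size, quadrupling the test set only halves the noise.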

The training trio: train, validation, test

| Set | Purpose | When you touch it |
| --- | --- | --- |
| Train | Fit the model | Continuously |
| Validation | Tune hyperparameters, compare models | Frequently |
| Test | Final honest score | Once, at the very end |

The test set is sacred — the moment you start optimising against it, you have just polluted it and you no longer have an honest estimate. If you do many experiments, use the validation set; the test set is reserved for the final report.
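scikit-learn has no single call that produces all three sets, so one common approach is to call `train_test_split` twice. The sketch below assumes a 70/15/15 split on synthetic data and passes integer sizes, which `train_test_split` interprets as absolute row counts.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 15% of the rows for test and 15% for validation, expressed as row counts.
n_holdout = len(X) * 15 // 100

# Step 1: set the test set aside first so nothing downstream can touch it.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=n_holdout, stratify=y, random_state=42
)
# Step 2: carve the validation set out of what remains.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=n_holdout, stratify=y_rest, random_state=42
)
print(len(X_train), len(X_val), len(X_test))
```

Using row counts rather than fractions in the second call avoids having to recompute the validation share relative to the reduced dataset.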

Cross-validation — a smarter alternative

Instead of a single validation set, split the training data into K folds, train K times (each time holding out a different fold for validation), and average the K scores. This gives a more stable estimate than a single validation split.

flowchart TB
  T["Training data"]
  T --> F1["Fold 1 = val<br/>2,3,4,5 = train"]
  T --> F2["Fold 2 = val<br/>1,3,4,5 = train"]
  T --> F3["Fold 3 = val<br/>1,2,4,5 = train"]
  T --> F4["Fold 4 = val<br/>1,2,3,5 = train"]
  T --> F5["Fold 5 = val<br/>1,2,3,4 = train"]
  F1 --> S["Average<br/>the 5 scores"]
  F2 --> S
  F3 --> S
  F4 --> S
  F5 --> S
5-fold cross-validation — every example is used for both training AND validation, in different rounds.
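from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Any scikit-learn estimator works here; logistic regression is just an example.
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
print(scores.mean(), '+/-', scores.std())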
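To see what happens inside that call, the loop can be written out by hand. This is a simplified sketch on synthetic data using a shuffled `KFold`; for classifiers, `cross_val_score` with an integer `cv` uses stratified folds, so the individual fold scores will not match it exactly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X_train, y_train = make_classification(n_samples=500, n_features=20, random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
for fold, (fit_idx, val_idx) in enumerate(kfold.split(X_train), start=1):
    # Fit on four folds, score on the one fold that was left out.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[fit_idx], y_train[fit_idx])
    score = model.score(X_train[val_idx], y_train[val_idx])
    fold_scores.append(score)
    print(f"fold {fold}: validation accuracy = {score:.3f}")

print(f"mean = {np.mean(fold_scores):.3f}, std = {np.std(fold_scores):.3f}")
```

A fresh model is created in every round so that nothing learned from one fold carries over into the next.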

Use 5 or 10 folds by default. For time-series, use TimeSeriesSplit instead.
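The difference with `TimeSeriesSplit` is easiest to see by printing the indices it produces. The twelve-row array below is a stand-in for data already sorted from oldest to newest, which is an assumption of this sketch.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve hypothetical observations, already sorted from oldest to newest.
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))
for fold, (train_idx, val_idx) in enumerate(splits, start=1):
    # Validation indices always come after the training indices.
    print(f"fold {fold}: train={train_idx.tolist()} validate={val_idx.tolist()}")
```

The same object can be passed as `cv=tscv` to `cross_val_score`, so the model is always validated on rows that come after the ones it was fitted on.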

| Problem | First model to try | Why |
| --- | --- | --- |
| Linear relationship, tabular | Linear / Logistic regression | Fast, interpretable baseline |
| Tabular, non-linear, mixed types | Random Forest | Strong default that usually works well with little tuning |
| Tabular, want best score | Gradient Boosting (XGBoost, LightGBM) | Frequently among the top performers in tabular Kaggle competitions |
| Small dataset (under 1000 rows) | k-NN, SVM | Few hyperparameters to tune |
| Images | CNN (pretrained) | Pretrained deep nets are the standard starting point for vision |
| Text | Transformer (BERT, embeddings) | The standard approach since about 2020 |

Default rule: start with Logistic Regression (classification) or Linear Regression (regression) as a baseline. Then try Random Forest or Gradient Boosting. Beating a strong baseline is the goal.
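One way to apply this rule is to score every candidate with the same cross-validation setup and compare. The sketch below uses synthetic data and adds a `DummyClassifier` as a floor; that extra reference point is an addition for illustration and is not part of the rule above.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X_train, y_train = make_classification(n_samples=500, n_features=20, random_state=0)

candidates = {
    "majority class (floor)": DummyClassifier(strategy="most_frequent"),
    "logistic regression (baseline)": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, estimator in candidates.items():
    # Same folds and same metric for every candidate, so the scores are comparable.
    scores = cross_val_score(estimator, X_train, y_train, cv=5, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If a more complex model does not clearly beat the baseline here, the extra complexity is probably not paying for itself yet.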

The drumroll moment: training itself is a single call to fit.

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

That’s it. Behind the scenes scikit-learn:

  1. Builds 200 decision trees, each trained on a bootstrap sample of the rows and considering a random subset of features at every split.
  2. Grows each tree until it can’t split further (or until a depth limit).
  3. Stores them so they can vote at prediction time.
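The three steps above leave traces on the fitted object that can be inspected. The sketch below refits the same forest on synthetic data (an assumption, so that the block runs on its own) and looks at the stored trees.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_train, y_train = make_classification(n_samples=500, n_features=20, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)

# estimators_ holds the fitted trees, one DecisionTreeClassifier per estimator.
print("trees stored:", len(model.estimators_))
print("bootstrap sampling:", model.bootstrap, "| features tried per split:", model.max_features)

# No depth limit is set, so each tree grows until it cannot split further.
depths = [tree.get_depth() for tree in model.estimators_]
print("tree depth: min", min(depths), "max", max(depths))
```

Setting `max_depth` on the forest is the depth limit mentioned in step 2; with it, the reported depths would be capped at that value.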

.predict() runs the forest forward: it feeds each new row through every tree, averages the trees' class probabilities, and returns the most probable class.

y_pred = model.predict(X_test)
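The averaging step can be reproduced by hand from the individual trees. This is a sketch on synthetic data, meant only to illustrate what `predict` computes for a `RandomForestClassifier`; in practice you would just call `model.predict`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)

# Average the class probabilities predicted by each individual tree...
tree_probas = np.stack([tree.predict_proba(X_test) for tree in model.estimators_])
mean_proba = tree_probas.mean(axis=0)
# ...then pick the class with the highest averaged probability.
manual_pred = model.classes_[mean_proba.argmax(axis=1)]

print("same as model.predict:", np.array_equal(manual_pred, model.predict(X_test)))
```

The averaged probabilities are also available directly through `model.predict_proba(X_test)`, which is useful when a decision threshold other than "most probable class" is needed.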

We now have predictions on the held-out test set. Next step: how good are they?

  • Never evaluate on data the model has seen → split.
  • Stratify on classification, split by date on time series.
  • The test set is sacred — touch it once.
  • Cross-validation gives a more stable estimate than a single validation set.
  • Start with a baseline model (Linear / Logistic), then escalate.
  • Training = model.fit(X_train, y_train). The work is everything around this line.

Next: Evaluation & tuning — picking the right metric and squeezing more juice from your model.