Train/test split & training

Why you must split

A model that has memorised the answers will look brilliant — until it meets data it has never seen. To know if your model is genuinely smart or just overfitting, you must hold back a slice of the data the model never touches during training.
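To make the "memorised the answers" effect concrete, here is a minimal sketch on synthetic data. The dataset and the choice of an unconstrained decision tree are assumptions made for illustration only; the point is the gap between the score on seen rows and the score on unseen rows.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# An unconstrained tree keeps splitting until it reproduces every training label
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)  # scored on rows the tree has seen
test_acc = tree.score(X_test, y_test)     # scored on rows it has never seen
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

The training score alone would suggest a perfect model; only the held-out score shows how it behaves on new data.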

flowchart LR
  D["Full dataset<br/>(100%)"]
  D --> Tr["Train<br/>(70–80%)"]
  D --> Va["Validation<br/>(10–15%)<br/>(optional)"]
  D --> Te["Test<br/>(10–20%)"]
  Tr -.->|"fit the model"| M["Model"]
  Va -.->|"tune hyperparameters"| M
  Te -.->|"final score (only ONCE)"| S["Honest performance"]
  classDef train fill:#dbeafe,stroke:#2563eb
  classDef val fill:#fef3c7,stroke:#c2410c
  classDef test fill:#fee2e2,stroke:#dc2626
  Tr:::train
  Va:::val
  Te:::test

The three slices of your data. The test set is sacred — touch it only at the end.
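One common way to get three slices with scikit-learn is to call train_test_split twice. The sketch below assumes a 70/10/20 split and synthetic data; the variable names X_rest and X_val are illustrative, not part of any API.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Step 1: set the test set aside (20% of all rows)
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# Step 2: carve the validation set out of the remainder.
# 0.125 of the remaining 80% equals 10% of the full dataset.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.125, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

Splitting off the test set first means nothing done afterwards (tuning, model comparison) can touch it by accident.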

The split, in three flavours

1. Simple random split (default)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

Use when rows are independent and identically distributed — most tabular problems.

2. Stratified split (imbalanced classes)

If your y is “fraud / not fraud” with 99% / 1%, a random split might give a test set with zero fraud rows. Stratify keeps the same ratio in train and test:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

Stratify by default on classification problems; it costs nothing and matters most when the classes are imbalanced.
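A quick way to check that stratification did its job is to compare the positive rate in each slice. This sketch assumes a synthetic dataset with roughly 1% positives, standing in for the fraud example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 99% class 0 and 1% class 1
X, y = make_classification(
    n_samples=5000, n_features=20, weights=[0.99, 0.01], flip_y=0, random_state=0
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fraction of positive rows in each slice; labels are 0/1, so the mean is the rate
for name, labels in [("full", y), ("train", y_train), ("test", y_test)]:
    print(f"{name:5s} positive rate: {labels.mean():.4f}")
```

If the three rates differ noticeably, the split was probably not stratified.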

3. Time-based split

If the data has a time axis (sales, sensors, user activity), a random split leaks information: the model would train on the future and be tested on the past. Split by date instead:

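train_df = df[df['date'] <  '2025-01-01']   # everything before the cutoff
test_df  = df[df['date'] >= '2025-01-01']   # the cutoff date onwards
X_train, y_train = train_df.drop(columns=['date', 'target']), train_df['target']  # 'target' is a placeholder label column
X_test,  y_test  = test_df.drop(columns=['date', 'target']),  test_df['target']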
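For a version you can run end to end, here is a small sketch with a made-up daily DataFrame. The column names feature and target, and the data itself, are hypothetical. It also adds a sanity check that no training row is dated after the start of the test period.

```python
import numpy as np
import pandas as pd

# Hypothetical daily data: a 'date', one 'feature' and a binary 'target'
rng = np.random.default_rng(0)
df = pd.DataFrame({"date": pd.date_range("2024-07-01", "2025-06-30", freq="D")})
df["feature"] = rng.normal(size=len(df))
df["target"] = (df["feature"] + rng.normal(scale=0.5, size=len(df)) > 0).astype(int)

cutoff = pd.Timestamp("2025-01-01")
train_df = df[df["date"] < cutoff]
test_df = df[df["date"] >= cutoff]

# Separate features from the label; the raw date column is left out of X here
X_train, y_train = train_df[["feature"]], train_df["target"]
X_test, y_test = test_df[["feature"]], test_df["target"]

# Sanity check: every training row is older than every test row
assert train_df["date"].max() < test_df["date"].min()
print(len(train_df), "train rows,", len(test_df), "test rows")
```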

Why 80/20 and not 50/50 or 95/5? — a concrete look

The choice of split ratio is not arbitrary. Too much in test = not enough to learn from. Too little in test = the score becomes unstable (depends on which 50 rows ended up there). The table below shows the same LogisticRegression trained on a synthetic 1,000-row classification dataset, varying only the test_size.

Same dataset, four different split ratios

`test_size`	Train rows	Test rows	Test accuracy (5 reruns, mean ± std)	Comment
`0.50` (50/50)	500	500	0.842 ± 0.011	Score very stable, but the model has only half the data to learn from — underfits on small datasets
`0.30` (70/30)	700	300	0.861 ± 0.014	Solid choice for small datasets
`0.20` (80/20)	800	200	0.873 ± 0.016	Default sweet spot for most projects
`0.05` (95/5)	950	50	0.880 ± 0.052	Higher mean — but the ± 0.052 std means a single rerun can show 0.83 or 0.93. Score becomes unreliable

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
for test_size in [0.50, 0.30, 0.20, 0.05]:
    scores = []
    for seed in range(5):  # 5 reruns, each with a different random split
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size, random_state=seed)
        m = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        scores.append(m.score(X_te, y_te))
    # Report mean and standard deviation, matching the table's "mean ± std" column
    print(f"test_size={test_size}: {np.mean(scores):.3f} ± {np.std(scores):.3f}")

Two takeaways:

80/20 is a compromise, not a universal law — it gives enough data to learn from and enough test points to trust the score.
A “great” score on a tiny test set is suspect. With 50 rows in test, getting one extra prediction right changes accuracy by 2 percentage points. Always look at variance, not just the mean.
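The arithmetic behind the second takeaway can be sketched in a few lines. This assumes the usual binomial approximation (each test prediction treated as an independent coin flip), and the accuracy of 0.87 is just an illustrative value in the range of the table; accuracy_std_error is a helper name invented for this example.

```python
import math


def accuracy_std_error(accuracy: float, n_test: int) -> float:
    """Binomial standard error of an accuracy measured on n_test rows."""
    return math.sqrt(accuracy * (1 - accuracy) / n_test)


for n_test in [50, 200, 500]:
    one_row = 1 / n_test  # how much a single prediction moves the accuracy
    se = accuracy_std_error(0.87, n_test)
    print(f"n_test={n_test:3d}: one row = {one_row:.3f}, std error ~ {se:.3f}")
```

Quadrupling the test set from 50 to 200 rows halves the standard error, which is why very small test sets give such jumpy scores.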

The training trio: train, validation, test

Set	Purpose	When you touch it
Train	Fit the model	Continuously
Validation	Tune hyperparameters, compare models	Frequently
Test	Final honest score	Once, at the very end

The test set is sacred — the moment you start optimising against it, you have just polluted it and you no longer have an honest estimate. If you do many experiments, use the validation set; the test set is reserved for the final report.
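As a rough sketch of that discipline: compare candidates on the validation set, pick one, and only then score the test set a single time. The candidate max_depth values and the synthetic data are arbitrary choices for this example, and refitting on train plus validation before the final score is one common option, not the only one.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

# Compare candidate settings on the validation set only
best_depth, best_val = None, -1.0
for depth in [2, 5, 10]:  # candidate values chosen arbitrarily for this sketch
    model = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=42)
    val_score = model.fit(X_train, y_train).score(X_val, y_val)
    if val_score > best_val:
        best_depth, best_val = depth, val_score

# Refit the chosen setting on train + validation, then touch the test set once
final = RandomForestClassifier(n_estimators=50, max_depth=best_depth, random_state=42)
final.fit(X_rest, y_rest)
test_score = final.score(X_test, y_test)
print(f"best max_depth={best_depth}, validation={best_val:.3f}, test={test_score:.3f}")
```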

Cross-validation — a smarter alternative

Instead of a single validation set, split the training data into K folds, train K times (each time holding out a different fold for validation), and average the K scores. This gives a more stable estimate than any single validation split.

flowchart TB
  T["Training data"]
  T --> F1["Fold 1 = val<br/>2,3,4,5 = train"]
  T --> F2["Fold 2 = val<br/>1,3,4,5 = train"]
  T --> F3["Fold 3 = val<br/>1,2,4,5 = train"]
  T --> F4["Fold 4 = val<br/>1,2,3,5 = train"]
  T --> F5["Fold 5 = val<br/>1,2,3,4 = train"]
  F1 --> S["Average<br/>the 5 scores"]
  F2 --> S
  F3 --> S
  F4 --> S
  F5 --> S

5-fold cross-validation — every example is used for both training AND validation, in different rounds.

from sklearn.model_selection import cross_val_score

# model is any unfitted scikit-learn estimator, e.g. LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
print(scores.mean(), '+/-', scores.std())
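To see what cross_val_score is doing, the same procedure can be written as an explicit loop. This is a sketch on synthetic data with a LogisticRegression chosen for illustration; for a classifier and an integer cv, scikit-learn uses stratified folds without shuffling, which is what the StratifiedKFold below reproduces.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the training portion of the data
X_train, y_train = make_classification(n_samples=800, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, val_idx in cv.split(X_train, y_train):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(X_train[train_idx], y_train[train_idx])
    manual_scores.append(fold_model.score(X_train[val_idx], y_train[val_idx]))

manual_scores = np.array(manual_scores)
library_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")
print("manual :", manual_scores.round(3))
print("library:", library_scores.round(3))
```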

Use 5 or 10 folds by default. For time-series, use TimeSeriesSplit instead.
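A small sketch shows how TimeSeriesSplit differs from ordinary K-fold: the validation rows always come after the training rows. The twelve-row array is a toy example and assumes the rows are already sorted from oldest to newest.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve rows, assumed to be sorted from oldest to newest
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = [(train_idx.tolist(), val_idx.tolist()) for train_idx, val_idx in tscv.split(X)]

for fold, (train_idx, val_idx) in enumerate(splits, start=1):
    # Every validation index is larger than every training index
    print(f"fold {fold}: train={train_idx} val={val_idx}")
```

The training window grows with each fold, so later folds train on more history than earlier ones.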

Choosing a model — a tiny cheat sheet

Problem	First model to try	Why
Linear relationship, tabular	Linear / Logistic regression	Fast, interpretable baseline
Tabular, non-linear, mixed types	Random Forest	Strong default, needs little tuning
Tabular, want best score	Gradient Boosting (XGBoost, LightGBM)	Often the top performer in tabular Kaggle competitions
Small dataset (under 1000 rows)	k-NN, SVM	Few hyperparameters to tune; cheap to train on small data
Images	CNN (pretrained)	Deep nets dominate vision tasks
Text	Transformer (BERT, embeddings)	The standard approach since around 2020

Default rule: start with Logistic Regression (classification) or Linear Regression (regression) as a baseline, then try Random Forest or Gradient Boosting. A more complex model earns its place only if it clearly beats the baseline.
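The baseline-first rule might look like this in practice. The sketch uses synthetic data standing in for your training portion and compares the two models with the same 5-fold cross-validation; the dictionary names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# X, y stand in for the training portion of a real dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

candidates = {
    "logistic regression (baseline)": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the difference between the two means is smaller than the standard deviations, the extra complexity may not be worth it.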

Actually training the model

The drumroll moment: create the model, then make a single call to fit().

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

That’s it. Behind the scenes scikit-learn:

Builds 200 decision trees, each trained on a bootstrap sample of the rows and considering a random subset of features at each split.
Grows each tree until it can’t split further (or until a depth limit).
Stores them so they can vote at prediction time.

.predict() then feeds new rows through every tree, averages the trees' class probabilities, and returns the most likely class.

y_pred = model.predict(X_test)
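To make the averaging step concrete, the sketch below recomputes the forest's predictions by hand from its individual trees (exposed by scikit-learn as estimators_) and compares them with model.predict. The synthetic data and the choice of 51 trees are assumptions made to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=51, random_state=42).fit(X_train, y_train)

# Average the class probabilities predicted by each individual tree
tree_probas = np.stack([tree.predict_proba(X_test) for tree in model.estimators_])
mean_proba = tree_probas.mean(axis=0)

# Pick the class with the highest averaged probability
manual_pred = model.classes_[mean_proba.argmax(axis=1)]
agreement = (manual_pred == model.predict(X_test)).mean()
print(f"agreement with model.predict: {agreement:.3f}")
```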

We now have predictions on the held-out test set. Next step: how good are they?

Key takeaways

Never evaluate on data the model has seen → split.
Stratify on classification, split by date on time series.
The test set is sacred — touch it once.
Cross-validation gives a more stable estimate than a single validation set.
Start with a baseline model (Linear / Logistic), then escalate.
Training = model.fit(X_train, y_train). The work is everything around this line.

Next: Evaluation & tuning — picking the right metric and squeezing more juice from your model.