Train/test split & training
Why you must split
A model that has memorised the answers will look brilliant until it meets data it has never seen. To know whether your model has genuinely learned or is just overfitting, you must hold back a slice of the data that the model never touches during training.
```mermaid
flowchart LR
    D["Full dataset<br/>(100%)"]
    D --> Tr["Train<br/>(70–80%)"]
    D --> Va["Validation<br/>(10–15%)<br/>(optional)"]
    D --> Te["Test<br/>(10–20%)"]
    Tr -.->|"fit the model"| M["Model"]
    Va -.->|"tune hyperparameters"| M
    Te -.->|"final score (only ONCE)"| S["Honest performance"]
    classDef train fill:#dbeafe,stroke:#2563eb
    classDef val fill:#fef3c7,stroke:#c2410c
    classDef test fill:#fee2e2,stroke:#dc2626
    Tr:::train
    Va:::val
    Te:::test
```
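To make the memorisation risk concrete, here is a minimal sketch on synthetic data. The dataset, the 10% label noise (`flip_y=0.1`) and the choice of an unpruned `DecisionTreeClassifier` are assumptions picked for illustration; they are not part of the walkthrough above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise (assumed values, for illustration only)
X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# An unpruned tree can keep splitting until it reproduces every training label
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)  # accuracy on rows the tree has already seen
test_acc = tree.score(X_test, y_test)     # accuracy on rows held back from training

print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

The gap between the two numbers is what the held-out slice is there to reveal: the training score alone would suggest a perfect model.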
The split, in three flavours
1. Simple random split (default)
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```
Use when rows are independent and identically distributed, which covers most tabular problems.
2. Stratified split (imbalanced classes)
If your y is “fraud / not fraud” with a 99% / 1% class balance, a random split can leave the test set with very few fraud rows, or none at all. Stratifying keeps the same class ratio in train and test:
```python
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```
Stratify by default on classification problems.
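The effect of `stratify=y` is easy to check on a toy label vector. The sketch below assumes a hypothetical 1,000-row dataset with 10 fraud rows; the placeholder feature column exists only so that `train_test_split` has something to split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 990 "not fraud" (0) and 10 "fraud" (1)
y = np.array([0] * 990 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

# Stratified split: both sides keep the 1% fraud share
_, _, y_tr_s, y_te_s = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print("fraud share in train:", y_tr_s.mean())
print("fraud share in test: ", y_te_s.mean())

# Without stratify, the number of fraud rows in the test set depends on the seed
for seed in range(5):
    _, _, _, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    print(f"seed={seed}: fraud rows in unstratified test set = {y_te.sum()}")
```

With only 10 positive rows, the unstratified counts can drift noticeably from seed to seed, while the stratified split always places 8 in train and 2 in test.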
3. Time-based split
If the data has a time axis (sales, sensors, user activity), a random split causes leakage: the model would train on the future and be asked to predict the past. Split by date instead:
```python
# Train on everything before the cutoff date, test on everything from it onward.
X_train = df[df['date'] < '2025-01-01']
X_test = df[df['date'] >= '2025-01-01']
```
Why 80/20 and not 50/50 or 95/5? A concrete look
The choice of split ratio is not arbitrary. Too much data in test leaves too little to learn from. Too little in test makes the score unstable, because it depends on which 50 rows happened to end up there. The table below shows the same LogisticRegression trained on a synthetic 1,000-row classification dataset, varying only the test_size.
Same dataset, four different split ratios
| test_size | Train rows | Test rows | Test accuracy (5 reruns, mean ± std) | Comment |
|---|---|---|---|---|
| 0.50 (50/50) | 500 | 500 | 0.842 ± 0.011 | Score very stable, but the model has only half the data to learn from and can underfit on small datasets |
| 0.30 (70/30) | 700 | 300 | 0.861 ± 0.014 | Solid choice for small datasets |
| 0.20 (80/20) | 800 | 200 | 0.873 ± 0.016 | Default sweet spot for most projects |
| 0.05 (95/5) | 950 | 50 | 0.880 ± 0.052 | Higher mean, but the ± 0.052 std means a single rerun can show 0.83 or 0.93. Score becomes unreliable |
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

for test_size in [0.50, 0.30, 0.20, 0.05]:
    scores = []
    for seed in range(5):  # five reruns, each with a different random split
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size, random_state=seed)
        m = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        scores.append(m.score(X_te, y_te))
    # Report mean ± standard deviation across the reruns, matching the table header
    print(f"test_size={test_size}: {np.mean(scores):.3f} ± {np.std(scores):.3f}")
```
Two takeaways:
- 80/20 is a compromise, not a universal law — it gives enough data to learn from and enough test points to trust the score.
- A “great” score on a tiny test set is suspect. With 50 rows in test, getting one extra prediction right changes accuracy by 2 percentage points. Always look at variance, not just the mean.
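The arithmetic behind the second takeaway can be worked through directly. The sketch below uses the binomial standard error, sqrt(p(1 - p) / n), as a rough approximation of how much a measured accuracy wobbles with test-set size; the 0.87 accuracy is an assumed, illustrative value, and both helper functions are hypothetical names introduced here.

```python
import math

def accuracy_step(n_test: int) -> float:
    """Smallest possible change in accuracy: one prediction out of n_test."""
    return 1 / n_test

def accuracy_std_error(accuracy: float, n_test: int) -> float:
    """Binomial standard error of an accuracy measured on n_test independent rows."""
    return math.sqrt(accuracy * (1 - accuracy) / n_test)

for n_test in [500, 300, 200, 50]:
    step = accuracy_step(n_test)
    se = accuracy_std_error(0.87, n_test)  # 0.87 is an assumed, illustrative accuracy
    print(f"n_test={n_test}: one row = {step:.3%}, standard error ~ {se:.3f}")
```

This only captures the noise that comes from which rows land in the test set. The spread across reruns in the table also includes variation in the trained model, so treat the formula as a sanity check rather than a replacement for rerunning.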
The training trio: train, validation, test
| Set | Purpose | When you touch it |
|---|---|---|
| Train | Fit the model | Continuously |
| Validation | Tune hyperparameters, compare models | Frequently |
| Test | Final honest score | Once, at the very end |
The test set is sacred — the moment you start optimising against it, you have just polluted it and you no longer have an honest estimate. If you do many experiments, use the validation set; the test set is reserved for the final report.
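One common way to get all three sets is to call `train_test_split` twice: first to lock the test set away, then to carve a validation set out of what remains. The sketch below assumes a synthetic 1,000-row dataset and a 70/15/15 split; the row counts are illustrative choices, not a recommendation from this page.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Step 1: set the test set aside first (an integer test_size means a row count)
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=150, stratify=y, random_state=42
)

# Step 2: carve the validation set out of the remaining 850 rows
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=150, stratify=y_rest, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

Splitting off the test set first means nothing done later with the train and validation rows can touch it by accident.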
Cross-validation — a smarter alternative
Instead of a single validation set, split the training data into K folds, train K times, each time holding out a different fold for validation, and average the K scores. This gives a more stable estimate than a single split.
```mermaid
flowchart TB
    T["Training data"]
    T --> F1["Fold 1 = val<br/>2,3,4,5 = train"]
    T --> F2["Fold 2 = val<br/>1,3,4,5 = train"]
    T --> F3["Fold 3 = val<br/>1,2,4,5 = train"]
    T --> F4["Fold 4 = val<br/>1,2,3,5 = train"]
    T --> F5["Fold 5 = val<br/>1,2,3,4 = train"]
    F1 --> S["Average<br/>the 5 scores"]
    F2 --> S
    F3 --> S
    F4 --> S
    F5 --> S
```
```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
print(scores.mean(), '+/-', scores.std())
```
Use 5 or 10 folds by default. For time series, use TimeSeriesSplit instead.
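To see how `TimeSeriesSplit` differs from ordinary K-fold, it helps to print the indices it produces. This is a minimal sketch on twelve placeholder rows that are assumed to be sorted oldest first; the row count and `n_splits=3` are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve rows assumed to be already sorted by time (oldest first)
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X))

for i, (train_idx, val_idx) in enumerate(folds, start=1):
    # Every validation block comes strictly after the rows used for training
    print(f"fold {i}: train={train_idx.tolist()} val={val_idx.tolist()}")
```

The training window grows with each fold and validation rows are always later than training rows, so the model never sees the future. The same splitter object can be passed to `cross_val_score` through its `cv` argument.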
Choosing a model — a tiny cheat sheet
| Problem | First model to try | Why |
|---|---|---|
| Linear relationship, tabular | Linear / Logistic regression | Fast, interpretable baseline |
| Tabular, non-linear, mixed types | Random Forest | Strong default that needs little tuning |
| Tabular, want best score | Gradient Boosting (XGBoost, LightGBM) | Frequently tops tabular Kaggle competitions |
| Small dataset (under 1000 rows) | k-NN, SVM | Few parameters, robust |
| Images | CNN (pretrained) | Pretrained deep nets are the standard starting point for vision |
| Text | Transformer (BERT, embeddings) | Standard approach since around 2020 |
Default rule: start with Logistic Regression (classification) or Linear Regression (regression) as a baseline. Then try Random Forest or Gradient Boosting. Beating a strong baseline is the goal.
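The "baseline first, then escalate" rule can be sketched as a small comparison loop. The synthetic dataset and the extra majority-class `DummyClassifier` floor are assumptions added for illustration; only Logistic Regression and Random Forest come from the rule above.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for your training data (assumed, for illustration only)
X_train, y_train = make_classification(n_samples=1000, n_features=20, random_state=0)

candidates = {
    "majority-class dummy": DummyClassifier(strategy="most_frequent"),
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, model in candidates.items():
    # 5-fold cross-validation on the training data only; the test set stays untouched
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If a more complex model does not clearly beat the simple baseline here, the extra complexity is probably not paying for itself yet.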
Actually training the model
The drumroll moment: training is a single call to fit.
```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
```
That’s it. Behind the scenes, scikit-learn:
- Initialises 200 decision trees with random samples + random features.
- Grows each tree until it can’t split further (or until a depth limit).
- Stores them so they can vote at prediction time.
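.predict() runs the forest forward: it feeds new rows through every tree and combines their outputs. For a classifier, scikit-learn averages the trees' class probabilities and returns the most probable class.
```python
y_pred = model.predict(X_test)
```
We now have predictions on the held-out test set. Next step: how good are they?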
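To look inside the prediction step, the sketch below rebuilds the forest's answer by hand from its individual trees and compares it with what `.predict()` returns. The synthetic dataset and the smaller `n_estimators=50` are assumptions chosen to keep the example quick.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_train, y_train)

# Average the class probabilities predicted by each individual tree
tree_probas = np.mean([tree.predict_proba(X_test) for tree in model.estimators_], axis=0)

# Pick the class with the highest averaged probability
manual_pred = model.classes_[np.argmax(tree_probas, axis=1)]

print(np.allclose(tree_probas, model.predict_proba(X_test)))
print(np.array_equal(manual_pred, model.predict(X_test)))
```

Both checks should agree, which shows that the ensemble's prediction is an average over the fitted trees in `model.estimators_`.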
Key takeaways
- Never evaluate on data the model has seen → split.
- Stratify on classification, split by date on time series.
- The test set is sacred — touch it once.
- Cross-validation gives a more stable estimate than a single validation set.
- Start with a baseline model (Linear / Logistic), then escalate.
- Training = model.fit(X_train, y_train). The work is everything around this line.
Next: Evaluation & tuning — picking the right metric and squeezing more juice from your model.