
Evaluation & tuning

If 99% of your emails are not spam and your model predicts “not spam” for everything, your accuracy is 99% — and your model is useless. Picking the right metric is half the battle.
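To make that concrete, here is a minimal sketch in plain Python. The 1,000-email dataset and the always-"not spam" predictor are made up for illustration; only the 99% figure comes from the text.

```python
# Hypothetical data: 1,000 emails, 10 of them spam (label 1), 990 not spam (label 0).
y_true = [1] * 10 + [0] * 990

# A "model" that predicts "not spam" for every email.
y_pred = [0] * len(y_true)

# Accuracy: share of emails labelled correctly.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: share of actual spam that was flagged.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.99
print(f"recall   = {recall:.2f}")    # recall   = 0.00
```

The accuracy is high only because the negative class dominates; recall shows that not a single spam email was caught.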

flowchart TB
  P["What kind of problem?"]
  P -->|"Regression"| R["MAE · RMSE · R²"]
  P -->|"Balanced classification"| BC["Accuracy · F1"]
  P -->|"Imbalanced classification"| IC["Precision · Recall · F1 · ROC-AUC · PR-AUC"]
  P -->|"Cost-sensitive (medical, fraud)"| CS["Recall (catch positives)<br/>+ business cost"]
Choose the metric that matches the question — never the one that gives the prettiest number.

Classification metrics — the confusion matrix

Every classification metric is derived from the confusion matrix:

| | Predicted positive | Predicted negative |
| --- | --- | --- |
| Actual positive | True Positive (TP) | False Negative (FN) |
| Actual negative | False Positive (FP) | True Negative (TN) |

From these four numbers:

| Metric | Formula | Question it answers |
| --- | --- | --- |
| Accuracy | (TP + TN) / total | “Overall, how often is the model right?” |
| Precision | TP / (TP + FP) | “When I say positive, am I right?” |
| Recall | TP / (TP + FN) | “Do I catch all the positives?” |
| F1 | 2·P·R / (P + R) | “Balance between precision and recall.” |
| ROC-AUC | area under the ROC curve | “How well does the model rank positives vs negatives?” |

from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
A worked example — spam detection on 1,000 emails

Suppose a spam filter has been evaluated on 1,000 emails: 100 are actually spam (positive class), 900 are legitimate (negative class). The confusion matrix looks like:

| | Predicted spam | Predicted legitimate |
| --- | --- | --- |
| Actual spam (100) | TP = 80 | FN = 20 |
| Actual legitimate (900) | FP = 30 | TN = 870 |

Computing each metric:

| Metric | Formula | Calculation | Value | Reads as |
| --- | --- | --- | --- | --- |
| Accuracy | (TP + TN) / total | (80 + 870) / 1000 | 0.95 | “95% of emails correctly labelled” (sounds great) |
| Precision | TP / (TP + FP) | 80 / (80 + 30) | 0.73 | “When the filter says spam, it’s right 73% of the time” |
| Recall | TP / (TP + FN) | 80 / (80 + 20) | 0.80 | “The filter catches 80% of real spam; 20% slips through” |
| F1 | 2·P·R / (P + R) | 2·0.73·0.80 / (0.73 + 0.80) | 0.76 | “Balanced view of precision and recall” |
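The same arithmetic can be checked in a few lines of plain Python. This is only a sketch that plugs in the four counts from the spam example; the variable names are illustrative.

```python
# Counts from the spam-filter confusion matrix in this example.
tp, fn, fp, tn = 80, 20, 30, 870

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Baseline that always predicts "legitimate": every legitimate email is
# right, every spam email is wrong.
baseline_accuracy = (tn + fp) / total

results = {
    "accuracy": accuracy,
    "precision": precision,
    "recall": recall,
    "f1": f1,
    "baseline accuracy": baseline_accuracy,
}
for name, value in results.items():
    print(f"{name:18s}{value:.2f}")
```

Rounded to two decimals this gives 0.95, 0.73, 0.80, 0.76 and 0.90, matching the table and the always-“legitimate” baseline discussed next.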

Three observations:

  1. Accuracy looks great (95%) but masks the problem. A trivial model that always predicts “legitimate” would score 90%: only 5 points worse, with zero useful behaviour.
  2. Precision 0.73 means 30 of the 110 flagged emails (27%) are legitimate messages wrongly banished to spam. That is the user-experience cost.
  3. Recall 0.80 means 20 spam emails out of 100 reach the inbox. That is a different cost. A medical screening team would treat 20 missed positives very differently from 30 false alarms.

The same model can be called “acceptable” (recall-driven team) or “unusable” (precision-driven team). The metric is the conversation.

  • Cancer screening, fraud detection → optimise recall (don’t miss positives).
  • Spam filter, content moderation → optimise precision (don’t ban innocents).
  • General-purpose → optimise F1 or business cost.
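One way to turn "business cost" into a number is to put a price on each kind of error. The sketch below reuses the FP and FN counts from the spam example; the unit costs are hypothetical and exist only to show how the same confusion matrix can look cheap to one team and expensive to another.

```python
# Error counts from the spam example: 30 false positives, 20 false negatives.
fp, fn = 30, 20


def total_cost(fp, fn, cost_fp, cost_fn):
    """Total cost of errors given a price per false positive and false negative."""
    return fp * cost_fp + fn * cost_fn


# Hypothetical unit costs (assumptions, not taken from the text):
# a precision-driven team pays more for each false alarm...
precision_team_cost = total_cost(fp, fn, cost_fp=5.0, cost_fn=1.0)
# ...while a recall-driven team pays far more for each missed positive.
recall_team_cost = total_cost(fp, fn, cost_fp=1.0, cost_fn=50.0)

print(f"precision-driven team: {precision_team_cost:.0f}")  # 170
print(f"recall-driven team:    {recall_team_cost:.0f}")     # 1030
```

With real cost estimates in place of these placeholders, the same function could serve as the quantity to minimise when comparing models.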
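
Regression metrics

| Metric | Formula | Reads as |
| --- | --- | --- |
| MAE (Mean Absolute Error) | average of abs(y - ŷ) | “Typical error size, in the unit of y.” |
| RMSE | sqrt(average of (y - ŷ)²) | “MAE that punishes big errors more.” |
| R² | 1 − SS_res / SS_tot | “Fraction of variance explained. 0 = no better than predicting the mean, 1 = perfect.” |
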
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
mae = mean_absolute_error(y_test, y_pred)
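# Take the square root explicitly: the squared=False argument was deprecated
# and later removed in recent scikit-learn releases.
rmse = mean_squared_error(y_test, y_pred) ** 0.5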
r2 = r2_score(y_test, y_pred)

Rule of thumb: report MAE (interpretable, in the unit of y) and R² (relative to a naive baseline).
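To see what the three formulas do, here is a small sketch that computes them by hand in plain Python. The four target/prediction pairs are invented for illustration.

```python
import math

# Hypothetical targets and predictions (illustrative values only).
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]

# MAE: average absolute error, in the unit of y.
mae = sum(abs(e) for e in errors) / n

# RMSE: square root of the average squared error; large errors weigh more.
ss_res = sum(e * e for e in errors)
rmse = math.sqrt(ss_res / n)

# R²: 1 - SS_res / SS_tot, where SS_tot is the squared error of always
# predicting the mean of y (the naive baseline).
mean_y = sum(y_true) / n
ss_tot = sum((t - mean_y) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(f"MAE  = {mae:.3f}")   # MAE  = 0.875
print(f"RMSE = {rmse:.3f}")  # RMSE = 1.146
print(f"R2   = {r2:.3f}")    # R2   = 0.644
```

RMSE comes out above MAE here because the single error of 2.0 is squared before averaging.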

Most models have knobs that aren’t learned from data — you have to choose them. Examples:

  • Random forest: n_estimators, max_depth, min_samples_split
  • Gradient boosting: learning_rate, n_estimators, max_depth
  • SVM: C, kernel, gamma

Three search strategies:

flowchart LR
  G["Grid search<br/>try every combo"] --> R["Random search<br/>try N random combos"] --> B["Bayesian search<br/>learn from past tries"]
  classDef slow fill:#fee2e2,stroke:#dc2626
  classDef med fill:#fef3c7,stroke:#c2410c
  classDef fast fill:#d1fae5,stroke:#047857
  G:::slow
  R:::med
  B:::fast
From slowest/most thorough (grid) to smartest (Bayesian). For most projects, random search is the sweet spot.
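The cost difference between grid and random search is easy to count. The sketch below uses only the standard library; the search space and the budget of 10 random tries are hypothetical, and the sampling is a simplification (it may repeat a combination).

```python
import itertools
import random

# Hypothetical search space for a random forest.
space = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
}

# Grid search: every combination of values is a candidate.
grid = [dict(zip(space, combo)) for combo in itertools.product(*space.values())]

# Random search: a fixed budget of candidates, each drawn at random.
rng = random.Random(0)
n_iter = 10
random_candidates = [
    {name: rng.choice(values) for name, values in space.items()}
    for _ in range(n_iter)
]

# With 5-fold cross-validation, every candidate is fitted 5 times.
cv = 5
grid_fits = len(grid) * cv
random_fits = n_iter * cv

print(f"grid search:   {len(grid)} candidates, {grid_fits} fits")   # 27 candidates, 135 fits
print(f"random search: {n_iter} candidates, {random_fits} fits")    # 10 candidates, 50 fits
```

Adding one more hyperparameter with three values would triple the grid, while the random-search budget stays whatever you set it to.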
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Every combination in the grid is evaluated with 5-fold cross-validation.
param_grid = {
    'n_estimators': [100, 200, 500],
    'max_depth': [None, 10, 20],
}
gs = GridSearchCV(RandomForestClassifier(), param_grid, cv=5, scoring='f1')
gs.fit(X_train, y_train)
print(gs.best_params_)
print(gs.best_score_)

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
dist = {
    'n_estimators': randint(50, 500),
    'max_depth': randint(3, 30),
}
rs = RandomizedSearchCV(RandomForestClassifier(), dist, n_iter=40, cv=5, scoring='f1')
rs.fit(X_train, y_train)

Important: hyperparameter search uses cross-validation on the training set. The test set stays untouched.
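A minimal end-to-end sketch of that protocol, assuming scikit-learn is installed. The synthetic dataset, the tiny grid and the fold count are arbitrary choices to keep the example quick, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced toy data (roughly 10% positives); purely illustrative.
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Split first. The test part is set aside until the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The search only ever sees the training part; cross-validation happens inside it.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid={"max_depth": [3, None]},
    cv=3,
    scoring="f1",
)
search.fit(X_train, y_train)

# A single evaluation on the held-out test set, using the metric that was tuned.
test_f1 = f1_score(y_test, search.best_estimator_.predict(X_test))

print("best params:", search.best_params_)
print("cross-validated F1:", round(search.best_score_, 3))
print("test F1:", round(test_f1, 3))
```

The cross-validated score guides the choice of hyperparameters; the test score is computed once, after that choice is final.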

Train several candidates the same way (same preprocessing, same CV, same metric):

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
models = {
    'logreg': LogisticRegression(max_iter=1000),
    'rf': RandomForestClassifier(n_estimators=200),
    'gboost': GradientBoostingClassifier(),
}
for name, m in models.items():
    scores = cross_val_score(m, X_train, y_train, cv=5, scoring='f1')
    print(f"{name:7s} {scores.mean():.3f} ± {scores.std():.3f}")

Pick the model that wins on the metric you care about AND has reasonable variance. A model with 0.85 ± 0.02 is usually better than 0.86 ± 0.10.
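The sketch below shows where such numbers come from. The per-fold scores are hypothetical, chosen to reproduce the 0.85 ± 0.02 and 0.86 ± 0.10 case; "mean minus one standard deviation" is used only as a crude pessimistic summary, not as an established selection rule.

```python
import statistics

# Hypothetical per-fold F1 scores for two candidate models.
fold_scores = {
    "steady": [0.82, 0.85, 0.88, 0.84, 0.86],
    "erratic": [0.72, 0.95, 0.80, 0.99, 0.84],
}

summary = {}
for name, scores in fold_scores.items():
    mean = statistics.mean(scores)
    # Population standard deviation, which is what NumPy's .std() computes by default.
    std = statistics.pstdev(scores)
    summary[name] = (mean, std, mean - std)
    print(f"{name:8s}{mean:.2f} ± {std:.2f}   pessimistic: {mean - std:.2f}")

# Pick the candidate with the best pessimistic score.
choice = max(summary, key=lambda name: summary[name][2])
print("choice:", choice)  # choice: steady
```

The erratic model wins on the mean by one point, but a bad fold costs it far more than that.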

A model that scores 95% but does it for the wrong reason is dangerous in production. Two cheap interpretability tools:

import pandas as pd
imp = pd.Series(model.feature_importances_, index=X.columns)
imp.sort_values(ascending=False).head(10).plot.barh()

If your top feature is customer_id, you have a leakage problem.

SHAP values — per-prediction explanations

import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)

SHAP tells you, for each prediction, how much each feature pushed the answer up or down. Indispensable for any model that affects humans (loans, hiring, medicine).
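The "pushed up or down" idea can be shown without the `shap` library. For a linear model with independent features, the SHAP value of a feature is its weight times the feature's distance from the average. The sketch below uses that special case only; the model, feature names and data are hypothetical, and tree models need the dedicated explainer shown above.

```python
# Hypothetical linear scoring model: prediction = bias + sum(weight * feature).
weights = {"income": 0.5, "debt": -2.0}
bias = 1.0


def predict(row):
    return bias + sum(weights[name] * row[name] for name in weights)


# Hypothetical background data used to define the "average" prediction.
background = [
    {"income": 2.0, "debt": 1.0},
    {"income": 4.0, "debt": 0.0},
    {"income": 6.0, "debt": 2.0},
]
feature_means = {
    name: sum(row[name] for row in background) / len(background)
    for name in weights
}
base_value = sum(predict(row) for row in background) / len(background)

# Explain one prediction: each feature's push away from the base value.
x = {"income": 8.0, "debt": 0.5}
contributions = {
    name: weights[name] * (x[name] - feature_means[name]) for name in weights
}

print("base value:   ", base_value)      # 1.0
print("contributions:", contributions)   # {'income': 2.0, 'debt': 1.0}
print("prediction:   ", predict(x))      # 4.0

# Additivity: base value plus all contributions recovers the prediction.
reconstructed = base_value + sum(contributions.values())
```

Here both features push the score up: income because it is above average with a positive weight, debt because it is below average with a negative weight.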

Once tuning and comparison are done, take your single best model and run it once on the test set:

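# .score() is accuracy for classifiers and R² for regressors; use the metric you tuned on if it differs.
final_score = best_model.score(X_test, y_test)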

That’s the number you report. Anything else is self-deception.

  • Accuracy is misleading on imbalanced data — use precision / recall / F1.
  • For regression: MAE for interpretability, R² for context.
  • Hyperparameter search uses cross-validation; the test set stays sacred.
  • Compare models with the same CV protocol and look at mean ± std.
  • Interpret your model (importance, SHAP) before shipping.

Next: Deployment & monitoring — getting the model into production and keeping it alive.