Evaluation & tuning

Accuracy is a lie (most of the time)

If 99% of your emails are not spam and your model predicts “not spam” for everything, your accuracy is 99% — and your model is useless. Picking the right metric is half the battle.
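The effect is easy to reproduce by scoring the always-"not spam" baseline directly. The sketch below uses made-up labels (990 legitimate, 10 spam, matching the 99% figure above) and plain Python; the variable names are illustrative only.

```python
# 1,000 emails: 10 spam (label 1) and 990 legitimate (label 0), i.e. 99% "not spam".
y_true = [1] * 10 + [0] * 990

# A "model" that predicts "not spam" for every single email.
y_pred = [0] * len(y_true)

# Accuracy: share of all emails labelled correctly.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: share of the actual spam that was caught.
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = caught / sum(y_true)

print(f"accuracy = {accuracy:.2f}")  # 0.99
print(f"recall   = {recall:.2f}")    # 0.00
```

The accuracy is high only because the negative class dominates; recall shows that not one spam email was caught.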

flowchart TB
  P["What kind of problem?"]
  P -->|"Regression"| R["MAE · RMSE · R²"]
  P -->|"Balanced classification"| BC["Accuracy · F1"]
  P -->|"Imbalanced classification"| IC["Precision · Recall · F1 · ROC-AUC · PR-AUC"]
  P -->|"Cost-sensitive (medical, fraud)"| CS["Recall (catch positives)<br/>+ business cost"]

Choose the metric that matches the question — never the one that gives the prettiest number.

Classification metrics — the confusion matrix

Every classification metric is derived from the confusion matrix:

	Predicted positive	Predicted negative
Actual positive	True Positive (TP)	False Negative (FN)
Actual negative	False Positive (FP)	True Negative (TN)

From these four numbers:

Metric	Formula	Question it answers
Accuracy	(TP + TN) / total	“Overall, how often is the model right?”
Precision	TP / (TP + FP)	“When I say positive, am I right?”
Recall	TP / (TP + FN)	“Do I catch all the positives?”
F1	2·P·R / (P+R)	“Balance between precision and recall.”
ROC-AUC	(curve)	“How well does the model rank positives vs negatives?”

from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
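ROC-AUC is the one metric in the table without a closed formula. One way to read it is as the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. The sketch below computes it that way by brute force over all pairs; the function name and the toy scores are assumptions for illustration, and a library implementation is the better choice on real data.

```python
def roc_auc_by_ranking(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: three positives, four negatives, with one positive scored too low.
y_true = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]

auc = roc_auc_by_ranking(y_true, scores)
print(f"ROC-AUC = {auc:.3f}")  # 11 of 12 pairs are ordered correctly
```

Only the pair (positive at 0.4, negative at 0.7) is ranked the wrong way round, so the result is 11/12. No threshold is involved, which is why ROC-AUC measures ranking quality rather than the quality of any particular decision rule.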

A worked example — spam detection on 1,000 emails

Suppose a spam filter has been evaluated on 1,000 emails: 100 are actually spam (positive class), 900 are legitimate (negative class). The confusion matrix looks like:

	Predicted spam	Predicted legitimate
Actual spam (100)	TP = 80	FN = 20
Actual legitimate (900)	FP = 30	TN = 870

Computing each metric:

Metric	Formula	Calculation	Value	Reads as
Accuracy	(TP + TN) / total	(80 + 870) / 1000	0.95	“95% of emails correctly labelled” (sounds great)
Precision	TP / (TP + FP)	80 / (80 + 30)	0.73	“When the filter says spam, it’s right 73% of the time”
Recall	TP / (TP + FN)	80 / (80 + 20)	0.80	“The filter catches 80% of real spam; 20% slips through”
F1	2·P·R / (P+R)	2·0.73·0.80 / (0.73 + 0.80)	0.76	“Balanced view of precision and recall”
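The arithmetic in the table can be checked with a few lines of plain Python. This is a minimal sketch: the helper name `metrics_from_counts` is made up for illustration, and it assumes at least one predicted positive and one actual positive (otherwise precision or recall would divide by zero).

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Compute the four headline metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)   # of everything flagged, how much was right
    recall = tp / (tp + fn)      # of all real positives, how much was caught
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Counts from the spam example: 100 spam, 900 legitimate.
m = metrics_from_counts(tp=80, fp=30, fn=20, tn=870)
for name, value in m.items():
    print(f"{name:9s} {value:.2f}")

# Accuracy of a filter that labels everything "legitimate".
baseline_accuracy = 900 / 1000
print(f"baseline  {baseline_accuracy:.2f}")
```

The printed values match the table (0.95, 0.73, 0.80, 0.76), and the do-nothing baseline reaches 0.90 accuracy.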

Three observations:

Accuracy looks great (95%) but masks the problem. A trivial model that always predicts “legitimate” would score 90% — only 5 points worse, with zero useful behaviour.
Precision 0.73 means 30 of the 110 emails flagged as spam (27%) are legitimate messages wrongly banished to the spam folder. That is the user-experience cost.
Recall 0.80 means 20 spam emails out of 100 reach the inbox. That is a different cost. A medical screening team would treat 20 missed positives very differently from 30 false alarms.

The same model can be called “acceptable” (recall-driven team) or “unusable” (precision-driven team). The metric is the conversation.

When to optimise which

Cancer screening, fraud detection → optimise recall (don’t miss positives).
Spam filter, content moderation → optimise precision (don’t ban innocents).
General-purpose → optimise F1 or business cost.
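"Optimise business cost" can be made concrete by putting a price on each kind of error and choosing the decision threshold that minimises the total. The sketch below is one simple way to do that. The scores, the candidate thresholds, and the 10:1 cost ratios are all hypothetical, and in practice the threshold should be chosen on validation data, not on the test set.

```python
def total_cost(y_true, scores, threshold, cost_fp, cost_fn):
    """Total cost when every example with score >= threshold is called positive."""
    cost = 0.0
    for t, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and t == 0:
            cost += cost_fp      # false alarm
        elif pred == 0 and t == 1:
            cost += cost_fn      # missed positive
    return cost


def best_threshold(y_true, scores, cost_fp, cost_fn, candidates):
    """Return the candidate threshold with the lowest total cost (first one on ties)."""
    return min(candidates, key=lambda th: total_cost(y_true, scores, th, cost_fp, cost_fn))


# Hypothetical model scores for three positives and five negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]
candidates = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

# Screening-style costs: a missed positive is 10x worse than a false alarm.
recall_first = best_threshold(y_true, scores, cost_fp=1, cost_fn=10, candidates=candidates)

# Spam-filter-style costs: a false alarm is 10x worse than a miss.
precision_first = best_threshold(y_true, scores, cost_fp=10, cost_fn=1, candidates=candidates)

print(recall_first, precision_first)
```

With these numbers the recall-driven costs pick a low threshold (0.3, which catches every positive) and the precision-driven costs pick a high one (0.8, which raises no false alarms). The model is the same; only the cost of each mistake changed.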

Regression metrics

Metric	Formula	Reads as
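MAE (Mean Absolute Error)	average of `|y - ŷ|`	“Typical size of an error, in the units of y.”
RMSE	sqrt(average of `(y - ŷ)²`)	“MAE that punishes big errors more.”
R²	1 − SS_res / SS_tot	“Fraction of variance explained. 0 = no better than always predicting the mean, 1 = perfect; negative values mean worse than the mean.”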

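from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
mae  = mean_absolute_error(y_test, y_pred)
# Take the square root of the MSE; the squared=False argument was deprecated in scikit-learn 1.4 and removed in later releases
rmse = mean_squared_error(y_test, y_pred) ** 0.5
r2   = r2_score(y_test, y_pred)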

Rule of thumb: report MAE (interpretable, in the unit of y) and R² (relative to a naive baseline).
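To see why RMSE "punishes big errors more", it helps to compute all three metrics by hand on two sets of predictions with the same MAE. The sketch below uses plain Python and toy numbers chosen for illustration; `regression_metrics` is a hypothetical helper, not a library function.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R²) computed directly from the definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in errors)               # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)   # variation around the mean
    r2 = 1 - ss_res / ss_tot
    return mae, rmse, r2


y_true = [10.0, 12.0, 14.0, 16.0]
small_errors = [11.0, 11.0, 15.0, 15.0]    # every prediction is off by 1
one_big_error = [10.0, 12.0, 14.0, 12.0]   # three exact, one off by 4

mae_a, rmse_a, r2_a = regression_metrics(y_true, small_errors)
mae_b, rmse_b, r2_b = regression_metrics(y_true, one_big_error)

print(f"small errors : MAE={mae_a:.2f} RMSE={rmse_a:.2f} R2={r2_a:.2f}")
print(f"one big error: MAE={mae_b:.2f} RMSE={rmse_b:.2f} R2={r2_b:.2f}")
```

Both prediction sets have an MAE of 1.0, but the single large miss doubles the RMSE (1.0 versus 2.0) and drops R² from 0.8 to 0.2. If large errors are disproportionately costly in your setting, that is an argument for reporting RMSE alongside MAE.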

Hyperparameter tuning

Most models have knobs that aren’t learned from data — you have to choose them. Examples:

Random forest: n_estimators, max_depth, min_samples_split
Gradient boosting: learning_rate, n_estimators, max_depth
SVM: C, kernel, gamma

Three search strategies:

flowchart LR
  G["Grid search<br/>try every combo"] --> R["Random search<br/>try N random combos"] --> B["Bayesian search<br/>learn from past tries"]
  classDef slow fill:#fee2e2,stroke:#dc2626
  classDef med fill:#fef3c7,stroke:#c2410c
  classDef fast fill:#d1fae5,stroke:#047857
  G:::slow
  R:::med
  B:::fast

Grid search is the slowest but tries every combination you list; Bayesian search is the smartest because it uses earlier results to choose the next trial. For most projects, random search is the sweet spot.
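The difference between the first two strategies is small enough to sketch in plain Python. Everything below is a toy: `evaluate` is a made-up stand-in for a cross-validated score, and the grids and ranges are hypothetical. Bayesian search is left out because it needs a surrogate model, which is best taken from a dedicated library.

```python
import itertools
import random


def evaluate(params):
    """Stand-in for a cross-validated score (hypothetical): best at depth 12, 300 trees."""
    depth_penalty = abs(params["max_depth"] - 12) / 100
    trees_penalty = abs(params["n_estimators"] - 300) / 10000
    return 1.0 - depth_penalty - trees_penalty


def grid_search(grid):
    """Evaluate every combination of the listed values."""
    names = list(grid)
    combos = [dict(zip(names, values)) for values in itertools.product(*grid.values())]
    return max(combos, key=evaluate), len(combos)


def random_search(ranges, n_iter, seed=0):
    """Evaluate n_iter combinations drawn uniformly from integer ranges (inclusive)."""
    rng = random.Random(seed)
    combos = [
        {name: rng.randint(lo, hi) for name, (lo, hi) in ranges.items()}
        for _ in range(n_iter)
    ]
    return max(combos, key=evaluate), len(combos)


grid = {"n_estimators": [100, 200, 500], "max_depth": [5, 10, 20]}
ranges = {"n_estimators": (50, 500), "max_depth": (3, 30)}

best_grid, n_grid = grid_search(grid)                  # 3 x 3 = 9 evaluations
best_random, n_random = random_search(ranges, n_iter=9)  # same budget, different points

print(n_grid, best_grid)
print(n_random, best_random)
```

The grid can only ever return one of the values it was given, and its cost multiplies with every parameter added. Random search spends the same budget on points spread across the whole range, which is the usual argument for preferring it when you do not already know where the good values are.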

Grid search example

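from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# 3 x 3 = 9 combinations, each fitted 5 times (cv=5)
param_grid = {
    'n_estimators': [100, 200, 500],
    'max_depth':    [None, 10, 20],
}
gs = GridSearchCV(RandomForestClassifier(), param_grid, cv=5, scoring='f1')
gs.fit(X_train, y_train)

print(gs.best_params_)  # best combination found
print(gs.best_score_)   # mean cross-validated F1 of that combination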

Random search (recommended)

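from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# randint(low, high) samples integers from low (inclusive) to high (exclusive)
dist = {
    'n_estimators': randint(50, 500),
    'max_depth':    randint(3, 30),
}
# n_iter=40 draws 40 combinations; random_state makes the draw repeatable
rs = RandomizedSearchCV(RandomForestClassifier(), dist, n_iter=40, cv=5, scoring='f1', random_state=42)
rs.fit(X_train, y_train)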

Important: hyperparameter search uses cross-validation on the training set. The test set stays untouched.
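What "cross-validation on the training set" means in terms of row indices can be sketched without any library. The example below assumes a hypothetical dataset of 100 rows and a simple unshuffled split; real code would normally shuffle or stratify first, which `GridSearchCV` and `RandomizedSearchCV` handle through their `cv` argument.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample is used for validation exactly once."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


# Hypothetical dataset of 100 rows: set the last 20 aside as the test set first.
all_idx = list(range(100))
train_pool, test_idx = all_idx[:80], all_idx[80:]

# Cross-validation only ever sees the 80 training rows.
folds = list(k_fold_indices(len(train_pool), k=5))

for i, (train_idx, val_idx) in enumerate(folds):
    print(f"fold {i}: train on {len(train_idx)} rows, validate on {len(val_idx)} rows")
```

Every hyperparameter candidate is scored on the five validation folds, and rows 80 to 99 never appear in any of them. That separation is what keeps the final test score an honest estimate.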

Comparing models honestly

Train several candidates the same way (same preprocessing, same CV, same metric):

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

models = {
    'logreg':  LogisticRegression(max_iter=1000),
    'rf':      RandomForestClassifier(n_estimators=200),
    'gboost':  GradientBoostingClassifier(),
}

for name, m in models.items():
    scores = cross_val_score(m, X_train, y_train, cv=5, scoring='f1')
    print(f"{name:7s} {scores.mean():.3f} ± {scores.std():.3f}")

Pick the model that wins on the metric you care about AND has reasonable variance. A model with 0.85 ± 0.02 is usually better than 0.86 ± 0.10.
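The reason for preferring the steadier model is easier to see from the individual fold scores than from the summary. The per-fold numbers below are hypothetical, chosen to reproduce roughly the 0.85 versus 0.86 situation described above. Note that `statistics.stdev` is the sample standard deviation, so it comes out slightly larger than NumPy's default `scores.std()`.

```python
from statistics import mean, stdev

# Hypothetical per-fold F1 scores for two candidates (5-fold CV).
fold_scores = {
    "steady":  [0.83, 0.85, 0.87, 0.84, 0.86],
    "erratic": [0.98, 0.74, 0.92, 0.73, 0.93],
}

summary = {}
for name, scores in fold_scores.items():
    mu, sd = mean(scores), stdev(scores)
    summary[name] = {"mean": mu, "std": sd, "worst_fold": min(scores)}
    print(f"{name:8s} {mu:.3f} ± {sd:.3f}  (worst fold {min(scores):.2f})")
```

The erratic model has the higher mean, but on two of five folds it scores more than ten points below the steady one. A large spread means the score you get in production depends heavily on which data the model happens to see.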

Interpreting results

A model that scores 95% but does it for the wrong reason is dangerous in production. Two cheap interpretability tools:

Feature importance

import pandas as pd
# model is a fitted tree-based estimator; X is the DataFrame it was trained on
imp = pd.Series(model.feature_importances_, index=X.columns)
imp.sort_values(ascending=False).head(10).plot.barh()

If your top feature is customer_id, you have a leakage problem.
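Identifier columns can often be caught before training as well. The sketch below is a rough heuristic, not a leakage detector: it flags integer or string columns whose values are almost all distinct. The function name, the 0.95 cut-off, and the toy columns are assumptions for illustration, and a flagged column still needs a human decision.

```python
def id_like_columns(columns, min_unique_ratio=0.95):
    """Return names of integer/string columns whose values are (almost) all distinct."""
    flagged = []
    for name, values in columns.items():
        is_discrete = all(
            isinstance(v, (int, str)) and not isinstance(v, bool) for v in values
        )
        if not is_discrete:
            continue  # floats are skipped: continuous features are naturally near-unique
        if len(set(values)) / len(values) >= min_unique_ratio:
            flagged.append(name)
    return flagged


# Toy table stored as {column name: list of values}.
columns = {
    "customer_id": [1001, 1002, 1003, 1004, 1005, 1006],
    "age":         [34, 51, 34, 28, 51, 45],
    "spend":       [12.5, 80.1, 33.0, 7.25, 61.9, 19.4],
    "plan":        ["basic", "pro", "basic", "basic", "pro", "pro"],
}

suspects = id_like_columns(columns)
print(suspects)  # ['customer_id']
```

Dropping such columns before fitting removes one common source of leakage, but it does not replace the importance check: leakage can also hide in ordinary-looking features such as timestamps or fields filled in after the outcome was known.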

SHAP values — per-prediction explanations

import shap
# model must be a fitted tree-based model (random forest, gradient boosting)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)

SHAP tells you, for each prediction, how much each feature pushed the answer up or down. Indispensable for any model that affects humans (loans, hiring, medicine).
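The "pushed up or down" idea is easiest to see on a linear model, where (assuming independent features) each feature's SHAP value reduces to its weight times the feature's distance from its average. The sketch below computes those contributions by hand. It does not use the `shap` library, the weights and data are made up, and for tree models the values come from `TreeExplainer` instead.

```python
def linear_contributions(weights, bias, background, x):
    """Contributions w_i * (x_i - mean_i) for a linear model f(x) = bias + sum(w_i * x_i)."""
    n = len(background)
    means = [sum(row[i] for row in background) / n for i in range(len(weights))]
    # Baseline: the model's prediction for an "average" row.
    baseline = bias + sum(w * m for w, m in zip(weights, means))
    contributions = [w * (xi - m) for w, xi, m in zip(weights, x, means)]
    return baseline, contributions


# Hypothetical linear model with three features.
weights = [2.0, -1.0, 0.5]
bias = 1.0
background = [[1.0, 0.0, 4.0], [3.0, 2.0, 0.0], [2.0, 1.0, 2.0]]  # reference data
x = [4.0, 3.0, 2.0]                                               # row to explain

baseline, contributions = linear_contributions(weights, bias, background, x)
prediction = bias + sum(w * xi for w, xi in zip(weights, x))

print(f"baseline      = {baseline}")
print(f"contributions = {contributions}")
print(f"prediction    = {prediction}")
```

Here the first feature pushes the prediction up by 4, the second pulls it down by 2, and the third sits at its average and contributes nothing. The baseline plus the contributions equals the prediction exactly (5 + 4 − 2 + 0 = 7); SHAP values for any model share this additive property.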

The final test

Once tuning and comparison are done, take your single best model and run it once on the test set:

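from sklearn.metrics import f1_score
final_score = f1_score(y_test, best_model.predict(X_test))  # same metric as during tuning; .score() would report accuracy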

That’s the number you report. Anything else is self-deception.

Key takeaways

Accuracy is misleading on imbalanced data — use precision / recall / F1.
For regression: MAE for interpretability, R² for context.
Hyperparameter search uses cross-validation; the test set stays sacred.
Compare models with the same CV protocol and look at mean ± std.
Interpret your model (importance, SHAP) before shipping.

Next: Deployment & monitoring — getting the model into production and keeping it alive.