Evaluation & tuning
Accuracy is a lie (most of the time)
If 99% of your emails are not spam and your model predicts “not spam” for everything, your accuracy is 99%, and your model is useless. Picking the right metric is half the battle.
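A minimal sketch of that scenario in plain Python. The labels are made up (10 spam in 1,000 emails, an assumption chosen to match the 99% figure), and the "model" simply answers "not spam" every time:

```python
# Hypothetical data: 1,000 emails, 1% spam (1 = spam, 0 = not spam).
y_true = [1] * 10 + [0] * 990

# A "model" that ignores its input and always predicts "not spam".
y_pred = [0] * len(y_true)

# Accuracy: share of all emails labelled correctly.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: share of the real spam that was caught.
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = caught / sum(y_true)

print(f"accuracy = {accuracy:.2f}")  # 0.99
print(f"recall   = {recall:.2f}")    # 0.00
```

The accuracy is high only because the negative class dominates; recall exposes that not a single spam email was caught.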

    flowchart TB
      P["What kind of problem?"]
      P -->|"Regression"| R["MAE · RMSE · R²"]
      P -->|"Balanced classification"| BC["Accuracy · F1"]
      P -->|"Imbalanced classification"| IC["Precision · Recall · F1 · ROC-AUC · PR-AUC"]
      P -->|"Cost-sensitive (medical, fraud)"| CS["Recall (catch positives)<br/>+ business cost"]

Classification metrics — the confusion matrix
Most classification metrics are derived from the confusion matrix:
| | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | True Positive (TP) | False Negative (FN) |
| Actual negative | False Positive (FP) | True Negative (TN) |
From these four numbers:
| Metric | Formula | Question it answers |
|---|---|---|
| Accuracy | (TP + TN) / total | “Overall, how often is the model right?” |
| Precision | TP / (TP + FP) | “When I say positive, am I right?” |
| Recall | TP / (TP + FN) | “Do I catch all the positives?” |
| F1 | 2·P·R / (P+R) | “Balance between precision and recall.” |
| ROC-AUC | (curve) | “How well does the model rank positives vs negatives?” |
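ROC-AUC is the one row that does not come from a single confusion matrix. One way to read it: the share of (positive, negative) pairs in which the positive example gets the higher score. The sketch below computes it that way in plain Python; the function name `pairwise_auc` and the data are made up for illustration, and in practice `sklearn.metrics.roc_auc_score` is the call to use.

```python
from itertools import product

def pairwise_auc(y_true, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p, n in product(pos, neg):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.5, 0.3, 0.2, 0.1]

# 11 of the 12 pairs are ordered correctly (0.4 vs 0.5 is the miss).
print(f"{pairwise_auc(y_true, scores):.3f}")  # 0.917
```

This brute-force version is quadratic in the number of examples, so it is meant for building intuition on small inputs only.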

    from sklearn.metrics import classification_report, confusion_matrix

    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))

A worked example: spam detection on 1,000 emails
Suppose a spam filter has been evaluated on 1,000 emails: 100 are actually spam (positive class), 900 are legitimate (negative class). The confusion matrix looks like:
| | Predicted spam | Predicted legitimate |
|---|---|---|
| Actual spam (100) | TP = 80 | FN = 20 |
| Actual legitimate (900) | FP = 30 | TN = 870 |
Computing each metric:
| Metric | Formula | Calculation | Value | Reads as |
|---|---|---|---|---|
| Accuracy | (TP + TN) / total | (80 + 870) / 1000 | 0.95 | “95% of emails correctly labelled” (sounds great) |
| Precision | TP / (TP + FP) | 80 / (80 + 30) | 0.73 | “When the filter says spam, it’s right 73% of the time” |
| Recall | TP / (TP + FN) | 80 / (80 + 20) | 0.80 | “The filter catches 80% of real spam; 20% slips through” |
| F1 | 2·P·R / (P+R) | 2·0.73·0.80 / (0.73 + 0.80) | 0.76 | “Balanced view of precision and recall” |
Three observations:
- Accuracy looks great (95%) but masks the problem. A trivial model that always predicts “legitimate” would score 90%: only 5 points worse, with zero useful behaviour.
- Precision 0.73 means 30 of the 110 flagged emails (27%) are legitimate mail wrongly banished to spam. That’s the user-experience cost.
- Recall 0.80 means 20 spam emails out of 100 reach the inbox. That is a different cost. A medical screening team would treat 20 missed positives very differently from 30 false alarms.
The same model can be called “acceptable” (recall-driven team) or “unusable” (precision-driven team). The metric is the conversation.
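The arithmetic in the worked example can be checked in a few lines of plain Python. The four counts come straight from the confusion matrix above; the variable names are only for illustration.

```python
# Counts from the spam-filter confusion matrix.
tp, fn = 80, 20    # actual spam: caught / missed
fp, tn = 30, 870   # actual legitimate: wrongly flagged / correctly passed

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Baseline: always predict "legitimate", so every legitimate email is right.
baseline_accuracy = (fp + tn) / total

print(f"accuracy  = {accuracy:.2f}")           # 0.95
print(f"precision = {precision:.2f}")          # 0.73
print(f"recall    = {recall:.2f}")             # 0.80
print(f"F1        = {f1:.2f}")                 # 0.76
print(f"baseline  = {baseline_accuracy:.2f}")  # 0.90
```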
When to optimise which
- Cancer screening, fraud detection → optimise recall (don’t miss positives).
- Spam filter, content moderation → optimise precision (don’t ban innocents).
- General-purpose → optimise F1 or business cost.
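"Business cost" can be made concrete by pricing each kind of mistake. The sketch below is hypothetical throughout: the two candidate filters, their error counts, and the per-error costs are invented to show how the preferred model flips when the costs flip.

```python
def total_cost(fp, fn, cost_fp, cost_fn):
    """Total cost of a model's mistakes, given a price per error type."""
    return fp * cost_fp + fn * cost_fn

# Two hypothetical filters evaluated on the same 1,000 emails.
candidates = {
    "lenient": {"fp": 30, "fn": 20},  # flags more: fewer misses, more false alarms
    "strict": {"fp": 10, "fn": 40},   # flags less: fewer false alarms, more misses
}

def cheapest(cost_fp, cost_fn):
    """Name of the candidate with the lowest total cost."""
    return min(
        candidates,
        key=lambda name: total_cost(**candidates[name], cost_fp=cost_fp, cost_fn=cost_fn),
    )

# Missing a positive is expensive (screening, fraud): favour recall.
print(cheapest(cost_fp=5, cost_fn=50))   # lenient (1150 vs 2050)

# A false alarm is expensive (banning a legitimate user): favour precision.
print(cheapest(cost_fp=50, cost_fn=5))   # strict (700 vs 1600)
```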
Regression metrics
| Metric | Formula | Reads as |
|---|---|---|
| MAE (Mean Absolute Error) | average of \|y - ŷ\| | “Typical size of an error, in the units of y.” |
| RMSE | sqrt(average of (y - ŷ)²) | “MAE that punishes big errors more.” |
| R² | 1 − SS_res / SS_tot | “Fraction of variance explained. 0 = baseline (always predicting the mean), 1 = perfect.” |

    from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

    mae = mean_absolute_error(y_test, y_pred)
    # Take the square root by hand: the squared=False shortcut is deprecated
    # in recent scikit-learn releases.
    rmse = mean_squared_error(y_test, y_pred) ** 0.5
    r2 = r2_score(y_test, y_pred)

Rule of thumb: report MAE (interpretable, in the unit of y) and R² (relative to a naive baseline).
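To see what the three formulas do, here is a plain-Python sketch on four made-up data points. The helper names mirror the metrics but are written by hand for illustration; they are not the scikit-learn functions.

```python
from math import sqrt

def mae(y, y_hat):
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

def rmse(y, y_hat):
    return sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

def r2(y, y_hat):
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))  # model's squared error
    ss_tot = sum((a - mean_y) ** 2 for a in y)            # mean-baseline squared error
    return 1 - ss_res / ss_tot

# Hypothetical targets and predictions; the last prediction is off by 4.
y = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.0, 5.0, 8.0, 13.0]

print(f"MAE  = {mae(y, y_hat):.2f}")   # 1.50
print(f"RMSE = {rmse(y, y_hat):.2f}")  # 2.12, pulled up by the one large error
print(f"R2   = {r2(y, y_hat):.2f}")    # 0.10

# Always predicting the mean of y scores exactly 0 on R2.
mean_y = sum(y) / len(y)
print(f"baseline R2 = {r2(y, [mean_y] * len(y)):.2f}")  # 0.00
```

The gap between RMSE and MAE is itself informative: the larger it is, the more the error is concentrated in a few bad predictions.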
Hyperparameter tuning
Most models have knobs that aren’t learned from data; you have to choose them. Examples:
- Random forest: `n_estimators`, `max_depth`, `min_samples_split`
- Gradient boosting: `learning_rate`, `n_estimators`, `max_depth`
- SVM: `C`, `kernel`, `gamma`
Three search strategies:

    flowchart LR
      G["Grid search<br/>try every combo"] --> R["Random search<br/>try N random combos"] --> B["Bayesian search<br/>learn from past tries"]
      classDef slow fill:#fee2e2,stroke:#dc2626
      classDef med fill:#fef3c7,stroke:#c2410c
      classDef fast fill:#d1fae5,stroke:#047857
      G:::slow
      R:::med
      B:::fast

Grid search example

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    # Every combination is tried: 3 x 3 = 9 candidates, each scored with 5-fold CV.
    param_grid = {
        'n_estimators': [100, 200, 500],
        'max_depth': [None, 10, 20],
    }
    gs = GridSearchCV(RandomForestClassifier(), param_grid, cv=5, scoring='f1')
    gs.fit(X_train, y_train)

    print(gs.best_params_)
    print(gs.best_score_)

Random search (recommended)

    from scipy.stats import randint
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RandomizedSearchCV

    # randint(low, high) samples integers from low up to, but not including, high.
    dist = {
        'n_estimators': randint(50, 500),
        'max_depth': randint(3, 30),
    }
    rs = RandomizedSearchCV(RandomForestClassifier(), dist, n_iter=40, cv=5, scoring='f1')
    rs.fit(X_train, y_train)

Important: hyperparameter search uses cross-validation on the training set. The test set stays untouched.
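The practical difference between the two searches is how their cost grows. The sketch below counts cross-validation fits in plain Python, using the grid and `n_iter` values from the examples above; the extra `min_samples_split` values are an assumption added to show the growth.

```python
from itertools import product

def grid_combinations(grid):
    """Expand a parameter grid into the list of every combination."""
    return [dict(zip(grid, values)) for values in product(*grid.values())]

cv_folds = 5

param_grid = {"n_estimators": [100, 200, 500], "max_depth": [None, 10, 20]}
combos = grid_combinations(param_grid)
grid_fits = len(combos) * cv_folds
print(len(combos), grid_fits)  # 9 combinations, 45 fits

# Adding one more parameter multiplies the grid (hypothetical values).
bigger_grid = {**param_grid, "min_samples_split": [2, 5, 10, 20]}
bigger_fits = len(grid_combinations(bigger_grid)) * cv_folds
print(bigger_fits)  # 36 combinations -> 180 fits

# Random search cost is set by n_iter, however many parameters are searched.
n_iter = 40
random_fits = n_iter * cv_folds
print(random_fits)  # 200 fits
```

Grid cost multiplies with every parameter added, while random search keeps a fixed budget that you choose up front.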
Comparing models honestly
Train several candidates the same way (same preprocessing, same CV, same metric):

    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    models = {
        'logreg': LogisticRegression(max_iter=1000),
        'rf': RandomForestClassifier(n_estimators=200),
        'gboost': GradientBoostingClassifier(),
    }

    for name, m in models.items():
        # Same folds and same metric for every candidate.
        scores = cross_val_score(m, X_train, y_train, cv=5, scoring='f1')
        print(f"{name:7s} {scores.mean():.3f} ± {scores.std():.3f}")

Pick the model that wins on the metric you care about AND has reasonable variance. A model with 0.85 ± 0.02 is usually a safer choice than one with 0.86 ± 0.10: a 0.01 gap in the mean is well inside the fold-to-fold noise.
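One simple way to fold variance into the comparison is to rank models by mean minus one standard deviation. This is a heuristic chosen for illustration, not a standard rule, and the per-fold scores below are invented.

```python
from statistics import mean, pstdev

# Hypothetical 5-fold F1 scores for two models.
cv_scores = {
    "steady": [0.83, 0.85, 0.87, 0.84, 0.86],
    "erratic": [0.70, 0.98, 0.86, 0.76, 1.00],
}

def pessimistic_score(scores):
    """Mean minus one population standard deviation (same std as NumPy's default)."""
    return mean(scores) - pstdev(scores)

for name, scores in cv_scores.items():
    print(f"{name:8s} {mean(scores):.3f} ± {pstdev(scores):.3f}")

best_by_mean = max(cv_scores, key=lambda n: mean(cv_scores[n]))
best_pessimistic = max(cv_scores, key=lambda n: pessimistic_score(cv_scores[n]))

print(best_by_mean)      # erratic: slightly higher mean
print(best_pessimistic)  # steady: nearly the same mean, far less spread
```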
Interpreting results
A model that scores 95% but does it for the wrong reason is dangerous in production. Two cheap interpretability tools:
Feature importance

    import pandas as pd

    # Works for fitted tree-based models, which expose feature_importances_.
    imp = pd.Series(model.feature_importances_, index=X.columns)
    imp.sort_values(ascending=False).head(10).plot.barh()

If your top feature is `customer_id`, you have a leakage problem.
SHAP values — per-prediction explanations

    import shap

    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_test)
    shap.summary_plot(shap_values, X_test)

SHAP tells you, for each prediction, how much each feature pushed the answer up or down. Indispensable for any model that affects humans (loans, hiring, medicine).
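The "pushed up or down" idea is easiest to see on a linear model, where the SHAP value of a feature has a closed form if the features are assumed independent: weight × (value − average value). The sketch below uses that closed form in plain Python. It is not what `TreeExplainer` runs internally, and the weights, feature names, and rows are all hypothetical.

```python
from statistics import mean

# Hypothetical linear scoring model: f(x) = bias + sum(weight * feature).
weights = {"income": 0.5, "debt": -2.0, "age": 0.1}
bias = 1.0

def predict(row):
    return bias + sum(weights[k] * row[k] for k in weights)

# Background data that defines the "average" input.
background = [
    {"income": 30, "debt": 5, "age": 25},
    {"income": 50, "debt": 10, "age": 35},
    {"income": 70, "debt": 15, "age": 45},
]
feature_means = {k: mean(r[k] for r in background) for k in weights}
base_value = mean(predict(r) for r in background)  # average prediction

def linear_shap(row):
    """Per-feature contribution relative to the average prediction."""
    return {k: weights[k] * (row[k] - feature_means[k]) for k in weights}

applicant = {"income": 70, "debt": 5, "age": 35}
contributions = linear_shap(applicant)
print(contributions)  # income pushes up by 10, low debt pushes up by 10, age adds 0

# The contributions always add up to (prediction - average prediction).
print(base_value + sum(contributions.values()), predict(applicant))  # 29.5 29.5
```

That additivity is the property the summary plot relies on: every prediction is the base value plus the sum of its per-feature pushes.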
The final test
Once tuning and comparison are done, take your single best model and run it once on the test set:

    # For scikit-learn classifiers, .score() returns accuracy (R² for regressors).
    # On imbalanced data, compute the metric you tuned for instead.
    final_score = best_model.score(X_test, y_test)

That’s the number you report. Anything else is self-deception.
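Why only once? A small simulation, under deliberately artificial assumptions: 50 "models" that guess at random (so their true accuracy is 0.5), with the winner chosen by peeking at the test set. Everything here is made up for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
N = 200

def coin_flips(n):
    return [rng.randint(0, 1) for _ in range(n)]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

y_test = coin_flips(N)   # the held-out test labels
y_fresh = coin_flips(N)  # stand-in for data that arrives after deployment

# 50 hypothetical models that guess at random on both datasets.
candidates = [
    {"test_pred": coin_flips(N), "fresh_pred": coin_flips(N)}
    for _ in range(50)
]

# The mistake: pick the winner by looking at the test set.
winner = max(candidates, key=lambda m: accuracy(y_test, m["test_pred"]))

reported = accuracy(y_test, winner["test_pred"])   # inflated by the selection
actual = accuracy(y_fresh, winner["fresh_pred"])   # what production would see

print(f"reported on test set: {reported:.3f}")
print(f"actual on fresh data: {actual:.3f}")
```

The reported score lands above 0.5 purely because the best of 50 lucky draws was selected, while performance on fresh data stays near chance. Each extra look at the test set adds a little of the same optimism.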
Key takeaways
- Accuracy is misleading on imbalanced data; use precision / recall / F1.
- For regression: MAE for interpretability, R² for context.
- Hyperparameter search uses cross-validation; the test set stays sacred.
- Compare models with the same CV protocol and look at mean ± std.
- Interpret your model (importance, SHAP) before shipping.
Next: Deployment & monitoring — getting the model into production and keeping it alive.