Features: target, encoding, scaling

Every supervised model expects two things:

  • y — the target (what you want to predict), a single column.
  • X — the features (what you feed in), a 2D matrix of numbers.
flowchart LR
  DF["Cleaned DataFrame<br/>(rows = examples<br/>cols = anything)"] --> Y["Target y<br/>1 column"]
  DF --> X["Features X<br/>only numbers"]
  classDef src fill:#ddd6fe,stroke:#7c3aed,color:#1e1b4b
  classDef target fill:#fde68a,stroke:#c2410c,color:#451a03
  classDef feat fill:#d1fae5,stroke:#047857,color:#064e3b
  DF:::src
  Y:::target
  X:::feat
Splitting a clean DataFrame into target and features — the moment ML really begins.

Three questions, one decision:

  1. What does the business need to predict?
  2. Is it a number or a category?
    • Number → regression (price, temperature, sales).
    • Category → classification (spam / not spam, churn / stay).
  3. Is the data available for past examples? If you can’t get historical y, you can’t train.
y = df['price'] # regression: continuous target
y = df['churned'] # classification: binary target

Picking the features X — what to include, what to drop

The temptation is to throw every column at the model. Resist:

| Keep | Drop |
| --- | --- |
| Columns causally linked to y | IDs, timestamps with no signal |
| Columns available at prediction time | “Future-leaking” columns |
| Columns with reasonable missingness | Columns 80% empty with no signal |
| Diverse signals (not duplicates) | Two columns that are 0.99 correlated |

Leakage warning: if a feature is only known after y happens (like “did the customer call support after churning?”), the model will look near-perfect on historical data and then fail in production, where that value does not exist yet. Always ask: “would I have this value before knowing y?”

X = df.drop(columns=['price', 'customer_id', 'data_load_date'])
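
Two rows of the Keep / Drop table can be checked mechanically: mostly-empty columns and near-duplicate columns. The sketch below is one possible way to do it with pandas. The helper name `prune_features`, its default thresholds (taken from the 80% and 0.99 figures above) and the demo columns are all hypothetical. Leakage and causal relevance still need human judgment.

```python
import numpy as np
import pandas as pd

def prune_features(X: pd.DataFrame, max_missing: float = 0.8,
                   max_corr: float = 0.99) -> pd.DataFrame:
    """Drop mostly-empty columns, then one column of each near-duplicate numeric pair."""
    # 1) Keep only columns whose share of missing values is below the threshold.
    missing_share = X.isna().mean()
    X = X.loc[:, missing_share < max_missing]

    # 2) Absolute correlations between numeric columns, upper triangle only,
    #    so each pair is inspected once and the first column of a pair survives.
    corr = X.select_dtypes('number').corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    near_duplicates = [col for col in upper.columns if (upper[col] > max_corr).any()]
    return X.drop(columns=near_duplicates)

# Hypothetical demo data
demo = pd.DataFrame({
    'surface_m2': [50.0, 80.0, 120.0, 65.0, 95.0],
    'surface_sqft': [538.0, 861.0, 1292.0, 700.0, 1023.0],  # same information, other unit
    'rooms': [3, 2, 5, 3, 4],
    'notes_score': [np.nan, np.nan, np.nan, np.nan, 1.0],   # 80% empty
})
pruned = prune_features(demo)
print(list(pruned.columns))  # → ['surface_m2', 'rooms']
```

Which column of a correlated pair to keep is a judgment call; this sketch simply keeps the one that appears first.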

A model only eats numbers. Three encoding strategies:

flowchart TB
  C["Categorical column"]
  C --> O["Ordinal?<br/>(low &lt; medium &lt; high)"]
  C --> N["Nominal?<br/>(red, blue, green)"]
  C --> H["High cardinality?<br/>(10k+ values)"]
  O --> OE["Ordinal encoding<br/>{low: 0, medium: 1, high: 2}"]
  N --> OH["One-hot encoding<br/>3 binary columns"]
  H --> TE["Target encoding<br/>or embedding"]
Three flavours of categorical encoding — pick by the kind of column.
import pandas as pd
X = pd.get_dummies(X, columns=['country', 'product'])
# Adds country_FR, country_US, ... binary columns

Use one-hot encoding for nominal variables (no order) with fewer than 50 distinct values.

from sklearn.preprocessing import OrdinalEncoder
order = [['low', 'medium', 'high']]
X['priority'] = OrdinalEncoder(categories=order).fit_transform(X[['priority']])

Use ordinal encoding only when the order is real (T-shirt sizes, satisfaction scores).

Target encoding replaces each category with the mean of y for that category. It is powerful but easy to overfit, so always compute it inside cross-validation.

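from category_encoders import TargetEncoder
# cols= forces encoding even when zip codes are stored as integers
encoder = TargetEncoder(cols=['zipcode'])
X['zipcode'] = encoder.fit_transform(X[['zipcode']], y)['zipcode']  # fit on training rows only in real use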
The same column under the three encodings, side by side

Suppose a dataset has a colour column with three values (red, blue, green) and a binary target bought indicating whether the customer purchased:

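| Row | colour | bought (y) |
| --- | --- | --- |
| 1 | red | 1 |
| 2 | red | 1 |
| 3 | red | 0 |
| 4 | blue | 1 |
| 5 | blue | 0 |
| 6 | green | 0 |
| 7 | green | 0 |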

Each encoding turns this single column into a different numeric representation:

| Row | One-hot encoding (c_red, c_blue, c_green) | Ordinal encoding (colour_ord) | Target encoding, mean of y per category (colour_te) |
| --- | --- | --- | --- |
| 1 (red) | 1, 0, 0 | 0 | 2/3 ≈ 0.67 |
| 2 (red) | 1, 0, 0 | 0 | 0.67 |
| 3 (red) | 1, 0, 0 | 0 | 0.67 |
| 4 (blue) | 0, 1, 0 | 1 | 1/2 = 0.50 |
| 5 (blue) | 0, 1, 0 | 1 | 0.50 |
| 6 (green) | 0, 0, 1 | 2 | 0/2 = 0.00 |
| 7 (green) | 0, 0, 1 | 2 | 0.00 |

| Property | One-hot | Ordinal | Target encoding |
| --- | --- | --- | --- |
| Number of columns produced | 3 (= cardinality) | 1 | 1 |
| Introduces a fake order? | No | Yes: the model will think green > blue > red | No |
| Carries signal about y? | No (the model has to learn it) | No | Yes, already pre-aggregated |
| Risk of overfit / leakage? | Low | Low | High: must be computed inside cross-validation, not on the full dataset |
| Behaviour at high cardinality (10,000 zip codes) | Disaster (10,000 columns) | OK but meaningless | Best fit: stays 1 column |

import pandas as pd

df = pd.DataFrame({
    'colour': ['red', 'red', 'red', 'blue', 'blue', 'green', 'green'],
    'bought': [1, 1, 0, 1, 0, 0, 0],
})

# 1) one-hot
print(pd.get_dummies(df, columns=['colour']))

# 2) ordinal
df['colour_ord'] = df['colour'].map({'red': 0, 'blue': 1, 'green': 2})
print(df)

# 3) target encoding (computed by category mean, simplified)
te = df.groupby('colour')['bought'].mean()
df['colour_te'] = df['colour'].map(te)
print(df)

Three takeaways:

  1. One-hot is the safe default: no order is implied, easy to interpret. The price is the column explosion at high cardinality.
  2. Ordinal only works if the order is real. Applied to red / blue / green, the model gets a false signal (green > red) that it will exploit silently.
  3. Target encoding leaks if used naïvely. Computing the mean of y on the same rows the model trains on lets it peek at the answer. In practice, compute each row's encoding from folds that exclude that row, and smooth rare categories toward the overall mean (as category_encoders.TargetEncoder does internally).
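
The third takeaway can be made concrete with an out-of-fold version of the toy example. This is a minimal sketch, not the way category_encoders implements it: the column name `colour_te_oof`, the choice of 3 folds and the fallback rule for categories missing from the fitting folds are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    'colour': ['red', 'red', 'red', 'blue', 'blue', 'green', 'green'],
    'bought': [1, 1, 0, 1, 0, 0, 0],
})

# Out-of-fold encodings, filled in fold by fold
oof = pd.Series(index=df.index, dtype=float)
kf = KFold(n_splits=3, shuffle=True, random_state=0)

for fit_idx, enc_idx in kf.split(df):
    fit_rows = df.iloc[fit_idx]
    # Category means learned WITHOUT the rows that are about to be encoded
    means = fit_rows.groupby('colour')['bought'].mean()
    # A category absent from the fitting folds falls back to their overall mean
    fallback = fit_rows['bought'].mean()
    encoded = df['colour'].iloc[enc_idx].map(means).fillna(fallback)
    oof.iloc[enc_idx] = encoded.to_numpy()

df['colour_te_oof'] = oof
print(df)
```

Unlike the naïve category mean, no row's encoding includes that row's own `bought` value, so the feature can no longer memorise the target.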

Some models care a lot about scale, others not at all.

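| Algorithm | Cares about scale? |
| --- | --- |
| Linear / logistic regression | Yes (whenever regularisation or gradient descent is involved) |
| SVM, k-NN, k-means | Yes (a lot) |
| Neural networks | Yes |
| Decision trees / random forest | No |
| Gradient boosting (XGBoost, LightGBM) | No |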
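
To see why distance-based models sit in the “Yes (a lot)” row, the sketch below finds the nearest neighbour of one customer twice, once in raw units and once after standardising. The customers and the helper `nearest_row` are hypothetical. Tree-based models are unaffected because they split on one feature at a time, so rescaling a feature only moves the thresholds.

```python
import numpy as np

# Hypothetical customers: columns are (age in years, income in euros).
X_train = np.array([
    [25.0, 50_000.0],
    [60.0, 50_500.0],
    [40.0, 90_000.0],
    [35.0, 20_000.0],
])
query = np.array([26.0, 50_400.0])  # almost the same age as row 0

def nearest_row(points: np.ndarray, target: np.ndarray) -> int:
    """Index of the row with the smallest Euclidean distance to target."""
    distances = np.linalg.norm(points - target, axis=1)
    return int(np.argmin(distances))

# Raw units: income gaps (hundreds of euros) swamp age gaps (tens of years).
raw_nearest = nearest_row(X_train, query)

# Standardise both features with the training mean and standard deviation.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
scaled_nearest = nearest_row((X_train - mean) / std, (query - mean) / std)

print(raw_nearest, scaled_nearest)  # → 1 0
```

In raw units the 60-year-old in row 1 is the closest match to a 26-year-old, purely because their incomes differ by 100 euros. After scaling, row 0 wins.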

Two main scalers:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

StandardScaler rescales each feature to zero mean and unit variance. It is the default choice for most algorithms.

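from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)  # learn min and max from the training rows only
X_test = scaler.transform(X_test)  # reuse them; test values may fall outside [0, 1]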

MinMaxScaler rescales each feature to a fixed range, [0, 1] by default. Use it when you need bounded output (image pixels, sigmoid input).

Critical rule: fit on training data only, then transform both train and test. Fitting on test data is a form of leakage and will silently lie about your score.
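
A tiny numeric illustration of the rule, using made-up values: the same test row is scaled once with statistics learned from the training rows only, and once with a scaler that was wrongly fitted on train and test together. The array contents and variable names are assumptions chosen to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[10.0]])

# Correct: mean and standard deviation come from the training rows only.
scaler = StandardScaler().fit(X_train)
test_scaled = scaler.transform(X_test)[0, 0]

# Leaky: the test row shifts the mean and inflates the standard deviation.
leaky_scaler = StandardScaler().fit(np.vstack([X_train, X_test]))
test_scaled_leaky = leaky_scaler.transform(X_test)[0, 0]

print(scaler.mean_[0], leaky_scaler.mean_[0])              # → 2.0 4.0
print(round(test_scaled, 2), round(test_scaled_leaky, 2))  # → 9.8 1.7
```

With the honest scaler the test row shows up as the extreme value it really is (about 9.8 standard deviations from the training mean). The leaky scaler has already absorbed it and makes it look ordinary.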

scikit-learn’s Pipeline keeps preprocessing and model together, so they always get the same treatment:

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

num_cols = ['age', 'income']
cat_cols = ['country', 'product']

preprocess = ColumnTransformer([
    ('num', StandardScaler(), num_cols),
    # handle_unknown='ignore' encodes categories unseen during fit as all zeros instead of raising
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
])
pipe = Pipeline([
    ('prep', preprocess),
    ('model', LogisticRegression()),
])
pipe.fit(X_train, y_train)  # one line, all preprocessing included

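This is the best practice: once preprocessing lives inside the pipeline, it is refit on the training part of every split and reapplied unchanged at prediction time, so it is much harder to leak or forget a transformation in production.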
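
One place where this pays off is cross-validation, because the whole pipeline is refit inside each fold. The sketch below runs on synthetic data: the generated columns, the rule used to create `y` and the `new_customer` row are all made up for illustration. Setting `handle_unknown='ignore'` is an assumption added here so that a category never seen in training does not raise an error.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic customers (hypothetical data, for illustration only)
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    'age': rng.integers(18, 70, size=n),
    'income': rng.normal(40_000, 12_000, size=n),
    'country': rng.choice(['FR', 'US', 'DE'], size=n),
    'product': rng.choice(['basic', 'pro'], size=n),
})
y = (X['income'] + rng.normal(0, 5_000, size=n) > 40_000).astype(int)

preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'income']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['country', 'product']),
])
pipe = Pipeline([('prep', preprocess), ('model', LogisticRegression())])

# Scaler and encoder are refit on the training part of each of the 5 folds.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())

# At prediction time the fitted preprocessing is replayed on raw input.
pipe.fit(X, y)
new_customer = pd.DataFrame({
    'age': [30], 'income': [55_000.0], 'country': ['JP'], 'product': ['pro'],
})
print(pipe.predict(new_customer))
```

The caller passes raw columns in both cases and never handles a scaled or encoded matrix directly.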

  • y is the target (regression vs classification); X is the feature matrix.
  • Leakage is the most expensive bug — always ask “would I know this before y?”
  • Categoricals: one-hot for nominal, ordinal when order is real, target encoding for high cardinality.
  • Scale numerics for linear / SVM / NN; trees don’t need it.
  • Pipelines beat manual preprocessing because they prevent leakage.

Next: Train / test split & training — splitting the data honestly and fitting the model.