Features: target, encoding, scaling

From DataFrame to X, y

Every supervised model expects two things:

y — the target (what you want to predict), a single column.
X — the features (what you feed in), a 2D matrix of numbers.

flowchart LR
  DF["Cleaned DataFrame<br/>(rows = examples<br/>cols = anything)"] --> Y["Target y<br/>1 column"]
  DF --> X["Features X<br/>only numbers"]
  classDef src fill:#ddd6fe,stroke:#7c3aed,color:#1e1b4b
  classDef target fill:#fde68a,stroke:#c2410c,color:#451a03
  classDef feat fill:#d1fae5,stroke:#047857,color:#064e3b
  DF:::src
  Y:::target
  X:::feat

Splitting a clean DataFrame into target and features — the moment ML really begins.

Picking the target y

Three questions, one decision:

What does the business need to predict?
Is it a number or a category?
- Number → regression (price, temperature, sales).
- Category → classification (spam / not spam, churn / stay).
Is the data available for past examples? If you can’t get historical y, you can’t train.

y = df['price']     # regression: continuous target
y = df['churned']   # classification: binary target
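As a quick sanity check on the "number or category?" question, scikit-learn's `type_of_target` utility reports how it would interpret a target column. The sketch below runs it on a made-up DataFrame; the values are hypothetical and only the column names come from the snippet above.

```python
import pandas as pd
from sklearn.utils.multiclass import type_of_target

# Hypothetical toy data; only the column names mirror the lesson.
df = pd.DataFrame({
    'price':   [199.5, 249.0, 310.25, 180.75],
    'churned': [0, 1, 0, 1],
})

price_kind = type_of_target(df['price'].to_numpy())
churn_kind = type_of_target(df['churned'].to_numpy())

print(price_kind)  # 'continuous' suggests regression
print(churn_kind)  # 'binary' suggests classification
```

Treat the result as a hint rather than a decision: a price stored as whole numbers would be reported as multiclass, even though the business question is still a regression.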

Picking the features X — what to include, what to drop

The temptation is to throw every column at the model. Resist:

Keep	Drop
Columns causally linked to `y`	IDs, timestamps with no signal
Columns available at prediction time	“Future-leaking” columns
Columns with reasonable missingness	Columns 80% empty with no signal
Diverse signals (not duplicates)	Two columns that are 0.99 correlated

Leakage warning: if a feature is only known after y happens (like “did the customer call support after churning?”), the model will score almost perfectly in training and validation, then fail in production, where that value does not exist yet. Always ask “would I have this value before knowing y?”

X = df.drop(columns=['price', 'customer_id', 'data_load_date'])
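Two rows of the Keep / Drop table can be screened mechanically. The sketch below flags mostly-empty columns and near-duplicate numeric columns on synthetic data; the column names and the 0.8 and 0.99 thresholds are assumptions for illustration. Leakage and causal relevance still need human judgement.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
area = rng.normal(100, 20, n)

# Hypothetical housing-style data built for this example.
df = pd.DataFrame({
    'area_m2': area,
    'area_sqft': area * 10.764,                                  # same information, other unit
    'rooms': rng.integers(1, 6, n).astype(float),
    'notes_score': np.where(np.arange(n) % 10 == 0, 1.0, np.nan),  # 90% missing
})

# 1) Columns that are mostly empty (the 0.8 threshold is an assumption).
missing_share = df.isna().mean()
mostly_empty = missing_share[missing_share > 0.8].index.tolist()

# 2) Pairs of numeric columns that are almost perfectly correlated.
corr = df.drop(columns=mostly_empty).corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep each pair once
near_duplicates = [
    (row, col)
    for row in upper.index
    for col in upper.columns
    if pd.notna(upper.loc[row, col]) and upper.loc[row, col] > 0.99
]

print(mostly_empty)
print(near_duplicates)
```

For each flagged pair, keeping one column is usually enough; which one to keep is a domain decision.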

Encoding categorical variables

A model only eats numbers. Three encoding strategies:

flowchart TB
  C["Categorical column"]
  C --> O["Ordinal?<br/>(low &lt; medium &lt; high)"]
  C --> N["Nominal?<br/>(red, blue, green)"]
  C --> H["High cardinality?<br/>(10k+ values)"]
  O --> OE["Ordinal encoding<br/>{low: 0, medium: 1, high: 2}"]
  N --> OH["One-hot encoding<br/>3 binary columns"]
  H --> TE["Target encoding<br/>or embedding"]

Three flavours of categorical encoding — pick by the kind of column.

One-hot encoding (the safe default)

import pandas as pd
X = pd.get_dummies(X, columns=['country', 'product'])
# Adds country_FR, country_US, ... binary columns

Use for nominal variables (no order) with a modest number of distinct values; fewer than about 50 is a common rule of thumb.

Ordinal encoding

from sklearn.preprocessing import OrdinalEncoder
order = [['low', 'medium', 'high']]
X['priority'] = OrdinalEncoder(categories=order).fit_transform(X[['priority']])

Use only when the order is real (T-shirt sizes, satisfaction scores).

Target encoding (high cardinality)

Replace each category with the mean of y for that category. Powerful, but easy to overfit — always do it inside cross-validation.

from category_encoders import TargetEncoder
# X and y should be training rows only here; fitting on the full dataset leaks y into X.
X['zipcode'] = TargetEncoder().fit_transform(X['zipcode'], y)
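To make "inside cross-validation" concrete, here is a minimal sketch of out-of-fold target encoding written by hand with `KFold`. The helper name `out_of_fold_target_encode`, the zip codes and the labels are all hypothetical; this is one way to do it, not the method used by any particular library.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def out_of_fold_target_encode(cat, y, n_splits=5, seed=0):
    """Encode each row with category means computed on the other folds only."""
    cat = pd.Series(cat).reset_index(drop=True)
    y = pd.Series(y, dtype=float).reset_index(drop=True)
    encoded = pd.Series(np.nan, index=cat.index, dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in folds.split(cat):
        # Category means and global mean from rows outside the current fold.
        means = y.iloc[fit_idx].groupby(cat.iloc[fit_idx]).mean()
        prior = y.iloc[fit_idx].mean()
        # Categories absent from the fitting rows fall back to the global mean.
        encoded.iloc[enc_idx] = (
            cat.iloc[enc_idx].map(means).fillna(prior).to_numpy()
        )
    return encoded

# Hypothetical data for illustration.
zipcode = ['75001'] * 4 + ['69002'] * 4 + ['13001'] * 2
churned = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0]

encoded = out_of_fold_target_encode(zipcode, churned)
print(pd.DataFrame({'zipcode': zipcode, 'churned': churned, 'zipcode_te': encoded}))
```

Because a row's own label never enters its encoding, the feature cannot simply memorise the answer for that row.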

The same column under the three encodings, side by side

Suppose a dataset has a colour column with three values (red, blue, green) and a binary target bought indicating whether the customer purchased:

Row	`colour`	`bought` (y)
1	red	1
2	red	1
3	red	0
4	blue	1
5	blue	0
6	green	0
7	green	0

Each encoding turns this single column into a different numeric representation:

Row	One-hot encoding (`c_red`, `c_blue`, `c_green`)	Ordinal encoding (`colour_ord`)	Target encoding, mean of y per category (`colour_te`)
1 (red)	`1, 0, 0`	`0`	`2/3 ≈ 0.67`
2 (red)	`1, 0, 0`	`0`	`0.67`
3 (red)	`1, 0, 0`	`0`	`0.67`
4 (blue)	`0, 1, 0`	`1`	`1/2 = 0.50`
5 (blue)	`0, 1, 0`	`1`	`0.50`
6 (green)	`0, 0, 1`	`2`	`0/2 = 0.00`
7 (green)	`0, 0, 1`	`2`	`0.00`

Property	One-hot	Ordinal	Target encoding
Number of columns produced	3 (= cardinality)	1	1
Introduces a fake order?	No	Yes — model will think `green > blue > red`	No
Carries signal about `y`?	No (the model has to learn it)	No	Yes — already pre-aggregated
Risk of overfit / leakage?	Low	Low	High — must be computed inside cross-validation, not on the full dataset
Behaviour at high cardinality (10,000 zip codes)	Disaster (10,000 columns)	OK but meaningless	Best fit — stays 1 column

import pandas as pd

df = pd.DataFrame({
    'colour':  ['red','red','red','blue','blue','green','green'],
    'bought':  [1, 1, 0, 1, 0, 0, 0],
})

# 1) one-hot (prefix and dtype chosen to match the table above)
print(pd.get_dummies(df, columns=['colour'], prefix='c', dtype=int))

# 2) ordinal
df['colour_ord'] = df['colour'].map({'red': 0, 'blue': 1, 'green': 2})
print(df)

# 3) target encoding (per-category mean on the full data; simplified and leaky, for illustration only)
te = df.groupby('colour')['bought'].mean()
df['colour_te'] = df['colour'].map(te)
print(df)

Three takeaways:

One-hot is the safe default: no order is implied, easy to interpret. The price is the column explosion at high cardinality.
Ordinal only works if the order is real. Applied to red / blue / green, the model gets a false signal (green > red) that it will exploit silently.
Target encoding leaks if used naïvely. Computing the mean of y on the same rows the model trains on lets it peek at the answer. In practice, compute each row's encoding from other folds, and smooth category means towards the global mean (category_encoders.TargetEncoder applies smoothing internally) so that rare categories do not memorise their own targets.
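The smoothing idea can be sketched on the colour example. The snippet below uses an m-estimate, one common form of smoothing in which each category mean is pulled towards the global mean with strength `m`. Both the formula choice and `m = 2` are assumptions for illustration, not necessarily what category_encoders computes.

```python
import pandas as pd

df = pd.DataFrame({
    'colour': ['red', 'red', 'red', 'blue', 'blue', 'green', 'green'],
    'bought': [1, 1, 0, 1, 0, 0, 0],
})

m = 2.0                       # smoothing strength (hypothetical value)
prior = df['bought'].mean()   # global mean of y, here 3/7

stats = df.groupby('colour')['bought'].agg(['sum', 'count'])
raw = stats['sum'] / stats['count']
# m-estimate: behaves like adding m pseudo-rows that carry the global mean.
smoothed = (stats['sum'] + m * prior) / (stats['count'] + m)

print(pd.DataFrame({'raw': raw, 'smoothed': smoothed}))
```

For red this gives (2 + 2 × 3/7) / (3 + 2) = 4/7 instead of 2/3, and green moves from 0 to 3/14. Smoothing reduces overfitting on small categories, but it does not replace out-of-fold computation.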

Scaling numeric features

Some models care a lot about scale, others not at all.

Algorithm	Cares about scale?
Linear / logistic regression	Yes (especially with regularisation or gradient-based solvers)
SVM, k-NN, k-means	Yes (a lot)
Neural networks	Yes
Decision trees / random forest	No
Gradient boosting (XGBoost, LightGBM)	No
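A small numeric sketch shows why distance-based models sit in the "a lot" row. The three customers below are invented; the point is only that the feature with the largest raw magnitude dominates Euclidean distance until the columns are standardised.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: [age in years, income in euros].
X = np.array([
    [25.0, 50_000.0],   # A
    [60.0, 50_500.0],   # B: much older than A, similar income
    [26.0, 50_800.0],   # C: same age group as A, slightly higher income
])

def dist(a, b):
    """Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))

# Raw values: the income column dominates because its numbers are larger.
raw_ab, raw_ac = dist(X[0], X[1]), dist(X[0], X[2])

# Standardised values: both columns contribute on a comparable scale.
Z = StandardScaler().fit_transform(X)
std_ab, std_ac = dist(Z[0], Z[1]), dist(Z[0], Z[2])

print('raw:         ', raw_ab, raw_ac)
print('standardised:', std_ab, std_ac)
```

On raw values B looks like A's nearest neighbour despite the 35-year age gap; after standardisation C is closer. A k-NN model would therefore change its answer depending on the units the columns happen to use.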

Two main scalers:

Standardisation — mean 0, std 1

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)

Default choice for most algorithms.
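Under the hood, standardisation is z = (x - mean) / std, with both statistics taken from the training set. The sketch below checks this by hand against `StandardScaler` on a tiny made-up array.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical [age, income] rows.
X_train = np.array([[20.0, 30_000.0], [30.0, 45_000.0], [40.0, 60_000.0]])
X_test = np.array([[50.0, 75_000.0]])

scaler = StandardScaler().fit(X_train)

mean = X_train.mean(axis=0)
std = X_train.std(axis=0)   # population std (ddof=0), as StandardScaler uses

manual_train = (X_train - mean) / std
manual_test = (X_test - mean) / std   # test rows reuse the training statistics

print(np.allclose(scaler.transform(X_train), manual_train))
print(np.allclose(scaler.transform(X_test), manual_test))
```

The test row is transformed with the training mean and standard deviation, so it need not end up with mean 0 or standard deviation 1 itself.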

Min-Max — squashed to [0, 1]

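from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)   # learn min and max from the training set
X_test  = scaler.transform(X_test)        # reuse them on the test set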

Use when downstream code expects values in a fixed range (image pixels, inputs to a network with sigmoid units).

Critical rule: fit on training data only, then transform both train and test. Fitting on test data is a form of leakage and will silently inflate your score.
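One visible consequence of this rule: with a scaler fitted on the training set only, test values can land outside [0, 1]. That is expected behaviour, not a bug, as this sketch with made-up numbers suggests.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data.
X_train = np.array([[10.0], [20.0], [30.0]])
X_test = np.array([[5.0], [25.0], [40.0]])

# The scaler learns min=10 and max=30 from the training rows only.
scaler = MinMaxScaler().fit(X_train)

test_scaled = scaler.transform(X_test)
print(test_scaled.ravel())   # values below 0 and above 1 are possible
```

Refitting the scaler on the test set to force everything back into [0, 1] would use information the model should never have seen.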

The whole transformation in one Pipeline

scikit-learn’s Pipeline keeps preprocessing and model together, so they always get the same treatment:

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

num_cols = ['age', 'income']
cat_cols = ['country', 'product']

preprocess = ColumnTransformer([
    ('num', StandardScaler(),  num_cols),
    # handle_unknown='ignore' encodes categories unseen during fit as all zeros
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
])

pipe = Pipeline([
    ('prep', preprocess),
    ('model', LogisticRegression()),
])

pipe.fit(X_train, y_train)         # one line — all preprocessing included

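This is the recommended practice: with preprocessing inside the pipeline, every transformation is fitted on the training data only and replayed identically at prediction time, so it is much harder to leak test statistics or forget a step in production.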
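The benefit shows up most clearly under cross-validation, where the whole pipeline is refitted on each training fold. The sketch below is self-contained and uses synthetic data; the columns follow the example above, while the values, the target rule and the new customer are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(42)
n = 300
X = pd.DataFrame({
    'age': rng.integers(18, 80, n).astype(float),
    'income': rng.normal(40_000, 12_000, n),
    'country': rng.choice(['FR', 'US', 'DE'], n),
    'product': rng.choice(['basic', 'pro'], n),
})
# Synthetic target loosely tied to income and product, purely for illustration.
signal = (X['income'] - 40_000) / 12_000 + (X['product'] == 'pro') * 1.0
y = (signal + rng.normal(0, 1, n) > 0.5).astype(int)

pipe = Pipeline([
    ('prep', ColumnTransformer([
        ('num', StandardScaler(), ['age', 'income']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['country', 'product']),
    ])),
    ('model', LogisticRegression(max_iter=1000)),
])

# Scaler and encoder are refitted inside each fold, on that fold's training rows only.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())

# At prediction time the same fitted steps are replayed, even for an unseen country.
pipe.fit(X, y)
new_customer = pd.DataFrame(
    [{'age': 35.0, 'income': 52_000.0, 'country': 'JP', 'product': 'pro'}]
)
pred = pipe.predict(new_customer)
print(pred)
```

Scaling the full dataset before calling `cross_val_score` would let each validation fold influence the scaler; passing the pipeline avoids that without extra bookkeeping.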

Key takeaways

y is the target (regression vs classification); X is the feature matrix.
Leakage is the most expensive bug — always ask “would I know this before y?”
Categoricals: one-hot for nominal, ordinal when order is real, target encoding for high cardinality.
Scale numerics for linear / SVM / NN; trees don’t need it.
Pipelines beat manual preprocessing because they keep every fitted transformation tied to the training data, which guards against preprocessing leakage.

Next: Train / test split & training — splitting the data honestly and fitting the model.