Features: target, encoding, scaling
From DataFrame to X, y
Every supervised model expects two things:
- `y`: the target (what you want to predict), a single column.
- `X`: the features (what you feed in), a 2D matrix of numbers.
```mermaid
flowchart LR
    DF["Cleaned DataFrame<br/>(rows = examples<br/>cols = anything)"] --> Y["Target y<br/>1 column"]
    DF --> X["Features X<br/>only numbers"]
    classDef src fill:#ddd6fe,stroke:#7c3aed,color:#1e1b4b
    classDef target fill:#fde68a,stroke:#c2410c,color:#451a03
    classDef feat fill:#d1fae5,stroke:#047857,color:#064e3b
    DF:::src
    Y:::target
    X:::feat
```
Picking the target y
Three questions, one decision:
- What does the business need to predict?
- Is it a number or a category?
  - Number → regression (price, temperature, sales).
  - Category → classification (spam / not spam, churn / stay).
- Is the data available for past examples? If you can’t get historical `y`, you can’t train.
```python
y = df['price']    # regression: continuous target
y = df['churned']  # classification: binary target
```

Picking the features X: what to include, what to drop
The temptation is to throw every column at the model. Resist:
| Keep | Drop |
|---|---|
| Columns causally linked to `y` | IDs, timestamps with no signal |
| Columns available at prediction time | “Future-leaking” columns |
| Columns with reasonable missingness | Columns 80% empty with no signal |
| Diverse signals (not duplicates) | Two columns that are 0.99 correlated |
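Two rows of this table can be checked mechanically. The sketch below is one possible screen, not a standard recipe: `screen_features`, its thresholds, and the demo columns are all hypothetical names chosen for illustration. It lists columns that are mostly empty and columns that are near-duplicates of an earlier numeric column.

```python
import numpy as np
import pandas as pd

def screen_features(X, max_missing=0.8, max_corr=0.99):
    """Return (mostly-empty columns, near-duplicate numeric columns)."""
    # Columns whose share of missing values reaches the threshold.
    too_empty = [c for c in X.columns if X[c].isna().mean() >= max_missing]

    # Absolute pairwise correlations between the remaining numeric columns.
    numeric = X.drop(columns=too_empty).select_dtypes(include="number")
    corr = numeric.corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    near_duplicates = [c for c in upper.columns if (upper[c] >= max_corr).any()]
    return too_empty, near_duplicates

# Made-up data: price_eur is a rescaled copy of price_usd, notes is mostly empty.
X_demo = pd.DataFrame({
    "price_usd": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "price_eur": [9.0, 18.0, 27.0, 36.0, 45.0, 54.0],
    "age": [25, 52, 31, 47, 38, 29],
    "notes": [np.nan, np.nan, np.nan, np.nan, np.nan, 1.0],
})
too_empty, near_duplicates = screen_features(X_demo)
print(too_empty, near_duplicates)
```

The output is a list of candidates to review, not an automatic drop list: a mostly-empty column can still carry signal, and which column of a correlated pair to keep is a judgment call.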
Leakage warning: if a feature is only known after `y` happens (like “did the customer call support after churning?”), you’ll get a near-perfect score in training and a far worse one in production. Always ask “would I have this value before knowing `y`?”
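A rough first check for leakage is to look for features that track the target almost perfectly. The sketch below assumes a numeric or binary target; `flag_suspicious_features`, the 0.95 threshold, and the demo columns are hypothetical. A flagged column is only a hint: the real test remains the timing question above.

```python
import pandas as pd

def flag_suspicious_features(X, y, threshold=0.95):
    """List numeric columns whose absolute correlation with y exceeds the threshold."""
    corr_with_target = X.select_dtypes(include="number").corrwith(y).abs()
    return sorted(corr_with_target[corr_with_target > threshold].index)

# Made-up churn data: support_call_after_churn is only known once y has happened.
df_demo = pd.DataFrame({
    "monthly_spend": [20, 35, 50, 25, 40, 30],
    "support_call_after_churn": [1, 0, 1, 0, 0, 1],
    "churned": [1, 0, 1, 0, 0, 1],
})
y_demo = df_demo["churned"]
X_demo = df_demo.drop(columns=["churned"])

suspects = flag_suspicious_features(X_demo, y_demo)
print(suspects)
```

This will not catch every leak (a leaking feature can be only moderately correlated with the target), so it complements the manual review rather than replacing it.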
```python
X = df.drop(columns=['price', 'customer_id', 'data_load_date'])
```

Encoding categorical variables
A model only eats numbers. Three encoding strategies:
```mermaid
flowchart TB
    C["Categorical column"]
    C --> O["Ordinal?<br/>(low < medium < high)"]
    C --> N["Nominal?<br/>(red, blue, green)"]
    C --> H["High cardinality?<br/>(10k+ values)"]
    O --> OE["Ordinal encoding<br/>{low: 0, medium: 1, high: 2}"]
    N --> OH["One-hot encoding<br/>3 binary columns"]
    H --> TE["Target encoding<br/>or embedding"]
```
One-hot encoding (the safe default)
```python
import pandas as pd

X = pd.get_dummies(X, columns=['country', 'product'])
# Adds country_FR, country_US, ... binary columns
```

Use for nominal variables (no order) with fewer than 50 distinct values.
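One practical catch: `pd.get_dummies` only creates columns for the categories present in the frame it is given, so separately encoded training and test frames can end up with different columns. A minimal sketch of one way to handle this, using made-up `train` and `test` frames, is to reindex the test frame onto the training columns:

```python
import pandas as pd

train = pd.DataFrame({"country": ["FR", "US", "DE"]})
test = pd.DataFrame({"country": ["US", "JP"]})  # "JP" never appears in train

train_encoded = pd.get_dummies(train, columns=["country"])
test_encoded = pd.get_dummies(test, columns=["country"])

# Force the test frame onto the training columns: categories missing from
# test become all-zero columns, categories unseen in training are dropped.
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(list(train_encoded.columns))
print(test_aligned)
```

scikit-learn's `OneHotEncoder(handle_unknown="ignore")` gives comparable behaviour when the encoder is fitted on the training data and reused on the test data.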
Ordinal encoding
```python
from sklearn.preprocessing import OrdinalEncoder

order = [['low', 'medium', 'high']]  # one list of categories per encoded column
# fit_transform returns a 2D array of shape (n_rows, 1); ravel() flattens it
X['priority'] = OrdinalEncoder(categories=order).fit_transform(X[['priority']]).ravel()
```

Use only when the order is real (T-shirt sizes, satisfaction scores).
Target encoding (high cardinality)
Replace each category with the mean of `y` for that category. Powerful, but easy to overfit: always do it inside cross-validation.
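```python
from category_encoders import TargetEncoder

# Illustration only: this fits the encoder on every row at once.
# In a real workflow, fit it on the training fold only.
encoder = TargetEncoder(cols=['zipcode'])
X['zipcode'] = encoder.fit_transform(X[['zipcode']], y)['zipcode']
```

The same column under the three encodings, side by side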
Suppose a dataset has a colour column with three values (red, blue, green) and a binary target bought indicating whether the customer purchased:
| Row | colour | bought (y) |
|---|---|---|
| 1 | red | 1 |
| 2 | red | 1 |
| 3 | red | 0 |
| 4 | blue | 1 |
| 5 | blue | 0 |
| 6 | green | 0 |
| 7 | green | 0 |
Each encoding turns this single column into a different numeric representation:
| Row | One-hot encoding (`c_red`, `c_blue`, `c_green`) | Ordinal encoding (`colour_ord`) | Target encoding: mean of y per category (`colour_te`) |
|---|---|---|---|
| 1 (red) | 1, 0, 0 | 0 | 2/3 ≈ 0.67 |
| 2 (red) | 1, 0, 0 | 0 | 0.67 |
| 3 (red) | 1, 0, 0 | 0 | 0.67 |
| 4 (blue) | 0, 1, 0 | 1 | 1/2 = 0.50 |
| 5 (blue) | 0, 1, 0 | 1 | 0.50 |
| 6 (green) | 0, 0, 1 | 2 | 0/2 = 0.00 |
| 7 (green) | 0, 0, 1 | 2 | 0.00 |
| Property | One-hot | Ordinal | Target encoding |
|---|---|---|---|
| Number of columns produced | 3 (= cardinality) | 1 | 1 |
| Introduces a fake order? | No | Yes — model will think green > blue > red | No |
| Carries signal about `y`? | No (the model has to learn it) | No | Yes, already pre-aggregated |
| Risk of overfit / leakage? | Low | Low | High — must be computed inside cross-validation, not on the full dataset |
| Behaviour at high cardinality (10,000 zip codes) | Disaster (10,000 columns) | OK but meaningless | Best fit — stays 1 column |
```python
import pandas as pd

df = pd.DataFrame({
    'colour': ['red', 'red', 'red', 'blue', 'blue', 'green', 'green'],
    'bought': [1, 1, 0, 1, 0, 0, 0],
})

# 1) one-hot
print(pd.get_dummies(df, columns=['colour']))

# 2) ordinal
df['colour_ord'] = df['colour'].map({'red': 0, 'blue': 1, 'green': 2})
print(df)

# 3) target encoding (computed by category mean, simplified)
te = df.groupby('colour')['bought'].mean()
df['colour_te'] = df['colour'].map(te)
print(df)
```

Three takeaways:
- One-hot is the safe default: no order is implied, easy to interpret. The price is the column explosion at high cardinality.
- Ordinal only works if the order is real. Applied to red / blue / green, the model gets a false signal (green > red) that it will exploit silently.
- Target encoding leaks if used naïvely. Computing the mean of `y` on the same rows the model trains on lets it peek at the answer. In production it must be computed on a separate fold (or smoothed with a regulariser, as `category_encoders.TargetEncoder` does internally).
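To make the "separate fold" idea concrete, here is a minimal sketch of out-of-fold target encoding on the toy colour data. It is a hand-rolled illustration, not how `category_encoders` works internally; `target_encode_oof` and its parameters are hypothetical names. Each row is encoded with category means computed on the other folds only, so its own `bought` value never feeds its own feature.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def target_encode_oof(categories, y, n_splits=3, seed=0):
    """Encode each row with the mean of y computed on the other folds only."""
    encoded = pd.Series(np.nan, index=categories.index, dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in folds.split(categories):
        # Category means come from the rows outside the fold being encoded.
        means = y.iloc[fit_idx].groupby(categories.iloc[fit_idx]).mean()
        global_mean = y.iloc[fit_idx].mean()
        # Categories absent from the fitting rows fall back to the global mean.
        encoded.iloc[enc_idx] = (
            categories.iloc[enc_idx].map(means).fillna(global_mean).to_numpy()
        )
    return encoded

df = pd.DataFrame({
    "colour": ["red", "red", "red", "blue", "blue", "green", "green"],
    "bought": [1, 1, 0, 1, 0, 0, 0],
})
df["colour_te_oof"] = target_encode_oof(df["colour"], df["bought"])
print(df)
```

With only seven rows the encoded values are noisy, which is why real encoders usually add smoothing towards the global mean. The number of folds is a design choice: very small folds (leave-one-out in the extreme) make each encoded value depend strongly on the row's own target in the opposite direction, which is a subtler leak of its own.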
Scaling numeric features
Some models care a lot about scale, others not at all.
| Algorithm | Cares about scale? |
|---|---|
| Linear / logistic regression | Yes (especially when regularised or trained by gradient descent) |
| SVM, k-NN, k-means | Yes (a lot) |
| Neural networks | Yes |
| Decision trees / random forest | No |
| Gradient boosting (XGBoost, LightGBM) | No |
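To see why distance-based models care, consider three made-up customers described by age and income. The sketch below (all values invented for illustration) compares Euclidean distances before and after standardisation: on raw values the income column, with its much larger numbers, decides almost everything.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up customers: columns are age (years) and income (euros per year).
X = np.array([
    [25, 50_000],   # customer A
    [60, 50_500],   # customer B: very different age, similar income
    [26, 58_000],   # customer C: similar age, different income
], dtype=float)

def distances_from_first(points):
    """Euclidean distance from the first row to each of the other rows."""
    return np.linalg.norm(points[1:] - points[0], axis=1)

raw_b, raw_c = distances_from_first(X)
scaled_b, scaled_c = distances_from_first(StandardScaler().fit_transform(X))

print(raw_b, raw_c)        # raw values: the income gap dominates, C looks far away
print(scaled_b, scaled_c)  # standardised: both distances are of similar size
```

A k-NN model trained on the raw values would effectively ignore age. Tree-based models split on one feature at a time, so the same rescaling leaves their decisions unchanged.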
Two main scalers:
Standardisation — mean 0, std 1
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```

Default choice for most algorithms.
Min-Max — squashed to [0, 1]
```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)  # reuse the min and max learned on train
```

Use when you need values in a bounded range (image pixels, sigmoid input).
Critical rule: `fit` on training data only, then `transform` both train and test. Fitting on test data is a form of leakage and will silently lie about your score.
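The rule is easy to verify on synthetic data. In the sketch below (random numbers, for illustration only), the scaler's statistics come from the training split alone, so the transformed training set is centred exactly while the transformed test set is only approximately centred. That small mismatch is expected and honest.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 2))  # synthetic features

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean_ and scale_ from train
X_test_scaled = scaler.transform(X_test)        # reuses the training statistics

print(scaler.mean_)                  # equals X_train.mean(axis=0)
print(X_train_scaled.mean(axis=0))   # approximately 0 by construction
print(X_test_scaled.mean(axis=0))    # close to 0, but not exactly
```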
The whole transformation in one Pipeline
scikit-learn’s `Pipeline` keeps preprocessing and model together, so they always get the same treatment:
```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

num_cols = ['age', 'income']
cat_cols = ['country', 'product']

preprocess = ColumnTransformer([
    ('num', StandardScaler(), num_cols),
    # handle_unknown='ignore' encodes categories unseen during fit as all zeros
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
])

pipe = Pipeline([
    ('prep', preprocess),
    ('model', LogisticRegression()),
])

pipe.fit(X_train, y_train)  # one line, all preprocessing included
```

This is the best practice: once preprocessing lives in a pipeline, it is much harder to leak or forget a transformation in production.
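The benefit is most visible under cross-validation. The sketch below uses a synthetic DataFrame (random values and a toy target derived from income, both invented for illustration) and passes the whole pipeline to `cross_val_score`, so the scaler and encoder are refitted on each fold's training rows and never see that fold's validation rows.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "income": rng.normal(40_000, 12_000, size=n),
    "country": rng.choice(["FR", "US", "DE"], size=n),
})
y = (X["income"] > 40_000).astype(int)  # toy target derived from income

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["country"]),
])
pipe = Pipeline([("prep", preprocess), ("model", LogisticRegression())])

# Each fold refits the scaler and encoder on that fold's training rows only.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Scaling the full dataset first and cross-validating afterwards would let every fold's statistics include its own validation rows, which is the leak the pipeline avoids.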
Key takeaways
- `y` is the target (regression vs classification); `X` is the feature matrix.
- Leakage is the most expensive bug: always ask “would I know this before `y`?”
- Categoricals: one-hot for nominal, ordinal when order is real, target encoding for high cardinality.
- Scale numerics for linear / SVM / NN; trees don’t need it.
- Pipelines beat manual preprocessing because they prevent leakage.
Next: Train / test split & training — splitting the data honestly and fitting the model.