Categorical encoding remains one of the most crucial preprocessing steps in machine learning, yet choosing the right method often feels like guesswork. This guide provides specific model recommendations, concrete examples, and practical datasets to demonstrate when and why each encoding method works best.
Most machine learning algorithms require numerical input, but real-world data contains categorical variables like product types, geographical regions, or customer segments. The encoding method you choose can dramatically impact model performance, sometimes by 10-20% or more.
These methods convert each category to a single numerical value, making them memory-efficient and suitable for tree-based models.
Categories are mapped to arbitrary integers without implying order.
Examples:
red → 1, blue → 2, green → 3
London → 0, Paris → 1, Berlin → 2
Specific Models That Excel:
Concrete Example: In the Titanic dataset, embarked ports can be label encoded.
import pandas as pd
from sklearn.datasets import fetch_openml
titanic = fetch_openml('titanic', version=1, as_frame=True)
df = titanic.frame
# Label encoding for 'embarked' column
embarked_mapping = {'S': 0, 'C': 1, 'Q': 2}
df['embarked_label'] = df['embarked'].map(embarked_mapping)
Categories are mapped to integers that reflect meaningful order relationships.
Examples:
Small → 1, Medium → 2, Large → 3, XL → 4
Poor → 1, Fair → 2, Good → 3, Excellent → 4
Specific Models That Leverage Order:
Concrete Example: Adult Census dataset education feature.
from sklearn.datasets import fetch_openml
# Load the Adult Census dataset from OpenML
adult = fetch_openml('adult', version=2, as_frame=True).frame
education_order = {
    'Preschool': 1, '1st-4th': 2, '5th-6th': 3, '7th-8th': 4, '9th': 5,
    '10th': 6, '11th': 7, '12th': 8, 'HS-grad': 9, 'Some-college': 10,
    'Assoc-voc': 11, 'Assoc-acdm': 12, 'Bachelors': 13, 'Masters': 14,
    'Prof-school': 15, 'Doctorate': 16
}
adult['education_ordinal'] = adult['education'].map(education_order)
Categories are replaced with their occurrence frequency in the dataset.
Examples:
Apple → 1000 (appears 1000 times), Orange → 500 (appears 500 times)
Manager → 50, Developer → 200 (job titles by frequency)
Specific Models That Excel:
Concrete Example: In the Titanic dataset, the embarkation port can be frequency encoded, replacing each port with how often it appears.
freq_map = df['embarked'].value_counts().to_dict()
# Example counts on the Kaggle training split: {'S': 644, 'C': 168, 'Q': 77} (the full OpenML dataset gives larger counts)
df['embarked_freq'] = df['embarked'].map(freq_map)
Categories are replaced with the average target value for that category across the training set.
Examples:
Neighbourhood_A → 250000 (average house price £250k), Neighbourhood_B → 180000 (average £180k)
Product_Electronics → 0.15 (15% return rate), Product_Clothing → 0.08 (8% return rate)
Specific Models That Benefit Most:
Concrete Example: In the House Prices dataset (Ames, Iowa), neighbourhood names can be target-encoded with the mean sale price of each neighbourhood.
from sklearn.datasets import fetch_openml
ames = fetch_openml(name="house_prices", as_frame=True)
df = ames.frame
# Target encoding for neighbourhood
neighbourhood_means = df.groupby('Neighborhood')['SalePrice'].mean()
# Results: {'NoRidge': 335263, 'StoneBr': 310499, 'NridgHt': 307757...}
df['neighborhood_target_enc'] = df['Neighborhood'].map(neighbourhood_means)
Critical Warning: Target encoding must use cross-validation or leave-one-out to prevent data leakage. Never calculate target encoding on the same data you’re training on.
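To make this concrete, here is a minimal out-of-fold sketch that builds on the Ames frame loaded above: each row is encoded with neighbourhood means computed from the other folds only. The helper name oof_target_encode, the fold count and the seed are illustrative choices, not a fixed recipe.

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(frame, cat_col, target_col, n_splits=5, seed=42):
    # Each row's encoding is computed from folds that exclude that row
    encoded = pd.Series(np.nan, index=frame.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, valid_idx in kf.split(frame):
        fold_means = frame.iloc[train_idx].groupby(cat_col)[target_col].mean()
        encoded.iloc[valid_idx] = frame.iloc[valid_idx][cat_col].map(fold_means).values
    # Categories unseen in a training fold fall back to the global mean
    return encoded.fillna(frame[target_col].mean())

df['neighborhood_target_enc_oof'] = oof_target_encode(df, 'Neighborhood', 'SalePrice')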
These methods convert categories to binary representations, offering a middle ground between single values and full one-hot encoding.
Categories are converted to binary representation, like computer bits.
Examples:
Red → [0,0,1], Blue → [0,1,0], Green → [0,1,1], Yellow → [1,0,0] (4 categories using 3 bits)
Country_1 → [0,0,0,0,1], Country_2 → [0,0,0,1,0] (32 countries using 5 bits)
Specific Models Where This Excels:
Concrete Example: The Adult Census dataset has ‘native-country’ with 41 unique values.
from category_encoders import BinaryEncoder
encoder = BinaryEncoder(cols=['native-country'])
# 41 countries encoded in 6 binary columns (2^6 = 64 > 41)
adult_encoded = encoder.fit_transform(adult)
Categories are mapped to fixed dimensions using hash functions.
Examples:
Any_Category → [0,1,0,0,1,0,1,0] (8 fixed dimensions regardless of category count)
New_Unknown_Category → [1,0,1,0,0,1,0,1] (handles unseen categories automatically)
Specific Models That Handle This Well:
Concrete Example: Text Classification with product categories where new categories appear regularly.
from category_encoders import HashingEncoder
categories = ['Electronics/Phones', 'Fashion/Shoes', 'Home/Kitchen', 'Unknown_New_Category_X']
encoder = HashingEncoder(cols=['category'], n_components=8)
# Any number of categories → always 8 dimensions
encoded_cats = encoder.fit_transform(pd.DataFrame({'category': categories}))
These methods create multi-dimensional vector representations for categories, allowing for richer feature relationships.
Each category becomes a binary vector with exactly one ‘1’ and all other positions ‘0’.
Examples:
Red → [1,0,0], Blue → [0,1,0], Green → [0,0,1]
London → [1,0,0,0], Paris → [0,1,0,0], Berlin → [0,0,1,0], Rome → [0,0,0,1]
Specific Models That Require This:
Concrete Example: Adult Census dataset workclass feature.
import pandas as pd
# One-hot encoding for workclass
workclass_onehot = pd.get_dummies(adult['workclass'], prefix='workclass')
# Creates columns: workclass_Private, workclass_Self-emp-not-inc, workclass_Self-emp-inc, etc.
Categories are mapped to learned dense vector representations where similar categories have similar vectors.
Examples:
iPhone → [0.2, -0.5, 0.8, 0.1], Samsung → [0.3, -0.4, 0.7, 0.2] (similar phones have similar vectors)
Electronics → [0.1, 0.8, -0.2], Books → [-0.5, 0.3, 0.9] (different categories have different patterns)
Specific Models and Architectures:
Concrete Example: a MovieLens-style recommender, where embeddings for movie categories learn relationships between genres.
import tensorflow as tf
# Categorical vocabulary (e.g. movie genres or product categories)
vocab_size = 1000 # Number of unique categories
embedding_dim = 50
embedding_layer = tf.keras.layers.Embedding(
    input_dim=vocab_size,      # number of unique categories
    output_dim=embedding_dim   # size of each learned vector
)
# Similar categories learn similar embeddings
# 'Action Movies' and 'Adventure Movies' vectors become closer than 'Action' and 'Documentary'
Embeddings can be thought of as “learned one-hot” encoding where instead of fixed, sparse vectors, the model learns dense representations that capture semantic relationships between categories.
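To show how such a layer is used in practice, here is a small, hedged sketch with a made-up five-category vocabulary: categories are integer-encoded first, the embedding layer maps each id to a dense vector, and those vectors are updated by backpropagation once the layer sits inside a model.

import numpy as np
import tensorflow as tf

# Five made-up categories, already integer-encoded as ids 0..4
category_ids = np.array([[0], [1], [2], [3], [4]])

demo_embedding = tf.keras.layers.Embedding(input_dim=5, output_dim=4)
vectors = demo_embedding(category_ids)  # shape (5, 1, 4): one 4-d vector per id
print(vectors.shape)

# Inside a model the embedding output is flattened and fed to dense layers;
# training is what pulls the vectors of similar categories closer together.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=5, output_dim=4),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')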
These methods are designed for specific statistical or domain requirements.
Categories are replaced with the log ratio of positive to negative events, commonly used in credit scoring.
Examples:
Good_Credit → 0.85 (ln(% of goods / % of bads)), Poor_Credit → -1.2
High_Income → 1.1, Low_Income → -0.8 (for loan default prediction)
Specific Models That Excel:
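There is no worked example for WoE above, so here is a minimal, hedged sketch using category_encoders.WOEEncoder on a tiny invented credit table; the column names and values are made up purely for illustration, and the encoder needs a binary target at fit time.

import pandas as pd
from category_encoders import WOEEncoder

# Invented data: 'default' is the binary target (1 = bad loan)
loans = pd.DataFrame({
    'income_band': ['High', 'High', 'Low', 'Low', 'Low', 'Medium', 'Medium', 'High'],
    'default': [0, 0, 1, 1, 0, 0, 1, 0],
})

woe = WOEEncoder(cols=['income_band'])
loans_woe = woe.fit_transform(loans[['income_band']], loans['default'])
print(loans_woe)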
Similar to target encoding but excludes the current row when calculating the mean to prevent leakage.
Examples:
Category_A: calculate mean target excluding this specific row
Category_A → 0.73 (mean of all other Category_A rows), prevents overfitting
Specific Models That Benefit:
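As a hedged sketch of the same idea with category_encoders.LeaveOneOutEncoder (the toy return-rate data is invented): on the training data each row is encoded with the mean target of the other rows in its category.

import pandas as pd
from category_encoders import LeaveOneOutEncoder

returns = pd.DataFrame({
    'product_type': ['Electronics', 'Electronics', 'Clothing', 'Clothing', 'Clothing'],
    'returned': [1, 0, 0, 1, 0],
})

loo = LeaveOneOutEncoder(cols=['product_type'])
# fit_transform on training data excludes each row's own target from its encoding
encoded = loo.fit_transform(returns[['product_type']], returns['returned'])
print(encoded)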
Best Encodings for high-cardinality categorical features: Target encoding, frequency encoding, label encoding
Best Encodings for tree-based models: Label encoding, ordinal encoding, binary encoding
Best Encodings for linear and logistic models: One-hot encoding, WoE encoding, target encoding with regularisation
Best Encodings for neural networks: Embedding encoding, one-hot encoding, target encoding as additional features
Problem: Using target encoding without cross-validation creates data leakage.
Solution: Use category_encoders.TargetEncoder with proper CV or implement leave-one-out encoding.
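One hedged way to get that proper CV is to put the encoder inside a scikit-learn Pipeline so it is refit on each training fold only; the sketch below reuses the Ames frame and column names from the earlier target-encoding example, and Ridge is just a placeholder model.

from category_encoders import TargetEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X = df[['Neighborhood']]
y = df['SalePrice']

pipe = Pipeline([
    ('encode', TargetEncoder(cols=['Neighborhood'])),  # refit on each training fold by cross_val_score
    ('model', Ridge()),
])
print(cross_val_score(pipe, X, y, cv=5, scoring='r2').mean())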
Problem: 1000 categories create 1000 sparse columns. Solution: Use binary encoding (10 columns) or hash encoding (fixed size).
Problem: Treating ‘Electronics → Phones → iPhone’ as unrelated categories. Solution: Use embedding encoding or create hierarchical features.
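A hedged sketch of the "create hierarchical features" option: split a delimited category path into one column per level so related leaves share parent values; the paths and delimiter here are invented for illustration.

import pandas as pd

products = pd.DataFrame({
    'category_path': ['Electronics/Phones/iPhone',
                      'Electronics/Phones/Samsung',
                      'Electronics/Laptops/MacBook',
                      'Home/Kitchen/Blender'],
})

# One column per hierarchy level; any encoder above can then be applied per level
levels = products['category_path'].str.split('/', expand=True)
levels.columns = ['category_l1', 'category_l2', 'category_l3']
products = pd.concat([products, levels], axis=1)
print(products)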
Categorical encoding choice significantly impacts model performance. Start with simple methods (label encoding for tree models, one-hot for linear models), then experiment with advanced techniques like target encoding or embeddings for complex datasets. Always validate your choices with cross-validation and consider the specific characteristics of your data and model architecture.
The key is matching the encoding method to both your model’s assumptions and your data’s structure. When in doubt, try multiple methods and let empirical results guide your decision.
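As a closing, hedged sketch of letting empirical results decide: loop over a few candidate encoders in the same pipeline and compare cross-validated scores. The Ames columns from earlier are assumed, and Ridge is only a placeholder model.

from category_encoders import TargetEncoder, BinaryEncoder, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X = df[['Neighborhood']]
y = df['SalePrice']

candidates = {
    'target': TargetEncoder(cols=['Neighborhood']),
    'binary': BinaryEncoder(cols=['Neighborhood']),
    'one-hot': OneHotEncoder(cols=['Neighborhood']),
}
for name, encoder in candidates.items():
    pipe = Pipeline([('encode', encoder), ('model', Ridge())])
    score = cross_val_score(pipe, X, y, cv=5, scoring='r2').mean()
    print(f'{name}: {score:.3f}')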