Bay Information Systems

A Complete Guide to Categorical Encoding for Machine Learning

Categorical encoding remains one of the most crucial preprocessing steps in machine learning, yet choosing the right method often feels like guesswork. This guide provides specific model recommendations, concrete examples, and practical datasets to demonstrate when and why each encoding method works best.

Understanding the Problem

Most machine learning algorithms require numerical input, but real-world data contains categorical variables like product types, geographical regions, or customer segments. The encoding method you choose can dramatically impact model performance, sometimes by 10-20% or more.

Single-Value Encodings

These methods convert each category to a single numerical value, making them memory-efficient and suitable for tree-based models.

Label/Integer Encoding

Categories are mapped to arbitrary integers without implying order.

Examples:

red → 1, blue → 2, green → 3
London → 0, Paris → 1, Berlin → 2

Specific Models That Excel:

XGBoost and LightGBM: Can split on any integer value effectively
Random Forest: Treats each split as categorical boundary
CatBoost: Has built-in categorical handling that works with label encoding

Concrete Example: In the Titanic dataset, embarked ports can be label encoded.

import pandas as pd
from sklearn.datasets import fetch_openml

titanic = fetch_openml('titanic', version=1, as_frame=True)
df = titanic.frame

# Label encoding for 'embarked' column
embarked_mapping = {'S': 0, 'C': 1, 'Q': 2}
df['embarked_label'] = df['embarked'].map(embarked_mapping)

Ordinal Encoding

Categories are mapped to integers that reflect meaningful order relationships.

Examples:

Small → 1, Medium → 2, Large → 3, XL → 4
Poor → 1, Fair → 2, Good → 3, Excellent → 4

Specific Models That Leverage Order:

Linear Regression: Assumes linear relationships between encoded values
Gradient Boosting: Creates more meaningful splits (e.g., “education ≥ 3”)
Neural Networks: Can learn non-linear transformations of ordinal relationships

Concrete Example: Adult Census dataset education feature.

education_order = {
    'Preschool': 1, '1st-4th': 2, '5th-6th': 3, '7th-8th': 4, '9th': 5,
    '10th': 6, '11th': 7, '12th': 8, 'HS-grad': 9, 'Some-college': 10,
    'Assoc-voc': 11, 'Assoc-acdm': 12, 'Bachelors': 13, 'Masters': 14, 
    'Prof-school': 15, 'Doctorate': 16
}

adult['education_ordinal'] = adult['education'].map(education_order)

Frequency Encoding

Categories are replaced with their occurrence frequency in the dataset.

Examples:

Apple → 1000 (appears 1000 times), Orange → 500 (appears 500 times)
Manager → 50, Developer → 200 (job titles by frequency)

Specific Models That Excel:

XGBoost and LightGBM: Can create splits like “if frequency > 500, then…”
Linear Regression with Polynomial Features: Frequency often correlates with target variables
Random Forest: Can use frequency as a ranking signal in tree splits

Concrete Example: In the Titanic dataset, passenger class frequency correlates with survival rates.

freq_map = df['embarked'].value_counts().to_dict()
# Results: {'S': 644, 'C': 168, 'Q': 77}
df['embarked_freq'] = df['embarked'].map(freq_map)

Target Encoding

Categories are replaced with the average target value for that category across the training set.

Examples:

Neighbourhood_A → 250000 (average house price £250k), Neighbourhood_B → 180000 (average £180k)
Product_Electronics → 0.15 (15% return rate), Product_Clothing → 0.08 (8% return rate)

Specific Models That Benefit Most:

Gradient Boosting Machines (XGBoost, LightGBM, CatBoost): Can directly split on target-encoded values
Linear and Logistic Regression: Creates direct linear relationships with target variable
Neural Networks: Provides strong signal when used alongside other features

Concrete Example: In the House Prices dataset (Ames, Iowa), neighbourhood names can be target-encoded with median house prices.

from sklearn.datasets import fetch_openml

ames = fetch_openml(name="house_prices", as_frame=True)
df = ames.frame

# Target encoding for neighbourhood
neighbourhood_means = df.groupby('Neighborhood')['SalePrice'].mean()
# Results: {'NoRidge': 335263, 'StoneBr': 310499, 'NridgHt': 307757...}
df['neighborhood_target_enc'] = df['Neighborhood'].map(neighbourhood_means)

Critical Warning: Target encoding must use cross-validation or leave-one-out to prevent data leakage. Never calculate target encoding on the same data you’re training on.

Compressed Binary Encodings

These methods convert categories to binary representations, offering a middle ground between single values and full one-hot encoding.

Binary Encoding

Categories are converted to binary representation, like computer bits.

Examples:

Red → [0,0,1], Blue → [0,1,0], Green → [0,1,1], Yellow → [1,0,0] (4 categories using 3 bits)
Country_1 → [0,0,0,0,1], Country_2 → [0,0,0,1,0] (32 countries using 5 bits)

Specific Models Where This Excels:

Random Forest and Extra Trees: Handle binary features efficiently
Neural Networks: Fewer parameters than one-hot encoding
Support Vector Machines: Reduces dimensionality compared to one-hot

Concrete Example: The Adult Census dataset has ‘native-country’ with 41 unique values.

from category_encoders import BinaryEncoder

encoder = BinaryEncoder(cols=['native-country'])
# 41 countries encoded in 6 binary columns (2^6 = 64 > 41)
adult_encoded = encoder.fit_transform(adult)

Hash Encoding

Categories are mapped to fixed dimensions using hash functions.

Examples:

Any_Category → [0,1,0,0,1,0,1,0] (8 fixed dimensions regardless of category count)
New_Unknown_Category → [1,0,1,0,0,1,0,1] (handles unseen categories automatically)

Specific Models That Handle This Well:

Online Learning Models (SGD, Passive-Aggressive): Can process streaming data with new categories
Vowpal Wabbit: Designed for hash-based feature encoding
Neural Networks with Embedding Layers: Can learn from hash collisions

Concrete Example: Text Classification with product categories where new categories appear regularly.

from category_encoders import HashingEncoder

categories = ['Electronics/Phones', 'Fashion/Shoes', 'Home/Kitchen', 'Unknown_New_Category_X']

encoder = HashingEncoder(cols=['category'], n_components=8)
# Any number of categories → always 8 dimensions
encoded_cats = encoder.fit_transform(pd.DataFrame({'category': categories}))

Vector Representations

These methods create multi-dimensional vector representations for categories, allowing for richer feature relationships.

One-Hot Encoding

Each category becomes a binary vector with exactly one ‘1’ and all other positions ‘0’.

Examples:

Red → [1,0,0], Blue → [0,1,0], Green → [0,0,1]
London → [1,0,0,0], Paris → [0,1,0,0], Berlin → [0,0,1,0], Rome → [0,0,0,1]

Specific Models That Require This:

Linear and Logistic Regression: Needs explicit feature representation for each category
Support Vector Machines: Cannot interpret arbitrary label encodings meaningfully
Naive Bayes: Works well with binary features

Concrete Example: Adult Census dataset workclass feature.

import pandas as pd

# One-hot encoding for workclass
workclass_onehot = pd.get_dummies(adult['workclass'], prefix='workclass')
# Creates columns: workclass_Private, workclass_Self-emp-not-inc, workclass_Self-emp-inc, etc.

Embedding Encoding

Categories are mapped to learned dense vector representations where similar categories have similar vectors.

Examples:

iPhone → [0.2, -0.5, 0.8, 0.1], Samsung → [0.3, -0.4, 0.7, 0.2] (similar phones have similar vectors)
Electronics → [0.1, 0.8, -0.2], Books → [-0.5, 0.3, 0.9] (different categories have different patterns)

Specific Models and Architectures:

TensorFlow/Keras Embedding Layers: For deep learning models
Entity2Vec: For recommendation systems
PyTorch Embedding: For categories with complex relationships

Concrete Example: MovieLens Recommendation where movie genres learn relationships.

import tensorflow as tf

# Product categories with hierarchy
vocab_size = 1000  # Number of unique categories
embedding_dim = 50

embedding_layer = tf.keras.layers.Embedding(
    input_dim=vocab_size,
    output_dim=embedding_dim,
    input_length=1
)

# Similar categories learn similar embeddings
# 'Action Movies' and 'Adventure Movies' vectors become closer than 'Action' and 'Documentary'

Embeddings can be thought of as “learned one-hot” encoding where instead of fixed, sparse vectors, the model learns dense representations that capture semantic relationships between categories.

Specialised Encodings

These methods are designed for specific statistical or domain requirements.

Weight of Evidence (WoE) Encoding

Categories are replaced with the log ratio of positive to negative events, commonly used in credit scoring.

Examples:

Good_Credit → 0.85 (ln(% of goods / % of bads)), Poor_Credit → -1.2
High_Income → 1.1, Low_Income → -0.8 (for loan default prediction)

Specific Models That Excel:

Logistic Regression: WoE creates linear relationship with log-odds
Credit Scoring Models: Industry standard for risk assessment
Binary Classification: Optimised for binary outcomes

Leave-One-Out Encoding

Similar to target encoding but excludes the current row when calculating the mean to prevent leakage.

Examples:

For row with Category_A: calculate mean target excluding this specific row
Category_A → 0.73 (mean of all other Category_A rows), prevents overfitting

Specific Models That Benefit:

Gradient Boosting: When target encoding shows overfitting
Competition Settings: Where overfitting is heavily penalised
Small Datasets: Where cross-validation is impractical

Model-Specific Recommendations

Gradient Boosting (XGBoost, LightGBM)

Best Encodings: Target encoding, frequency encoding, label encoding

Random Forest

Best Encodings: Label encoding, ordinal encoding, binary encoding

Linear/Logistic Regression

Best Encodings: One-hot encoding, WoE encoding, target encoding with regularisation

Neural Networks

Best Encodings: Embedding encoding, one-hot encoding, target encoding as additional features

Common Pitfalls and Solutions

Target Encoding Overfitting

Problem: Using target encoding without cross-validation creates data leakage. Solution: Use category_encoders.TargetEncoder with proper CV or implement leave-one-out encoding.

High Cardinality One-Hot Explosion

Problem: 1000 categories create 1000 sparse columns. Solution: Use binary encoding (10 columns) or hash encoding (fixed size).

Ignoring Category Hierarchy

Problem: Treating ‘Electronics → Phones → iPhone’ as unrelated categories. Solution: Use embedding encoding or create hierarchical features.

Conclusion

Categorical encoding choice significantly impacts model performance. Start with simple methods (label encoding for tree models, one-hot for linear models), then experiment with advanced techniques like target encoding or embeddings for complex datasets. Always validate your choices with cross-validation and consider the specific characteristics of your data and model architecture.

The key is matching the encoding method to both your model’s assumptions and your data’s structure. When in doubt, try multiple methods and let empirical results guide your decision.