Bay Information Systems

A Complete Guide to Categorical Encoding for Machine Learning

Categorical encoding remains one of the most crucial preprocessing steps in machine learning, yet choosing the right method often feels like guesswork. This guide provides specific model recommendations, concrete examples, and practical datasets to demonstrate when and why each encoding method works best.

Understanding the Problem

Most machine learning algorithms require numerical input, but real-world data contains categorical variables like product types, geographical regions, or customer segments. The encoding method you choose can dramatically impact model performance, sometimes by 10-20% or more.

Single-Value Encodings

These methods convert each category to a single numerical value, making them memory-efficient and suitable for tree-based models.

Label/Integer Encoding

Categories are mapped to arbitrary integers without implying order.

Examples: ports of embarkation, blood types, product IDs, or any nominal feature with no natural order.

Specific Models That Excel: tree-based models such as Random Forest, XGBoost, and LightGBM, which split on the encoded values without treating them as magnitudes.

Concrete Example: In the Titanic dataset, the port of embarkation ('embarked') can be label encoded.

import pandas as pd
from sklearn.datasets import fetch_openml

titanic = fetch_openml('titanic', version=1, as_frame=True)
df = titanic.frame

# Label encoding for 'embarked' column
embarked_mapping = {'S': 0, 'C': 1, 'Q': 2}
df['embarked_label'] = df['embarked'].map(embarked_mapping)

Ordinal Encoding

Categories are mapped to integers that reflect meaningful order relationships.

Examples: education levels, clothing sizes (S < M < L < XL), satisfaction ratings (poor < fair < good).

Specific Models That Leverage Order: linear models, which can fit a single coefficient to the ordered scale, and tree-based models, which need fewer splits when the order is meaningful.

Concrete Example: the education feature in the Adult Census dataset.

from sklearn.datasets import fetch_openml

# Load the Adult Census data from OpenML
adult = fetch_openml('adult', version=2, as_frame=True).frame

education_order = {
    'Preschool': 1, '1st-4th': 2, '5th-6th': 3, '7th-8th': 4, '9th': 5,
    '10th': 6, '11th': 7, '12th': 8, 'HS-grad': 9, 'Some-college': 10,
    'Assoc-voc': 11, 'Assoc-acdm': 12, 'Bachelors': 13, 'Masters': 14,
    'Prof-school': 15, 'Doctorate': 16
}

adult['education_ordinal'] = adult['education'].map(education_order)

Frequency Encoding

Categories are replaced with their occurrence frequency in the dataset.

Examples: city names, product SKUs, or other high-cardinality features where a category's popularity carries signal.

Specific Models That Excel: gradient boosting models such as XGBoost and LightGBM, which can split directly on the frequency values.

Concrete Example: In the Titanic dataset, the port of embarkation can be replaced with how often each port occurs.

freq_map = df['embarked'].value_counts().to_dict()
# e.g. {'S': 644, 'C': 168, 'Q': 77} on the 891-row Kaggle training split
df['embarked_freq'] = df['embarked'].map(freq_map)

Target Encoding

Categories are replaced with the average target value for that category across the training set.

Examples: neighbourhood names, job titles, merchant categories, or other high-cardinality features with a clear relationship to the target.

Specific Models That Benefit Most: gradient boosting models (XGBoost, LightGBM), and linear models when the encoding is regularised.

Concrete Example: In the House Prices dataset (Ames, Iowa), neighbourhood names can be target-encoded with the mean sale price.

from sklearn.datasets import fetch_openml

ames = fetch_openml(name="house_prices", as_frame=True)
df = ames.frame

# Target encoding for neighbourhood
neighbourhood_means = df.groupby('Neighborhood')['SalePrice'].mean()
# Results (approximate): {'NoRidge': 335263, 'StoneBr': 310499, 'NridgHt': 307757, ...}
df['neighborhood_target_enc'] = df['Neighborhood'].map(neighbourhood_means)

Critical Warning: Target encoding must be computed out-of-fold, via cross-validation or leave-one-out, to prevent data leakage. A row's own target value must never contribute to its own encoding; one out-of-fold approach is sketched below.
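A minimal out-of-fold sketch, using scikit-learn's KFold together with the category_encoders TargetEncoder and the Ames columns from the example above:

import numpy as np
from category_encoders import TargetEncoder
from sklearn.model_selection import KFold

# Each row is encoded by an encoder fitted only on the other folds,
# so its own SalePrice never influences its encoding.
oof = np.zeros(len(df))
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(df):
    enc = TargetEncoder(cols=['Neighborhood'])
    enc.fit(df.iloc[train_idx][['Neighborhood']], df.iloc[train_idx]['SalePrice'])
    oof[val_idx] = enc.transform(df.iloc[val_idx][['Neighborhood']])['Neighborhood']
df['neighborhood_target_cv'] = oof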

Compressed Binary Encodings

These methods convert categories to binary representations, offering a middle ground between single values and full one-hot encoding.

Binary Encoding

Categories are converted to binary representation, like computer bits.

Examples: country codes, postcodes, or other features with tens to hundreds of categories.

Specific Models Where This Excels: tree-based models such as Random Forest, which handle the compact binary columns efficiently.

Concrete Example: The Adult Census dataset has ‘native-country’ with 41 unique values.

from category_encoders import BinaryEncoder

encoder = BinaryEncoder(cols=['native-country'])
# 41 countries encoded in 6 binary columns (2^6 = 64 > 41)
adult_encoded = encoder.fit_transform(adult)

Hash Encoding

Categories are mapped to fixed dimensions using hash functions.

Examples: streaming data, user agents, or product catalogues where new categories keep appearing after training.

Specific Models That Handle This Well: online and large-scale learners that need a fixed-size feature space regardless of cardinality.

Concrete Example: text classification with product categories, where new categories appear regularly.

import pandas as pd
from category_encoders import HashingEncoder

categories = ['Electronics/Phones', 'Fashion/Shoes', 'Home/Kitchen', 'Unknown_New_Category_X']

encoder = HashingEncoder(cols=['category'], n_components=8)
# Any number of categories → always 8 dimensions
encoded_cats = encoder.fit_transform(pd.DataFrame({'category': categories}))

Vector Representations

These methods create multi-dimensional vector representations for categories, allowing for richer feature relationships.

One-Hot Encoding

Each category becomes a binary vector with exactly one ‘1’ and all other positions ‘0’.

Examples: low-cardinality nominal features such as workclass, payment method, or colour.

Specific Models That Require This: linear and logistic regression, SVMs, and neural networks, which would otherwise misinterpret integer codes as magnitudes.

Concrete Example: Adult Census dataset workclass feature.

import pandas as pd

# One-hot encoding for workclass
workclass_onehot = pd.get_dummies(adult['workclass'], prefix='workclass')
# Creates columns: workclass_Private, workclass_Self-emp-not-inc, workclass_Self-emp-inc, etc.

Embedding Encoding

Categories are mapped to learned dense vector representations where similar categories have similar vectors.

Examples: movie genres, product catalogues, user IDs, or words in text, especially at very high cardinality.

Specific Models and Architectures: neural networks, from feed-forward models with embedding layers to recommender and language architectures.

Concrete Example: a MovieLens-style recommender, where genre embeddings learn relationships between categories.

import tensorflow as tf

vocab_size = 1000   # number of unique categories
embedding_dim = 50  # dimensionality of each learned vector

# Maps each integer category ID to a trainable 50-dimensional vector
embedding_layer = tf.keras.layers.Embedding(
    input_dim=vocab_size,
    output_dim=embedding_dim
)

# During training, similar categories converge to similar vectors:
# 'Action' and 'Adventure' end up closer together than 'Action' and 'Documentary'

Embeddings can be thought of as “learned one-hot” encoding where instead of fixed, sparse vectors, the model learns dense representations that capture semantic relationships between categories.

Specialised Encodings

These methods are designed for specific statistical or domain requirements.

Weight of Evidence (WoE) Encoding

Categories are replaced with the log of the ratio between the category's share of positive events and its share of negative events, a technique standard in credit scoring.

Examples: occupation, income bracket, or other features in binary-outcome problems such as loan default prediction.

Specific Models That Excel: logistic regression, particularly in credit scoring, where WoE keeps coefficients on a consistent, interpretable scale.
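Concrete Example: a minimal sketch using the category_encoders WOEEncoder on the Adult Census data loaded earlier; the choice of the occupation column and the target construction are illustrative.

from category_encoders import WOEEncoder

# WoE needs a binary 0/1 target; on OpenML the Adult target column is 'class'
y = (adult['class'] == '>50K').astype(int)

encoder = WOEEncoder(cols=['occupation'])
# Each occupation is replaced by ln(share of positives / share of negatives)
adult_woe = encoder.fit_transform(adult[['occupation']], y)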

Leave-One-Out Encoding

Similar to target encoding but excludes the current row when calculating the mean to prevent leakage.

Examples: the same high-cardinality features that suit target encoding, when built-in leakage protection is wanted.

Specific Models That Benefit: gradient boosting and linear models, as with target encoding.
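Concrete Example: a minimal sketch using the category_encoders LeaveOneOutEncoder on the Ames data from the target encoding section.

from category_encoders import LeaveOneOutEncoder

encoder = LeaveOneOutEncoder(cols=['Neighborhood'])
# Each row receives the mean SalePrice of its neighbourhood computed
# with that row excluded, so its own target never leaks in.
encoded = encoder.fit_transform(df[['Neighborhood']], df['SalePrice'])
df['neighborhood_loo'] = encoded['Neighborhood']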

Model-Specific Recommendations

Gradient Boosting (XGBoost, LightGBM)

Best Encodings: Target encoding, frequency encoding, label encoding

Random Forest

Best Encodings: Label encoding, ordinal encoding, binary encoding

Linear/Logistic Regression

Best Encodings: One-hot encoding, WoE encoding, target encoding with regularisation

Neural Networks

Best Encodings: Embedding encoding, one-hot encoding, target encoding as additional features
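These pairings are a starting point rather than a rule; when in doubt, compare encodings empirically. A minimal sketch using scikit-learn pipelines with category_encoders, where the encoder and model choices are illustrative and X and y stand for any categorical feature frame and target:

from category_encoders import BinaryEncoder, OrdinalEncoder, TargetEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

encoders = {
    'ordinal': OrdinalEncoder(),
    'binary': BinaryEncoder(),
    'target': TargetEncoder(),
}
for name, enc in encoders.items():
    # Fitting the encoder inside the pipeline re-fits it on each training
    # fold, so no target information leaks into the validation fold.
    pipe = make_pipeline(enc, RandomForestClassifier(n_estimators=100, random_state=0))
    scores = cross_val_score(pipe, X, y, cv=5)
    print(f'{name}: {scores.mean():.3f}')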

Common Pitfalls and Solutions

Target Encoding Overfitting

Problem: Using target encoding without cross-validation creates data leakage. Solution: Use category_encoders.TargetEncoder with proper CV or implement leave-one-out encoding.

High Cardinality One-Hot Explosion

Problem: 1000 categories create 1000 sparse columns. Solution: Use binary encoding (10 columns) or hash encoding (fixed size).

Ignoring Category Hierarchy

Problem: Treating ‘Electronics → Phones → iPhone’ as unrelated categories. Solution: Use embedding encoding or create hierarchical features.
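For example, a slash-delimited category path can be split into one feature per level, so related leaves share parent-level features (a sketch assuming that delimiter and illustrative column names):

import pandas as pd

products = pd.DataFrame({'category': ['Electronics/Phones/iPhone',
                                      'Electronics/Laptops/MacBook']})
# One column per hierarchy level; both rows now share category_l1
levels = products['category'].str.split('/', expand=True)
levels.columns = ['category_l1', 'category_l2', 'category_l3']
products = products.join(levels)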

Conclusion

Categorical encoding choice significantly impacts model performance. Start with simple methods (label encoding for tree models, one-hot for linear models), then experiment with advanced techniques like target encoding or embeddings for complex datasets. Always validate your choices with cross-validation and consider the specific characteristics of your data and model architecture.

The key is matching the encoding method to both your model’s assumptions and your data’s structure. When in doubt, try multiple methods and let empirical results guide your decision.