Machine learning models need numbers. Categorical encoding maps labels like country=IN or plan=premium into numeric representations models can use.
Common encodings
- One-hot — binary column per category (watch high cardinality)
- Ordinal — integers for ordered levels (low < medium < high)
- Target encoding — mean target per category (advanced; risks target leakage)
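The ordinal and target encodings listed above can be sketched in plain Python. The sample rows, the LEVEL_ORDER mapping, and the helper name target_means are hypothetical and chosen only for illustration; a real project would more likely use pandas or scikit-learn.

```python
from collections import defaultdict

# Hypothetical training rows: (priority level, binary target).
rows = [("low", 0), ("high", 1), ("medium", 0), ("high", 1), ("medium", 1)]

# Ordinal encoding: integers that preserve low < medium < high.
LEVEL_ORDER = {"low": 0, "medium": 1, "high": 2}
ordinal = [LEVEL_ORDER[level] for level, _ in rows]


def target_means(pairs):
    """Return the mean target per category (naive target encoding)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for category, target in pairs:
        totals[category] += target
        counts[category] += 1
    return {category: totals[category] / counts[category] for category in totals}


means = target_means(rows)
target_encoded = [means[level] for level, _ in rows]
```

Note that this naive version computes each mean from the same rows it then encodes, so every encoded value contains its own target. That is the leakage risk; computing the means on separate folds is one common way to reduce it.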
Cardinality trap
User IDs as categories explode feature count and memorize training noise. Aggregate to higher-level features (region, signup cohort) instead.
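One rough way to see the trap is to count how many one-hot columns each feature would need. The records and the helper name one_hot_width below are made up for illustration.

```python
# Hypothetical signup records; every user_id is unique, regions repeat.
records = [
    {"user_id": "u1", "region": "APAC"},
    {"user_id": "u2", "region": "EMEA"},
    {"user_id": "u3", "region": "APAC"},
    {"user_id": "u4", "region": "EMEA"},
]


def one_hot_width(rows, field):
    """Number of one-hot columns a field would need: one per distinct value."""
    return len({row[field] for row in rows})


user_id_width = one_hot_width(records, "user_id")  # grows with every new user
region_width = one_hot_width(records, "region")    # bounded by the region list
```

In this toy data user_id needs as many columns as there are rows, while region needs only two, which is why aggregating to the higher-level feature keeps the table narrow.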
Pandas preview (local)
import pandas as pd
# df is assumed to be an existing DataFrame with a 'plan' column
pd.get_dummies(df['plan'], prefix='plan')
See the pandas documentation for get_dummies; scikit-learn offers OneHotEncoder for use in pipelines.
Unknown categories at scoring time
Production models see new labels. Pipelines should map unknowns to an “other” bucket defined during training—not crash.
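A minimal sketch of the "other" bucket idea, using only the standard library. The bucket name OTHER and the helper names fit_categories and one_hot are assumptions of this example, not part of any library.

```python
OTHER = "other"  # reserved bucket name; an assumption of this sketch


def fit_categories(training_values):
    """Fix the column order at training time, with the "other" bucket last."""
    return sorted(set(training_values) - {OTHER}) + [OTHER]


def one_hot(value, categories):
    """Encode one value; anything unseen in training falls into "other"."""
    bucket = value if value in categories else OTHER
    return [1 if category == bucket else 0 for category in categories]


# Training sees only "free" and "premium".
categories = fit_categories(["free", "premium", "free"])

seen = one_hot("premium", categories)       # known label
unseen = one_hot("enterprise", categories)  # new label at scoring time
```

Because the column list is fixed during training, scoring always produces a vector of the same width, and a new label lands in the last column instead of raising an error. scikit-learn's OneHotEncoder exposes a handle_unknown option that serves a similar purpose.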
Important interview questions and answers
- Q: One-hot encoding?
A: Each category becomes its own 0/1 feature column; this is the default for unordered nominals.
- Q: Why avoid user_id as feature?
A: Extreme cardinality—model memorizes individuals, fails on new users.
Self-check
- What is one-hot encoding?
- When is ordinal encoding appropriate?
- What happens if production sees a new category?
Tip: One-hot encoding produces wide tables, so watch cardinality.
Interview prep
- One-hot?
Binary column per category for many ML models.