Scaling puts numeric features on comparable ranges so distance-based models (k-NN, SVM, neural nets, regularized regression) train fairly. Tree models often do not require scaling.
Standardization vs normalization
- Standardization (z-score) — subtract mean, divide by std → roughly mean 0, std 1
- Min-max scaling — squeeze to [0, 1] using min and max
- Robust scaling — use median and IQR when outliers present
Fit on train, apply to test
Compute scaling parameters from training data only, then transform validation and test with those same parameters—another leakage guardrail.
When it matters less
Random forests and gradient boosted trees split on thresholds—they are scale-invariant for many setups. Still scale when mixing model types in one pipeline.
NumPy connection
Vectorized scaling is fast with NumPy arrays; sklearn StandardScaler wraps this in production pipelines locally.
Important interview questions and answers
- Q: Why scale for k-NN?
A: Features with large units dominate distance—age in years vs income in thousands. - Q: Leakage via scaling?
A: Computing mean/std on full data before split leaks test distribution into train.
Self-check
- What is z-score standardization?
- Which model families often need scaling?
- Why fit the scaler on training data only?
Tip: Fit scaler on training data only.
Interview prep
- Why scale?
Features on different units can dominate distance-based models.