An outlier is a value unusually far from the bulk of the distribution. Some are data errors; others are rare but real events (fraud, a viral post). Deleting them blindly can hide signal.
Detecting outliers
- Domain rules: age > 120, negative inventory
- Z-score / IQR: statistical distance from the mean in standard deviations (z-score) or from the quartiles (IQR)
- Visual: box plots, scatter plots (e.g., with matplotlib)
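The first two approaches can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: the function names, the `max_age` limit, the thresholds, and the sample `ages` list are all assumptions made for the example.

```python
import statistics

def domain_flags(ages, max_age=120):
    """Flag ages that are impossible under a simple domain rule (assumed limits)."""
    return [a < 0 or a > max_age for a in ages]

def iqr_flags(values, k=1.5):
    """Flag values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v < lower or v > upper for v in values]

def zscore_flags(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:  # all values identical: nothing can be an outlier
        return [False] * len(values)
    return [abs(v - mean) / std > threshold for v in values]

# Hypothetical sample: nine plausible ages and one obvious entry error.
ages = [23, 31, 35, 29, 41, 38, 27, 33, 30, 250]

print(domain_flags(ages))
print(iqr_flags(ages))
print(zscore_flags(ages, threshold=2.5))
```

One caveat about the z-score on small samples: the outlier inflates the standard deviation it is measured against. With 10 values and the population standard deviation, no z-score can exceed 3, so the sketch passes a lower threshold. The IQR fences do not have this problem because quartiles barely move when one value is extreme.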
Error vs signal
Typos (an extra zero in a price) should be fixed or removed. Legitimate extremes (a CEO salary in a payroll export) can stay in the data if you use robust methods (median, tree models) or winsorization (capping extremes at chosen percentiles).
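Winsorization can be sketched as follows. The `winsorize` helper, the 5th/95th percentile defaults, and the salary figures are assumptions for illustration; libraries such as SciPy ship their own versions with different conventions.

```python
import statistics

def winsorize(values, lower_pct: int = 5, upper_pct: int = 95):
    """Cap values at the given percentiles; no rows are dropped."""
    if not 0 < lower_pct < upper_pct < 100:
        raise ValueError("percentiles must satisfy 0 < lower < upper < 100")
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, low), high) for v in values]

# Hypothetical payroll export in thousands; 900 is a legitimate CEO salary.
salaries = [40, 42, 45, 47, 50, 52, 55, 58, 60, 900]
capped = winsorize(salaries)

print(max(salaries), max(capped))
print(statistics.median(salaries), statistics.median(capped))
```

The row count and the median are unchanged; only the tails are pulled in. On a sample this small the percentile is interpolated between the two largest values, so the cap is still well above the typical salary. Choose the percentiles with the dataset size in mind.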
Impact on models
- Linear regression and means are outlier-sensitive
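- Tree-based models split on the order of feature values rather than their magnitude, so extreme feature values matter less (extreme target values can still skew leaf predictions)
- Metrics like RMSE punish large errors heavily because errors are squared before averaging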
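A small numeric sketch makes these sensitivities concrete. The data and the `rmse`/`mae` helpers are made up for illustration; MAE is included only as a contrast to RMSE.

```python
import math
import statistics

def rmse(errors):
    """Root mean squared error: squares each error before averaging."""
    return math.sqrt(statistics.fmean([e * e for e in errors]))

def mae(errors):
    """Mean absolute error: averages error magnitudes without squaring."""
    return statistics.fmean([abs(e) for e in errors])

# Mean vs median: append one extreme value to a hypothetical sample.
values = [10, 12, 11, 13, 12]
with_spike = values + [1000]
print("mean:  ", statistics.fmean(values), "->", statistics.fmean(with_spike))
print("median:", statistics.median(values), "->", statistics.median(with_spike))

# RMSE vs MAE: append one large miss to hypothetical prediction errors.
errors = [1, -1, 2, -2, 1]
errors_with_miss = errors + [20]
print("MAE: ", mae(errors), "->", mae(errors_with_miss))
print("RMSE:", rmse(errors), "->", rmse(errors_with_miss))
```

The mean jumps by more than a factor of ten while the median does not move, and the single large miss inflates RMSE proportionally more than MAE.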
Document decisions
Record which rows were capped, removed, or kept—and why. Stakeholders and auditors will ask.
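One lightweight way to keep such a record is a small decision log. The sketch below is one possible shape, not a standard: the `OutlierDecision` class, its field names, and the example rows are all hypothetical.

```python
from collections import Counter
from dataclasses import asdict, dataclass

ALLOWED_ACTIONS = {"capped", "removed", "kept"}

@dataclass(frozen=True)
class OutlierDecision:
    """One audited decision about a single flagged value (hypothetical schema)."""
    row_id: int
    column: str
    original_value: float
    action: str
    reason: str

    def __post_init__(self):
        if self.action not in ALLOWED_ACTIONS:
            raise ValueError(f"unknown action: {self.action!r}")
        if not self.reason.strip():
            raise ValueError("a reason is required for every decision")

# Made-up entries showing the three possible outcomes.
decisions = [
    OutlierDecision(17, "price", 19990.0, "removed", "extra zero, confirmed typo"),
    OutlierDecision(42, "salary", 900.0, "kept", "CEO salary, verified by HR"),
    OutlierDecision(58, "order_total", 7300.0, "capped", "winsorized at 95th percentile"),
]

summary = Counter(d.action for d in decisions)
print(summary)
print(asdict(decisions[0]))
```

Requiring a non-empty reason at construction time means the log cannot be filled in without the "why". The records can be exported with `asdict` to whatever format reviewers need.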
Important interview questions and answers
- Q: What is the idea behind the IQR rule?
A: Values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR are flagged for review, not deleted automatically.
- Q: What is winsorization?
A: Capping extreme values at chosen percentiles to limit their influence without dropping rows.
Self-check
- Give one domain rule outlier example.
- Why not delete all statistical outliers?
- How do outliers affect mean vs median?
Tip: Domain experts validate whether extremes are errors or signal.
Interview prep
- Q: IQR rule?
A: Values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR are flagged.
- Q: Is an outlier always an error?
A: No. It can be a valid extreme event, so ask domain experts.