A distribution describes how often values appear: for numbers, where most points cluster and how spread out they are; for categories, which labels dominate.
Center and spread
- Mean — average; sensitive to extreme values
- Median — middle value; robust when data are skewed
- Standard deviation — typical distance from the mean
- Quantiles — cut points (25th, 50th, 75th percentiles)
Revenue and session counts are often right-skewed: a few large values pull the mean above the median.
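As a rough sketch of the summaries listed above, Python's standard `statistics` module can compute all four. The sample list is made up for illustration; its single large value stands in for a right-skewed tail.

```python
import statistics

# Made-up sample: six small values plus one extreme value (right-skewed).
values = [2, 3, 3, 7, 9, 11, 100]

mean = statistics.mean(values)
median = statistics.median(values)
stdev = statistics.stdev(values)                # sample standard deviation
q1, q2, q3 = statistics.quantiles(values, n=4)  # quartile cut points

print("mean:", round(mean, 2))
print("median:", median)
print("stdev:", round(stdev, 2))
print("quartiles:", q1, q2, q3)

# Dropping the single extreme value moves the mean a lot, the median only a little.
trimmed = values[:-1]
trimmed_mean = statistics.mean(trimmed)
trimmed_median = statistics.median(trimmed)
print("mean without 100:", round(trimmed_mean, 2))
print("median without 100:", trimmed_median)
```

Note that `statistics.quantiles` uses its default "exclusive" method here; other tools may interpolate percentiles differently and report slightly different cut points.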
Shapes you will see
- Symmetric — mean ≈ median (e.g. measurement noise)
- Right-skewed — long tail of large values (income, clicks)
- Bimodal — two peaks (two customer segments mixed)
Choosing mean vs median for reporting depends on shape and audience—not habit.
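One way to check a shape before choosing a summary is to count values per bin. The sketch below prints a text histogram using only the standard library. The helper name `text_histogram`, the bin width, and the data are all hypothetical, and the helper assumes integer values.

```python
from collections import Counter

def text_histogram(values, bin_width):
    """Print one row per bin: the bin's lower edge and a bar of '#' marks."""
    # Assign each integer value to a bin by its lower edge.
    counts = Counter((v // bin_width) * bin_width for v in values)
    for start in range(min(counts), max(counts) + bin_width, bin_width):
        print(f"{start:>4} | {'#' * counts[start]}")
    return counts

# Made-up data with two clusters, mimicking two customer segments mixed together.
bimodal = [12, 14, 15, 15, 17, 18, 21, 62, 64, 65, 65, 67, 68, 71]
counts = text_histogram(bimodal, bin_width=10)
```

Two separated groups of bars with empty bins between them suggest a bimodal shape, where a single mean or median describes neither group well.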
Categorical distributions
Bar charts of counts show category frequency. Watch for class imbalance: if 99% of transactions are negatives, a fraud detector that always predicts "negative" scores 99% accuracy while catching no fraud.
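To make the imbalance problem concrete, here is a minimal sketch with made-up labels and a baseline that always predicts the majority class. The counts (990 negatives, 10 positives) are illustrative assumptions, not real data.

```python
# Made-up labels: 990 legitimate transactions (0) and 10 fraudulent ones (1).
labels = [0] * 990 + [1] * 10

# A baseline "model" that ignores its input and always predicts the majority class.
predictions = [0] * len(labels)

# Accuracy: share of all predictions that match the true label.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall: share of actual fraud cases that the model flagged.
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_positives / sum(labels)

print("accuracy:", accuracy)
print("recall:", recall)
```

The accuracy looks strong while recall shows that no fraud case was caught, which is why a metric that focuses on the minority class is usually reported alongside accuracy.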
Stdlib preview
import statistics
values = [2, 3, 3, 7, 9, 11, 100]
print('mean:', round(statistics.mean(values), 2))
print('median:', statistics.median(values))
Install NumPy locally for histograms and vectorized stats on large arrays.
Important interview questions and answers
- Q: Mean vs median when skewed?
A: Prefer median for skewed money or latency metrics; mean can mislead executives.
- Q: What is class imbalance?
A: One label dominates the dataset—models may predict the majority class always.
Self-check
- When is median more informative than mean?
- What does right-skew mean?
- Why does class imbalance affect accuracy?
Tip: Plot histograms locally with matplotlib after this track.
Interview prep
- Skew?
An asymmetric tail; the mean is pulled toward the extreme values.
- Histogram?
Bin counts show the shape of a numeric variable.