Distributions concept

Last reviewed May 28, 2026 Content v20260528

Track mode: server_script
Means: Server runner
Reading: ~2 min
Level: beginner

This lesson

This lesson teaches the concept of distributions, part of the data science mindset, methods, and communication habits behind evidence-based decisions.

Teams rely on distributions in every serious data science project; skipping them leaves blind spots in analysis and reviews.

You will apply distributions in analytics teams, product experimentation, research labs, and ML-adjacent engineering at data-driven companies.

Read the narrative, run Python in the playground (stdlib snippets now; install Jupyter, pandas, and scikit-learn locally for full notebooks), and complete MCQs to lock in vocabulary.

Start this lesson when you can explain the previous lesson's ideas in your own words.

A distribution describes how often values appear: for numbers, where most points cluster and how spread out they are; for categories, which labels dominate.

Center and spread

Mean — average; sensitive to extreme values
Median — middle value; robust when data are skewed
Standard deviation — typical distance from the mean
Quantiles — cut points (25th, 50th, 75th percentiles)

Revenue and session counts are often right-skewed: a few large values pull the mean above the median.
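To make the four summary statistics above concrete, here is a minimal stdlib sketch. The `revenue` list is made-up illustrative data, not taken from the lesson, and `statistics.quantiles` assumes Python 3.8 or newer.

```python
import statistics

# Hypothetical daily revenue figures (illustrative data only).
revenue = [12, 15, 18, 20, 22, 25, 30, 45, 80, 400]

mean = statistics.mean(revenue)
median = statistics.median(revenue)
spread = statistics.stdev(revenue)  # sample standard deviation

# n=4 splits the data into quartiles: 25th, 50th, and 75th percentiles.
q1, q2, q3 = statistics.quantiles(revenue, n=4)

print("mean:", round(mean, 2))
print("median:", median)
print("stdev:", round(spread, 2))
print("quartiles:", q1, q2, q3)
```

The single value of 400 drags the mean far above the median, while the quartiles barely notice it. That gap is the usual signature of a right-skewed metric.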

Shapes you will see

Symmetric — mean ≈ median (e.g. measurement noise)
Right-skewed — long tail of large values (income, clicks)
Bimodal — two peaks (two customer segments mixed)

Whether to report the mean or the median depends on the distribution's shape and the audience, not on habit.
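One rough way to check shape before choosing a statistic is to compare the mean and the median directly. The sketch below uses Pearson's second skewness coefficient, 3 × (mean − median) / standard deviation. This measure is not part of the lesson; it is swapped in here as a simple illustration, and the helper name `pearson_median_skew` and both samples are hypothetical.

```python
import statistics

def pearson_median_skew(values):
    """Pearson's second skewness coefficient: 3 * (mean - median) / stdev."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    return 3 * (mean - median) / statistics.stdev(values)

# Illustrative samples (made up for this sketch).
symmetric = [8, 9, 10, 10, 10, 11, 12]   # mean equals median
right_skewed = [1, 2, 2, 3, 3, 4, 50]    # one large value forms a long right tail

print("symmetric:", round(pearson_median_skew(symmetric), 2))
print("right-skewed:", round(pearson_median_skew(right_skewed), 2))
```

A value near zero suggests a roughly symmetric sample, and a clearly positive value suggests right skew. A single number like this cannot reveal a bimodal shape, so it complements a histogram rather than replacing one.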

Categorical distributions

Bar charts of counts show category frequency. Watch for class imbalance: if 99% of transactions are negatives, a fraud detector that always predicts "not fraud" is 99% accurate yet catches no fraud.
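The imbalance problem can be reproduced in a few lines. This sketch assumes a made-up dataset of 990 legitimate and 10 fraudulent transactions and a trivial "model" that always predicts the majority class; none of these names or numbers come from the lesson.

```python
from collections import Counter

# Hypothetical labels: 0 = legitimate, 1 = fraud (illustrative data only).
labels = [0] * 990 + [1] * 10

# Category frequencies: the counts a bar chart would display.
print(Counter(labels))

# A trivial baseline that always predicts the majority class.
predictions = [0] * len(labels)

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall on the fraud class: share of real fraud cases the model flagged.
fraud_caught = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = fraud_caught / sum(labels)

print("accuracy:", accuracy)
print("recall on fraud:", recall)
```

Accuracy looks excellent while recall on the fraud class is zero. That is why a per-class metric is usually reported alongside accuracy when labels are imbalanced.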

Stdlib preview

import statistics
# One extreme value (100) pulls the mean well above the median.
values = [2, 3, 3, 7, 9, 11, 100]
print('mean:', round(statistics.mean(values), 2))  # 19.29
print('median:', statistics.median(values))  # 7

Install NumPy locally for histograms and vectorized stats on large arrays.

Important interview questions and answers

Q: Mean vs median when skewed?
A: Prefer median for skewed money or latency metrics; mean can mislead executives.
Q: What is class imbalance?
A: One label dominates the dataset, so a model can score high accuracy by always predicting the majority class.

Self-check

When is median more informative than mean?
What does right-skew mean?
Why does class imbalance affect accuracy?

Tip: Plot histograms locally with matplotlib after this track.

Interview prep

Skew: an asymmetric tail; the mean is pulled toward the extreme values.
Histogram: bin counts that show the shape of a numeric variable.
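Until matplotlib is installed, bin counts can be approximated with the standard library. The sketch below is a text histogram under stated assumptions: the helper `bin_counts`, the bin width of 10, and the `orders` data are all hypothetical and chosen only to make two peaks visible.

```python
from collections import Counter

def bin_counts(values, width):
    """Count values per fixed-width bin, keyed by each bin's lower edge."""
    return Counter((v // width) * width for v in values)

# Hypothetical order values from two mixed customer segments (illustrative data).
orders = [8, 9, 10, 11, 11, 12, 13, 48, 49, 50, 51, 52, 53]

counts = bin_counts(orders, width=10)

# Print one row per bin; empty bins show no marks (Counter returns 0 for them).
for edge in range(0, 60, 10):
    print(f"{edge:>2}-{edge + 9:<2} {'#' * counts[edge]}")
```

The empty bins between the two clusters are what a bimodal shape looks like in a histogram. Bin width matters: a width of 100 would collapse both peaks into a single bar.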

