Handling missing values

Last reviewed May 28, 2026 Content v20260528

Reading: ~2 min
Level: beginner

This lesson

This lesson covers how to handle missing values: the mindset, methods, and communication habits behind evidence-based decisions.

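The missing-data mechanism decides whether imputation is safe: missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). Filling values blindly, without asking why they are missing, creates false confidence.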

You will apply these techniques to messy CSV exports, API logs, and survey data before any dashboard ships.

Read the narrative, run Python in the playground (stdlib snippets now; install Jupyter, pandas, and scikit-learn locally for full notebooks), and complete MCQs to lock in vocabulary.

Start this lesson when you can explain the previous lesson's ideas in your own words.

After auditing missingness, choose a strategy per column: drop, impute, or model missingness explicitly. The right choice depends on how much is missing and why.
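A minimal sketch of the audit step, using only the standard library. The sample rows, the column names, and the rule that None and empty strings count as missing are assumptions made for illustration.

```python
# Hypothetical sample rows; None and "" both count as missing here (an assumption).
rows = [
    {"user_id": "u1", "age": 34, "plan": "pro"},
    {"user_id": "u2", "age": None, "plan": "free"},
    {"user_id": "u3", "age": 29, "plan": ""},
    {"user_id": "u4", "age": None, "plan": "free"},
]

def is_missing(value):
    """Treat None and empty or whitespace-only strings as missing."""
    return value is None or (isinstance(value, str) and value.strip() == "")

def missing_fraction(rows, column):
    """Share of rows in which `column` is missing."""
    return sum(is_missing(r.get(column)) for r in rows) / len(rows)

# One number per column: how much is missing.
report = {col: missing_fraction(rows, col) for col in rows[0]}
print(report)  # {'user_id': 0.0, 'age': 0.5, 'plan': 0.25}
```

The report answers "how much is missing"; the "why" still needs domain knowledge about how the data was collected.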

When to drop rows

Very few rows are affected and the missingness shows no pattern tied to the target
Critical identifier missing (cannot join or attribute)

Dropping many rows can bias results if missingness is not random.
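The bias can be seen in a small, made-up example. It assumes an MNAR pattern in which respondents with incomes above 80 (in thousands) skipped the question; both the values and the threshold are hypothetical.

```python
import statistics

# Hypothetical incomes in thousands; the values are made up for illustration.
true_incomes = [20, 25, 30, 35, 40, 90, 120, 150]

# Assumed MNAR pattern: respondents with income above 80 skipped the question.
observed = [x if x <= 80 else None for x in true_incomes]

# "Drop rows" strategy: keep only the rows that have an answer.
kept = [x for x in observed if x is not None]

true_mean = statistics.mean(true_incomes)
kept_mean = statistics.mean(kept)
print(true_mean, kept_mean)
```

Under these assumptions the mean falls from 63.75 to 30 after dropping, because the dropped rows are exactly the high earners.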

Imputation options

Numeric — median (robust), mean, or group-wise median by category
Categorical — mode (most frequent) or explicit “unknown”
Advanced — model-based imputation (use with care, fit on train only)
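The simple options above can be sketched with the standard library. The rows and column names are hypothetical, and the statistics are computed on all rows only to keep the example short; in a modeling workflow they would come from the training split.

```python
import statistics
from collections import defaultdict

# Hypothetical rows; None marks a missing value.
rows = [
    {"segment": "retail", "revenue": 100, "channel": "web"},
    {"segment": "retail", "revenue": 120, "channel": None},
    {"segment": "retail", "revenue": None, "channel": "web"},
    {"segment": "enterprise", "revenue": 900, "channel": "sales"},
    {"segment": "enterprise", "revenue": None, "channel": "sales"},
    {"segment": "enterprise", "revenue": 1100, "channel": "web"},
]

# Numeric: group-wise median, computed from observed values only.
by_segment = defaultdict(list)
for r in rows:
    if r["revenue"] is not None:
        by_segment[r["segment"]].append(r["revenue"])
segment_median = {seg: statistics.median(vals) for seg, vals in by_segment.items()}

# Categorical: mode of the observed values (an explicit "unknown" is the alternative).
channel_mode = statistics.mode([r["channel"] for r in rows if r["channel"] is not None])

# Build imputed copies so the original rows stay untouched.
imputed = [
    {
        **r,
        "revenue": segment_median[r["segment"]] if r["revenue"] is None else r["revenue"],
        "channel": channel_mode if r["channel"] is None else r["channel"],
    }
    for r in rows
]
```

Group-wise medians keep the retail and enterprise fills apart (110 and 1000 here), where a single global median would blur the two segments together.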

Indicators

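Add a flag column such as revenue_was_missing when missingness may carry signal (for example, optional survey questions or partial form completion). Create the flag before imputing, because the original gaps are no longer visible afterward.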
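A minimal sketch of an indicator column. The helper name add_missing_flag and the sample orders are hypothetical, and None is assumed to mark a missing value.

```python
# Hypothetical orders; None marks a missing revenue value.
rows = [
    {"order_id": 1, "revenue": 250.0},
    {"order_id": 2, "revenue": None},
    {"order_id": 3, "revenue": 80.0},
]

def add_missing_flag(rows, column):
    """Return copies of rows with a `<column>_was_missing` boolean added."""
    flag = f"{column}_was_missing"
    return [{**r, flag: r.get(column) is None} for r in rows]

# Run this before any imputation so the flag reflects the raw data.
flagged = add_missing_flag(rows, "revenue")
print([r["revenue_was_missing"] for r in flagged])  # [False, True, False]
```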

Train-only fitting

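# Pattern after the train/test split: learn the fill value from train only.
import statistics
train_ages, test_ages = [22, None, 31, 40], [None, 75]  # hypothetical values
median_age = statistics.median(a for a in train_ages if a is not None)
train_ages = [median_age if a is None else a for a in train_ages]
# Reuse the train median on test; never recompute it from test data.
test_ages = [median_age if a is None else a for a in test_ages]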

Important interview questions and answers

Q: Why use the median for skewed numeric columns?
A: The median is pulled less by outliers than the mean, which makes it a common default for imputation.
Q: What is the risk of imputing on the full dataset?
A: Test-set information leaks into training through global statistics, which can inflate evaluation metrics.
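The leak in the second answer can be made concrete with made-up numbers. The ages below are hypothetical and chosen so that the test split is older than the train split. The sketch only shows how the fill value shifts, not the downstream effect on a model's metrics.

```python
import statistics

# Hypothetical ages; None marks a missing value.
train_ages = [22, 25, None, 31, 40]
test_ages = [68, None, 75, 81]

def observed(values):
    """Keep only the non-missing values."""
    return [v for v in values if v is not None]

# Leaky: the statistic sees test rows before evaluation.
leaky_median = statistics.median(observed(train_ages + test_ages))

# Safe: the statistic is learned from train rows only.
train_median = statistics.median(observed(train_ages))

print(leaky_median, train_median)
```

Here the leaky median is 40 while the train-only median is 28.0, so the training rows would be filled with a value that already reflects the test distribution.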

Self-check

Name two imputation strategies for categoricals.
When is dropping rows reasonable?
Why fit imputation on training data only?

Tip: Document imputation strategy in the README.

Interview prep

Drop vs impute? Drop when only a few rows are affected and the missingness looks random; otherwise impute with care and document the choice.



