Exploratory data analysis introduction

Last reviewed May 28, 2026 Content v20260528

Reading time: ~2 min

Level: beginner

This lesson

An orientation to the Data Science track—workflow, ethics, Python playground practice, and links to NumPy/Pandas next.

You need a clear map of the Data Science lifecycle so exploration, leakage, and stakeholder communication do not feel like ad hoc guessing.

You will apply exploratory data analysis in settings such as analytics teams, product experimentation, research labs, and ML-adjacent engineering at data-driven companies.

Read the narrative, run Python in the playground (standard-library snippets for now; install Jupyter, pandas, and scikit-learn locally for full notebooks), and complete the multiple-choice questions (MCQs) to lock in vocabulary. Also read the interview prep blocks, and write one measurable question for a dataset you care about.

Take this lesson after the /python/intro basics and, ideally, some of /sql/intro, but before specializing in NumPy/Pandas.

Exploratory data analysis (EDA) is detective work on a dataset before modeling: understand shape, spot errors, form hypotheses, and decide what to clean. EDA is iterative—each plot or summary may send you back to the data dictionary.

Goals of EDA

Understand structure — rows, columns, types, keys
Assess quality — missing values, duplicates, impossible ranges
Summarize distributions — center, spread, skew
Explore relationships — correlations, segments, time trends
Generate hypotheses — what might predict the outcome?

EDA does not prove causation—it prepares trustworthy questions for modeling and stakeholders.

Typical EDA order

Read the data dictionary and business context
Count rows/columns; list column types
Profile missingness and duplicates
Summarize numeric columns (mean, median, quantiles)
Tabulate categorical columns (counts, proportions)
Plot or cross-tab relationships worth investigating
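
Steps 2 and 3 of this order can be sketched with nothing but lists and dicts. This is a minimal illustration rather than a standard recipe: the `rows` sample and the helper names `profile_structure`, `count_missing`, and `count_duplicates` are hypothetical, and `None` is assumed to mark a missing value.

```python
# Hypothetical sample: each dict is one row (here, one order).
rows = [
    {"order_id": 1, "region": "north", "amount": 20.0},
    {"order_id": 2, "region": "south", "amount": None},
    {"order_id": 3, "region": "north", "amount": 35.5},
    {"order_id": 3, "region": "north", "amount": 35.5},  # exact duplicate
    {"order_id": 4, "region": None, "amount": 12.0},
]


def profile_structure(rows):
    """Return the row count, column names, and value types seen per column."""
    columns = sorted({key for row in rows for key in row})
    types = {
        col: sorted({type(row.get(col)).__name__ for row in rows})
        for col in columns
    }
    return len(rows), columns, types


def count_missing(rows, columns):
    """Count None values per column (assumption: None means 'missing')."""
    return {
        col: sum(1 for row in rows if row.get(col) is None)
        for col in columns
    }


def count_duplicates(rows):
    """Count rows that exactly repeat an earlier row."""
    seen = set()
    duplicates = 0
    for row in rows:
        # Sort the (column, value) pairs so column order does not matter.
        fingerprint = tuple(sorted(row.items()))
        if fingerprint in seen:
            duplicates += 1
        else:
            seen.add(fingerprint)
    return duplicates


n_rows, columns, types = profile_structure(rows)
missing = count_missing(rows, columns)
n_duplicates = count_duplicates(rows)

print(f"{n_rows} rows, columns: {columns}")
print("types:", types)
print("missing:", missing)
print("duplicate rows:", n_duplicates)
```

A column whose observed types include both `NoneType` and another type is a prompt to check the data dictionary before summarizing it.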

If you have pandas installed locally, df.info(), df.describe(), and df.groupby() speed up these steps; this track starts with plain Python lists and dicts in the playground.
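
For the playground, the numeric and categorical summaries (steps 4 and 5) can be approximated with the standard library's `statistics` module and `collections.Counter`. The sketch below assumes two hypothetical columns, `amounts` and `regions`; the helper names `summarize_numeric` and `tabulate` are illustrative only.

```python
import statistics
from collections import Counter

# Hypothetical columns; 250.0 is deliberately far from the other amounts.
amounts = [20.0, 35.5, 12.0, 18.0, 250.0]
regions = ["north", "south", "north", "east", "north"]


def summarize_numeric(values):
    """Center and spread for one numeric column, skipping None."""
    present = [v for v in values if v is not None]
    # quantiles(n=4) returns the three cut points between quartiles.
    q1, _, q3 = statistics.quantiles(present, n=4)
    return {
        "count": len(present),
        "min": min(present),
        "q1": q1,
        "median": statistics.median(present),
        "q3": q3,
        "max": max(present),
        "mean": statistics.mean(present),
    }


def tabulate(values):
    """Counts and proportions for one categorical column."""
    counts = Counter(values)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.most_common()}


numeric_summary = summarize_numeric(amounts)
region_table = tabulate(regions)

print(numeric_summary)
print(region_table)
```

In this sample the mean sits well above the median, which hints at right skew caused by the single large value. That is a cue to inspect the row, not to delete it.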

Questions to write down

Before opening tools, answer on paper:

What is one row? (user, order, session?)
What is the target or KPI?
What time range and filters apply?
What would surprise a domain expert?
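
Once the data is loaded, two of these paper answers can be checked against it: whether the claimed unit of observation really identifies one row, and what time range the data covers. The sketch below assumes a hypothetical `events` log with ISO 8601 date strings; `repeated_keys` and `date_range` are illustrative helper names.

```python
from collections import Counter

# Hypothetical event log; the intended unit is one row per (user_id, session_id).
events = [
    {"user_id": "u1", "session_id": "s1", "date": "2026-01-03"},
    {"user_id": "u1", "session_id": "s2", "date": "2026-01-05"},
    {"user_id": "u2", "session_id": "s1", "date": "2026-01-04"},
    {"user_id": "u2", "session_id": "s1", "date": "2026-01-04"},  # repeats a key
]


def repeated_keys(rows, key_columns):
    """Return each key that appears more than once, with its count."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


def date_range(rows, date_column):
    """Earliest and latest date; ISO 8601 strings sort chronologically."""
    dates = [row[date_column] for row in rows]
    return min(dates), max(dates)


by_user = repeated_keys(events, ["user_id"])
by_session = repeated_keys(events, ["user_id", "session_id"])
first_day, last_day = date_range(events, "date")

print("repeats if one row = one user:", by_user)
print("repeats if one row = one session:", by_session)
print("covers", first_day, "to", last_day)
```

If the key you wrote down on paper still shows repeats, either the unit of observation is something else or the data contains duplicates; both are worth resolving before any plot.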

Connect to SQL

Many teams do EDA in two layers: aggregate in SQL (counts, daily rollups), then load a sample into Python for deeper stats. The warehouse answers “how big”; the notebook answers “what pattern.”
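
The two layers can be tried end to end with Python's built-in `sqlite3` module standing in for a warehouse. Everything here is an assumption for illustration: the `orders` table, its columns, and the sample size are hypothetical, and a real warehouse would usually offer its own sampling syntax instead of `ORDER BY RANDOM()`.

```python
import sqlite3
import statistics

# Hypothetical stand-in for a warehouse: an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, order_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "2026-01-01", 10.0),
        (2, "2026-01-01", 20.0),
        (3, "2026-01-02", 15.0),
        (4, "2026-01-02", 25.0),
        (5, "2026-01-02", 5.0),
        (6, "2026-01-03", 40.0),
    ],
)

# Layer 1: aggregate in SQL to learn "how big".
daily = conn.execute(
    """
    SELECT order_date, COUNT(*) AS n_orders, SUM(amount) AS revenue
    FROM orders
    GROUP BY order_date
    ORDER BY order_date
    """
).fetchall()

# Layer 2: pull a random sample into Python to look for "what pattern".
sample = [
    amount
    for (amount,) in conn.execute(
        "SELECT amount FROM orders ORDER BY RANDOM() LIMIT 4"
    )
]
sample_median = statistics.median(sample)
conn.close()

for order_date, n_orders, revenue in daily:
    print(order_date, n_orders, revenue)
print("median of sampled amounts:", sample_median)
```

The daily rollup is deterministic, while the sampled median changes from run to run; that difference is a reminder that anything learned from a sample should be confirmed against the full table.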

Important interview questions and answers

Q: EDA vs modeling?
A: EDA explores and questions data; modeling fits patterns to predict or explain with stated assumptions.
Q: Why EDA before cleaning?
A: You cannot impute or drop wisely until you know how missingness and outliers are distributed.
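
One way to make the second answer concrete is to compare the missing rate across segments before choosing a cleaning strategy. The sketch below uses a hypothetical `accounts` sample in which `None` marks a missing `income`; `missing_rate_by_group` is an illustrative helper, not a library function.

```python
from collections import defaultdict

# Hypothetical accounts; income is missing mostly on the free plan.
accounts = [
    {"plan": "free", "income": None},
    {"plan": "free", "income": None},
    {"plan": "free", "income": 31000},
    {"plan": "free", "income": None},
    {"plan": "paid", "income": 58000},
    {"plan": "paid", "income": 62000},
    {"plan": "paid", "income": 47000},
    {"plan": "paid", "income": 71000},
]


def missing_rate_by_group(rows, group_col, value_col):
    """Share of rows per group where value_col is None."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        group = row[group_col]
        totals[group] += 1
        if row[value_col] is None:
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}


overall_rate = sum(1 for row in accounts if row["income"] is None) / len(accounts)
rates = missing_rate_by_group(accounts, "plan", "income")

print("overall missing rate:", overall_rate)
print("missing rate by plan:", rates)
```

The overall rate alone hides that the gaps are concentrated in one plan. Dropping incomplete rows here would remove most free-plan accounts, which is the kind of consequence EDA is meant to reveal before cleaning.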

Self-check

Name three goals of EDA.
Why document the unit of observation before plotting?
Where might SQL fit in an EDA workflow?

Tip: EDA is iterative—expect to revisit cleaning after plots.

Interview prep

Q: What is the goal of EDA?
A: To understand the data before modeling, and to surface errors and hypotheses.
Q: Does EDA prove causation?
A: No. It suggests relationships that still need careful testing.
