Exploratory data analysis (EDA) is detective work on a dataset before modeling: understand shape, spot errors, form hypotheses, and decide what to clean. EDA is iterative—each plot or summary may send you back to the data dictionary.
Goals of EDA
- Understand structure — rows, columns, types, keys
- Assess quality — missing values, duplicates, impossible ranges
- Summarize distributions — center, spread, skew
- Explore relationships — correlations, segments, time trends
- Generate hypotheses — what might predict the outcome?
EDA does not prove causation—it prepares trustworthy questions for modeling and stakeholders.
Typical EDA order
- Read the data dictionary and business context
- Count rows/columns; list column types
- Profile missingness and duplicates
- Summarize numeric columns (mean, median, quantiles)
- Tabulate categorical columns (counts, proportions)
- Plot or cross-tab relationships worth investigating
If you have pandas installed locally, df.info(), df.describe(), and df.groupby() speed up these steps; this track starts with plain Python lists and dicts in the playground.
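The counting, profiling, summarizing, and tabulating steps above can be sketched with nothing but lists and dicts. This is a minimal illustration, not a full profiler: the dataset, the column names (order_id, region, amount), and every value are made up for the example, and None is assumed to mark a missing value.

```python
from collections import Counter
from statistics import mean, median, quantiles

# Hypothetical toy dataset: one row per order (all values are invented).
rows = [
    {"order_id": 1, "region": "north", "amount": 20.0},
    {"order_id": 2, "region": "south", "amount": 35.5},
    {"order_id": 3, "region": "north", "amount": None},
    {"order_id": 4, "region": "east", "amount": 12.0},
    {"order_id": 4, "region": "east", "amount": 12.0},  # exact duplicate row
    {"order_id": 5, "region": "south", "amount": 250.0},
]

# Count rows/columns and list the Python types seen in each column.
columns = list(rows[0].keys())
n_rows, n_cols = len(rows), len(columns)
col_types = {c: sorted({type(r[c]).__name__ for r in rows}) for c in columns}

# Profile missingness (None per column) and exact duplicate rows.
missing = {c: sum(r[c] is None for r in rows) for c in columns}
n_duplicates = len(rows) - len({tuple(r.items()) for r in rows})

# Summarize a numeric column, skipping missing values.
amounts = [r["amount"] for r in rows if r["amount"] is not None]
summary = {
    "mean": mean(amounts),
    "median": median(amounts),
    "quartiles": quantiles(amounts, n=4),
}

# Tabulate a categorical column as counts and proportions.
region_counts = Counter(r["region"] for r in rows)
region_props = {k: v / n_rows for k, v in region_counts.items()}

print(n_rows, n_cols, col_types)
print(missing, n_duplicates)
print(summary)
print(region_counts, region_props)
```

Here the mean of amount sits far above the median because of the single 250.0 order, which is the kind of skew worth noting before choosing a summary statistic.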
Questions to write down
Before opening tools, answer on paper:
- What is one row? (user, order, session?)
- What is the target or KPI?
- What time range and filters apply?
- What would surprise a domain expert?
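Once "what is one row?" has an answer on paper, it can be checked against the data. The sketch below assumes a hypothetical table where one row should be one order, keyed by order_id; the helper name repeated_keys and all values are invented for illustration.

```python
from collections import Counter

# Hypothetical rows; the written-down claim is "one row is one order".
rows = [
    {"order_id": 101, "user_id": 1},
    {"order_id": 102, "user_id": 1},
    {"order_id": 103, "user_id": 2},
    {"order_id": 103, "user_id": 2},  # repeated key: is the grain really "order"?
]

def repeated_keys(records, key):
    """Return key values that appear more than once, with their counts."""
    counts = Counter(r[key] for r in records)
    return {k: n for k, n in counts.items() if n > 1}

dupes = repeated_keys(rows, "order_id")
print(dupes)  # {103: 2}
```

An empty result supports the stated unit of observation; a non-empty one means either duplicates need cleaning or the row actually represents something finer, such as an order line.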
Connect to SQL
Many teams EDA in two layers: aggregate in SQL (counts, daily rollups), then load a sample into Python for deeper stats. The warehouse answers “how big”; the notebook answers “what pattern.”
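The two layers can be sketched with Python's built-in sqlite3 module standing in for the warehouse. The orders table, its columns, and the rows are hypothetical, and the LIMIT query is only a placeholder for a sample: on a real warehouse you would likely draw a random or time-bounded sample instead.

```python
import sqlite3
from statistics import median

# In-memory SQLite stands in for the warehouse; table and columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, order_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "2024-01-01", 20.0),
        (2, "2024-01-01", 35.5),
        (3, "2024-01-02", 12.0),
        (4, "2024-01-02", 250.0),
        (5, "2024-01-02", 18.0),
    ],
)

# Layer 1: SQL answers "how big" with a daily rollup.
daily = conn.execute(
    "SELECT order_date, COUNT(*), SUM(amount) FROM orders "
    "GROUP BY order_date ORDER BY order_date"
).fetchall()

# Layer 2: pull a bounded set of rows into Python for deeper statistics.
sample = [row[0] for row in conn.execute(
    "SELECT amount FROM orders ORDER BY order_id LIMIT 1000"
)]
sample_median = median(sample)
conn.close()

print(daily)
print(sample_median)
```

The rollup keeps the heavy counting in the database, while the Python side only sees as many rows as it needs for the median or any other statistic SQL makes awkward.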
Important interview questions and answers
- Q: EDA vs modeling?
A: EDA explores and questions data; modeling fits patterns to predict or explain with stated assumptions.
- Q: Why EDA before cleaning?
A: You cannot impute or drop wisely until you know how missingness and outliers are distributed.
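To make the point about missingness concrete, the sketch below computes the missing rate per segment before any imputation. The plan and income columns and all values are hypothetical, and None is assumed to mark a missing value.

```python
from collections import defaultdict

# Hypothetical rows; None marks a missing income value.
rows = [
    {"plan": "free", "income": None},
    {"plan": "free", "income": None},
    {"plan": "free", "income": 30000},
    {"plan": "paid", "income": 52000},
    {"plan": "paid", "income": 61000},
    {"plan": "paid", "income": 58000},
]

totals = defaultdict(int)   # rows per plan
missing = defaultdict(int)  # rows with missing income per plan
for r in rows:
    totals[r["plan"]] += 1
    if r["income"] is None:
        missing[r["plan"]] += 1

# Share of rows with missing income, per plan.
missing_rate = {plan: missing[plan] / totals[plan] for plan in totals}
print(missing_rate)
```

In this toy data the gaps are concentrated in the free plan, so filling them with the overall mean would mostly borrow values from paid users. Seeing that pattern first is what lets you choose between imputing per segment, dropping rows, or flagging missingness as its own feature.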
Self-check
- Name three goals of EDA.
- Why document the unit of observation before plotting?
- Where might SQL fit in an EDA workflow?
Tip: EDA is iterative—expect to revisit cleaning after plots.
Interview prep
- Q: What is the goal of EDA?
A: To understand the data before modeling, and to find errors and hypotheses.
- Q: Does EDA prove causation?
A: No. It suggests relationships to test carefully.