Production data science leans on NumPy for fast arrays and Pandas for labeled tables. This lesson connects the standard library's statistics module to the operations those libraries accelerate. Install both locally after completing this track's workflow lessons.
Division of labor
- NumPy: ndarray, vectorized math, linear algebra (np.linalg)
- Pandas: DataFrame, CSV/Parquet IO, groupby, merge, time series
- SciPy / scikit-learn: statistics and machine learning (install locally)
Stdlib bridge (runnable)
The same session counts used in the earlier EDA lessons can be summarized with the statistics module and list comprehensions. With NumPy, np.array(sessions).mean() computes the mean over millions of rows without a Python-level loop.
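A minimal stdlib sketch of that bridge follows. The `sessions` values are hypothetical placeholders, since the earlier lessons' data is not reproduced here, and the variable names are illustrative only.

```python
import statistics

# Hypothetical session counts (assumption: not the earlier lessons' actual data).
sessions = [3, 7, 5, 12, 7, 9, 4]

# Summary statistics from the standard library.
mean_sessions = statistics.mean(sessions)
median_sessions = statistics.median(sessions)
stdev_sessions = statistics.stdev(sessions)  # sample standard deviation (n - 1)

# List comprehension: keep only the counts above the mean.
above_mean = [s for s in sessions if s > mean_sessions]

print(f"mean={mean_sessions:.2f} median={median_sessions} stdev={stdev_sessions:.2f}")
print("above mean:", above_mean)
```

Each call here walks the list in Python. NumPy moves that same arithmetic into compiled code operating on a contiguous array.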
Pandas equivalents (local)
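import pandas as pd

# rows: a list of dicts with 'country' and 'sessions' keys (illustrative values)
rows = [{'country': 'US', 'sessions': 12}, {'country': 'US', 'sessions': 7}, {'country': 'DE', 'sessions': 5}]
df = pd.DataFrame(rows)
print(df.describe())  # count, mean, std, min, quartiles, max for numeric columns
print(df.groupby('country')['sessions'].median())  # median sessions per country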
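To see roughly what the groupby line saves you, here is a stdlib-only sketch of the same per-country median. The `rows` data and the `median_by` helper are hypothetical and exist only for this illustration. Pandas does the grouping internally and returns a labeled Series rather than a dict.

```python
import statistics
from collections import defaultdict

# Hypothetical rows (assumption): one dict per record, as in a list-of-dicts table.
rows = [
    {"country": "US", "sessions": 12},
    {"country": "US", "sessions": 7},
    {"country": "DE", "sessions": 5},
    {"country": "DE", "sessions": 9},
    {"country": "DE", "sessions": 4},
]


def median_by(records, key, value):
    """Group records by `key` and return the median of `value` per group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record[value])
    # Sort the group labels, mirroring the default sorted output of pandas groupby.
    return {label: statistics.median(vals) for label, vals in sorted(groups.items())}


median_sessions = median_by(rows, "country", "sessions")
print(median_sessions)
```

The manual version needs an explicit loop and a helper for every new aggregation, which is the bookkeeping that a DataFrame's groupby takes over.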
Learning path
Recommended order after this track: NumPy first, then Pandas.
Important interview questions and answers
- Q: Why NumPy?
A: C-backed contiguous arrays, orders of magnitude faster than pure Python loops on numeric data.
- Q: DataFrame vs list of dicts?
A: A DataFrame adds column indexes, alignment, and IO: same rows, richer API.
Self-check
- What does NumPy optimize?
- What pandas function summarizes numeric columns?
- Name two topics to study after this track.
Tip: Continue on NumPy then Pandas tracks.
Interview prep
- NumPy role?
Fast ndarray math; foundation of pandas.
- pandas role?
Tabular DataFrame operations.