Pandas performance comes from vectorization and smart dtypes. Avoid Python row loops, use categoricals for repeated strings, and push heavy work to NumPy or query engines when data outgrows memory.
Do
- Vectorize with NumPy ufuncs and boolean masks
- Use category dtype for low-cardinality strings
- Select columns early; don't carry unused wide frames
- Use query() and eval() for large frames (optional numexpr)
- Read Parquet instead of CSV for repeated loads
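A minimal sketch of several items from the Do list, assuming a hypothetical orders table (the column names, city values, and the 500 threshold are invented for illustration). It selects columns early, filters with a boolean mask, repeats the filter with query(), and compares memory before and after converting a string column to category.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 100,000 orders across four cities.
rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "city": rng.choice(["Delhi", "Mumbai", "Pune", "Chennai"], size=n),
    "price": rng.uniform(1.0, 100.0, size=n),
    "qty": rng.integers(1, 10, size=n),
    "notes": "unused",
})

# Select only the columns needed before doing any work.
df = df[["city", "price", "qty"]]

# Vectorized arithmetic plus a boolean mask, with no Python-level row loop.
df["total"] = df["price"] * df["qty"]
mask = (df["city"] == "Pune") & (df["total"] > 500)
big_pune = df[mask]

# query() expresses the same filter as a string expression.
big_pune_q = df.query("city == 'Pune' and total > 500")

# Converting a low-cardinality string column to category shrinks memory.
bytes_before = df["city"].memory_usage(deep=True)
df["city"] = df["city"].astype("category")
bytes_after = df["city"].memory_usage(deep=True)
```

The mask and query() forms select the same rows; which one reads better is largely a matter of taste, and any speed-up from query() depends on frame size and whether numexpr is installed.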
Avoid
- iterrows() and apply(axis=1) on large data
- Repeated concat in loops; collect frames in a list, then concat once
- Chained indexing such as df[mask]["col"] = value, which may write to a temporary copy; use .loc instead
- Object dtype when numeric/category suffices
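Two of the Avoid items can be sketched as follows, assuming small hypothetical chunks in place of files or API pages read in a loop. The first part collects the pieces and concatenates once; the second replaces a chained-indexing assignment with a single .loc call.

```python
import pandas as pd

# Hypothetical chunks standing in for files or API pages read in a loop.
chunks = [
    pd.DataFrame({"id": [i * 2, i * 2 + 1], "score": [10 * i, 10 * i + 5]})
    for i in range(3)
]

# Collect the pieces in a list, then concatenate once at the end.
# Calling pd.concat inside the loop would recopy all accumulated rows each time.
parts = []
for chunk in chunks:
    parts.append(chunk)
combined = pd.concat(parts, ignore_index=True)

# A single .loc call selects rows and column together, so the assignment
# is applied to `combined` itself rather than to a temporary copy, which is
# what combined[combined["score"] > 15]["score"] = 0 may write to.
combined.loc[combined["score"] > 15, "score"] = 0
```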
When to leave Pandas
When data approaches or exceeds memory (often many millions of rows), or the work involves complex joins or cluster compute, move to SQL, Polars, DuckDB, or Spark. Pandas remains ideal for notebook-scale EDA.
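As one illustration of pushing work to a query engine, the sketch below uses the standard-library sqlite3 module as a stand-in for whichever SQL engine is actually available; the orders table and its columns are hypothetical. The database performs the aggregation and pandas receives only the small result.

```python
import sqlite3
import pandas as pd

# Hypothetical table; in practice the data would already live in the database.
orders = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Delhi", "Mumbai"],
    "amount": [100, 250, 50, 150, 300],
})

conn = sqlite3.connect(":memory:")
orders.to_sql("orders", conn, index=False)

# The database does the grouping; pandas only receives the summary rows.
summary = pd.read_sql_query(
    "SELECT city, SUM(amount) AS total FROM orders GROUP BY city ORDER BY city",
    conn,
)
conn.close()
```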
Important interview questions and answers
- Q: iterrows cost?
A: Returns each row as a Series, which adds Python overhead per row; use vectorization, or itertuples if you must loop.
- Q: Parquet benefit?
A: Columnar, typed, compressed; faster IO and smaller files than CSV.
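The iterrows answer can be made concrete with a small comparison. This is a sketch on a hypothetical three-row frame: all three approaches compute the same total, but only the last avoids a Python-level loop.

```python
import pandas as pd

df = pd.DataFrame({"price": [2.5, 4.0, 1.5], "qty": [2, 1, 4]})

# iterrows(): each row is built as a Series, which is costly and also
# upcasts the integer qty to float because a Series has a single dtype.
total_iterrows = 0.0
for _, row in df.iterrows():
    total_iterrows += row["price"] * row["qty"]

# itertuples(): lighter namedtuples that keep each column's own type.
total_itertuples = 0.0
for row in df.itertuples(index=False):
    total_itertuples += row.price * row.qty

# Vectorized: one multiply and one sum over whole columns.
total_vectorized = (df["price"] * df["qty"]).sum()
```

On three rows the difference is invisible; the per-row overhead only matters as the row count grows, so timing on realistic data is the way to confirm it.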
Self-check
- Name three Pandas performance best practices.
- When would you use SQL instead of Pandas?
Pitfall: iterrows() on millions of rows—profile vectorized alternatives first.
Interview prep
- iterrows?
Slow: it returns a Series per row, with Python overhead each time.
- When to leave Pandas?
When data exceeds RAM; move to SQL, Polars, DuckDB, or Spark.
- Parquet?
Faster, typed IO than CSV for pipeline artifacts.