Performance tips

Last reviewed May 28, 2026 Content v20260528

Track mode: server_script
Means: Server runner
Reading: ~2 min
Level: advanced

This lesson

This lesson teaches Performance tips: Pandas tabular manipulation—indexing, dtypes, reshaping, and analysis habits for real-world tables.

Teams apply Performance tips in every serious Pandas project—skipping it leaves blind spots in analysis and reviews.

You will apply Performance tips in contexts like: CSV/Parquet analysis, ETL notebooks, and ad hoc reporting.

Read the narrative, run `import pandas as pd` snippets with in-memory DataFrames (install pandas and numpy with pip if needed), inspect `.head()`, `.dtypes`, and complete MCQs.

When basics, filtering, groupby, and merges from intermediate lessons are comfortable in the playground.

Pandas performance comes from vectorization and smart dtypes. Avoid Python row loops, use categoricals for repeated strings, and push heavy work to NumPy or query engines when data outgrows memory.

Do

Vectorize with NumPy ufuncs and boolean masks
Use category for low-cardinality strings
Select columns early—don't carry unused wide frames
Use query and eval for large frames (optional numexpr)
Read Parquet instead of CSV for repeated loads

Avoid

iterrows() and apply(axis=1) on large data
Repeated concat in loops—collect list then one concat
Chained indexing causing copies
Object dtype when numeric/category suffices

When to leave Pandas

Millions+ rows, complex joins, or cluster compute → SQL, Polars, DuckDB, or Spark. Pandas remains ideal for notebook-scale EDA.

Important interview questions and answers

Q: iterrows cost?
A: Returns each row as Series—Python overhead per row; use vectorization or itertuples if must loop.
Q: Parquet benefit?
A: Columnar, typed, compressed—faster IO and smaller files than CSV.

Self-check

Name three Pandas performance best practices.
When would you use SQL instead of Pandas?

Pitfall: iterrows() on millions of rows—profile vectorized alternatives first.

Interview prep

iterrows?: Slow—returns Series per row with Python overhead.
When leave Pandas?: Data exceeds RAM—SQL, Polars, DuckDB, Spark.
Parquet?: Faster typed IO than CSV for pipeline artifacts.

Interview tip Lesson completion confidence

Can you explain this lesson in 30 seconds without reading notes?

Self-reflection (saved on this device)

Not saved yet.

Playground

Runs on the configured server runner (dev: npm run runner with LEARNING_RUNNER_ENABLED=true). Output appears below the editor.

Code runner not available

Server runner is disabled. Set LEARNING_RUNNER_ENABLED=true and LEARNING_RUNNER_URL in .env (see .env.example).

Check yourself

Multiple choice — immediate feedback.

Discussion

Past discussion is visible to everyone. Only logged-in users can post comments and replies.

Starter discussion topics

Vectorize not apply?
category dtypes?

No discussion yet. Be the first to ask a question.