Learn Netverks

Performance tips

Last reviewed May 28, 2026 Content v20260528
Track mode: server_script
Means: Server runner
Reading: ~2 min
Level: advanced

This lesson

This lesson covers performance tips for Pandas tabular work: indexing, dtypes, reshaping, and analysis habits for real-world tables.

These tips apply to most serious Pandas projects; skipping them leaves blind spots in analysis and reviews.

You will apply them in contexts such as CSV/Parquet analysis, ETL notebooks, and ad hoc reporting.

Read the narrative, run the snippets (each starts with `import pandas as pd` and uses in-memory DataFrames; install pandas and numpy with pip if needed), inspect `.head()` and `.dtypes`, and complete the MCQs.

Take this lesson once the basics, filtering, groupby, and merges from the intermediate lessons feel comfortable in the playground.

Pandas performance comes from vectorization and smart dtypes. Avoid Python row loops, use categoricals for repeated strings, push heavy numeric work to NumPy, and move to a query engine when data outgrows memory.
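A minimal sketch of the difference between a row loop and a vectorized column operation. The table and its column names (`price`, `qty`) are made up for illustration; only standard pandas and NumPy calls are used.

```python
import numpy as np
import pandas as pd

# Hypothetical in-memory table; the column names are illustrative only.
df = pd.DataFrame({
    "price": [10.0, 25.0, 40.0, 5.0],
    "qty": [3, 1, 2, 10],
})

# Row loop: one Python-level iteration (and one Series) per row.
loop_total = [row["price"] * row["qty"] for _, row in df.iterrows()]

# Vectorized: a single array operation over whole columns.
df["total"] = df["price"] * df["qty"]

# A boolean mask with np.where replaces a per-row if/else.
is_bulk = df["qty"] >= 3
df["discounted"] = np.where(is_bulk, df["total"] * 0.9, df["total"])

print(df)
```

Both approaches give the same totals on this tiny frame. The gap only becomes visible on large data, so time both on your own tables before drawing conclusions.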

Do

  • Vectorize with NumPy ufuncs and boolean masks
  • Use category for low-cardinality strings
  • Select columns early—don't carry unused wide frames
  • Use query() and eval() for large frames (they use numexpr when it is installed)
  • Read Parquet instead of CSV for repeated loads
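Some of the habits above can be sketched as follows, assuming a hypothetical table with a low-cardinality `city` column and an unused `notes` column. The exact byte counts depend on your pandas version and string backend, so the example compares them rather than quoting numbers.

```python
import pandas as pd

# Hypothetical data: three distinct city names repeated many times.
df = pd.DataFrame({
    "city": ["Oslo", "Lima", "Pune"] * 10_000,
    "sales": range(30_000),
    "notes": ["n/a"] * 30_000,
})

# Select only the columns the analysis needs before doing anything else.
slim = df[["city", "sales"]].copy()

# Compare memory for the string column before and after converting to category.
string_bytes = slim["city"].memory_usage(index=False, deep=True)
slim["city"] = slim["city"].astype("category")
category_bytes = slim["city"].memory_usage(index=False, deep=True)
print(f"string: {string_bytes} bytes, category: {category_bytes} bytes")

# query() takes the filter as a string expression.
top = slim.query("sales >= 29990")
print(len(top))
```

The category column stores each distinct string once plus small integer codes, which is why it shrinks when values repeat heavily. On a column of mostly unique strings the conversion would not help.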

Avoid

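  • iterrows() and apply(axis=1) on large data
  • Repeated concat in loops; collect frames in a list, then concat once
  • Chained indexing such as df[mask]["col"] = value, which can assign to a temporary copy; use a single .loc call instead
  • Object dtype when a numeric or category dtype suffices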
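The alternatives to the patterns above might look like this. It is a sketch with made-up data: the `chunks` list stands in for files or API pages, and the column names are hypothetical.

```python
import pandas as pd

# Hypothetical chunks standing in for files or API pages.
chunks = [
    pd.DataFrame({"id": [i, i + 1], "score": [0.5, 0.7]})
    for i in range(0, 6, 2)
]

# Collect the frames in a list and concatenate once,
# rather than calling pd.concat inside the loop.
combined = pd.concat(chunks, ignore_index=True)

# One .loc call selects rows and column together,
# instead of chained indexing like combined[mask]["score"] = 1.0.
combined.loc[combined["id"] >= 4, "score"] = 1.0

# Numbers stored as strings: convert them to a numeric dtype.
raw = pd.Series(["1", "2", "3"])
numeric = pd.to_numeric(raw)

print(combined)
print(numeric.dtype)
```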

When to leave Pandas

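When data grows to many millions of rows or no longer fits in memory, when joins get complex, or when the work needs a cluster, move to SQL, Polars, DuckDB, or Spark. Pandas remains ideal for notebook-scale exploratory data analysis (EDA).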

Important interview questions and answers

  1. Q: Why is iterrows() costly?
    A: It returns each row as a Series, so every row pays Python-level overhead. Use vectorization, or itertuples() if you must loop.
  2. Q: What is the benefit of Parquet?
    A: It is columnar, typed, and compressed, which gives faster IO and smaller files than CSV.
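For the case where a loop is unavoidable, here is a small sketch of itertuples() on a hypothetical two-column frame. Each row arrives as a lightweight namedtuple rather than a Series.

```python
import pandas as pd

# Hypothetical frame; the column names are illustrative only.
df = pd.DataFrame({"sku": ["a", "b"], "units": [1, 2]})

labels = []
# index=False leaves the index out of each tuple.
for row in df.itertuples(index=False):
    labels.append(f"{row.sku}:{row.units}")

print(labels)
```

Even here a vectorized expression such as `df["sku"] + ":" + df["units"].astype(str)` would avoid the loop entirely. itertuples() is the fallback when the per-row logic cannot be expressed that way.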

Self-check

  1. Name three Pandas performance best practices.
  2. When would you use SQL instead of Pandas?

Pitfall: calling iterrows() on millions of rows. Try a vectorized alternative and profile it first.

Interview prep

iterrows?

Slow—returns Series per row with Python overhead.

When to leave Pandas?

Data exceeds RAM—SQL, Polars, DuckDB, Spark.

Parquet?

Faster typed IO than CSV for pipeline artifacts.

Interview tip

Can you explain this lesson in 30 seconds without reading notes?
