Duplicate handling

Last reviewed May 28, 2026 Content v20260528

Reading: ~1 min
Level: intermediate

This lesson

This lesson teaches duplicate handling, part of Pandas tabular manipulation: indexing, dtypes, reshaping, and analysis habits for real-world tables.

Most Pandas projects need duplicate handling at some point. Skipping it lets repeated rows distort counts, sums, and joins without any error being raised.

You will apply duplicate handling in contexts such as CSV/Parquet analysis, ETL notebooks, and ad hoc reporting.

Read the narrative, run `import pandas as pd` snippets with in-memory DataFrames (install pandas and numpy with pip if needed), inspect `.head()`, `.dtypes`, and complete MCQs.

Start this lesson when you can explain the previous lesson's ideas in your own words.

Find duplicates with duplicated, remove them with drop_duplicates, and use subset to name the columns that identify a logical record. Do this before merges and aggregations, because repeated keys multiply rows in a join and inflate counts and sums.
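As a rough illustration of why duplicates matter before a merge, the sketch below joins two small tables on email. The customers and orders tables, and their region and amount columns, are invented for this example and are not part of the lesson's data.

```python
import pandas as pd

# Hypothetical tables, invented for illustration only.
customers = pd.DataFrame({
    'email': ['a@x.com', 'b@x.com', 'a@x.com'],  # a@x.com appears twice
    'region': ['EU', 'US', 'EU'],
})
orders = pd.DataFrame({
    'email': ['a@x.com', 'b@x.com'],
    'amount': [100, 50],
})

# A key duplicated on one side repeats the matching row from the other side.
merged = orders.merge(customers, on='email', how='left')
inflated_total = merged['amount'].sum()

# Deduplicating the key first keeps exactly one row per order.
unique_customers = customers.drop_duplicates(subset=['email'])
merged_clean = orders.merge(unique_customers, on='email', how='left')
clean_total = merged_clean['amount'].sum()

print(len(merged), inflated_total)        # 3 250
print(len(merged_clean), clean_total)     # 2 150
```

The merge itself raises no error in either case, which is why checking the key for duplicates beforehand is worth the extra line.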

Finding duplicates

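import pandas as pd
# Small in-memory table; the first and third rows share an email.
df = pd.DataFrame({'email': ['a@x.com','b@x.com','a@x.com']})
# Boolean Series: True for each row that repeats an earlier row.
print(df.duplicated())
# Number of rows whose email already appeared earlier (1 here).
print(df.duplicated(subset=['email']).sum())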

drop_duplicates

# Keep the first row for each email; df is the DataFrame built above.
clean = df.drop_duplicates(subset=['email'], keep='first')
print(clean)

keep parameter

keep='first': retain the first occurrence in row order (default)
keep='last': retain the last occurrence in row order
keep=False: drop every row that has a duplicate, keeping none of the copies
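The three keep options can be compared side by side on one small table. This is a minimal sketch; the plan column is a hypothetical addition used only to make the surviving rows easy to tell apart.

```python
import pandas as pd

df = pd.DataFrame({
    'email': ['a@x.com', 'b@x.com', 'a@x.com'],
    'plan': ['free', 'pro', 'team'],  # hypothetical column for illustration
})

# Same subset each time; only keep changes.
first = df.drop_duplicates(subset=['email'], keep='first')
last = df.drop_duplicates(subset=['email'], keep='last')
none = df.drop_duplicates(subset=['email'], keep=False)

# Original index labels show which rows survived.
print(first.index.tolist())  # [0, 1]
print(last.index.tolist())   # [1, 2]
print(none.index.tolist())   # [1]
```

Note that the original index labels are preserved; call reset_index(drop=True) afterwards if a fresh 0..n-1 index is wanted.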

Important interview questions and answers

Q: duplicated vs drop_duplicates?
A: duplicated returns a boolean Series that marks repeated rows; drop_duplicates returns a new DataFrame with those rows removed.
Q: subset?
A: Only the listed columns are compared when deciding whether two rows are duplicates.
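The relationship between the two methods in the answers above can be checked directly. The sketch below assumes the lesson's three-row email table. It shows that filtering with the inverted mask gives the same result as drop_duplicates, and that the mask is also useful for inspecting duplicates before deleting anything.

```python
import pandas as pd

df = pd.DataFrame({'email': ['a@x.com', 'b@x.com', 'a@x.com']})

# duplicated: a boolean Series aligned with df's index.
mask = df.duplicated(subset=['email'])
print(mask.tolist())  # [False, False, True]

# Filtering with the inverted mask matches drop_duplicates with the same arguments.
via_mask = df[~mask]
via_method = df.drop_duplicates(subset=['email'])
same = via_mask.equals(via_method)
print(same)  # True

# keep=False on duplicated flags every copy, which helps when reviewing rows.
all_copies = df[df.duplicated(subset=['email'], keep=False)]
print(all_copies.index.tolist())  # [0, 2]
```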

Self-check

Count duplicate emails.
Keep first occurrence per email.

Tip: Pass subset= the columns that identify a logical record; these are not always all columns.
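To see why the choice of subset matters, consider a table where no two rows are identical across all columns, yet one person appears twice. The signups table and its source column are hypothetical and serve only as an illustration.

```python
import pandas as pd

# Hypothetical signup log: the same email arrives through two channels.
signups = pd.DataFrame({
    'email': ['a@x.com', 'a@x.com', 'b@x.com'],
    'source': ['web', 'mobile', 'web'],
})

# Comparing all columns finds nothing, because 'source' differs.
full_row_dupes = int(signups.duplicated().sum())

# Treating email alone as the record identity finds the repeat.
by_email = int(signups.duplicated(subset=['email']).sum())

# A composite key (email + source) treats each channel as its own record.
by_email_source = int(signups.duplicated(subset=['email', 'source']).sum())

print(full_row_dupes, by_email, by_email_source)  # 0 1 0
```

Which of these answers is correct depends on what one row is supposed to represent, so the key is worth deciding before deduplicating.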

Interview prep

subset?: Only the specified columns define duplicate identity.
keep='last'?: Retains the last occurrence in row order, which is the most recent record only if the rows are sorted chronologically.
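Because keep='last' works on row order rather than on time, a common pattern is to sort by a timestamp first. The sketch below assumes a hypothetical updated_at column; the events table is invented for illustration.

```python
import pandas as pd

# Hypothetical change log, deliberately not in chronological order.
events = pd.DataFrame({
    'email': ['a@x.com', 'b@x.com', 'a@x.com'],
    'updated_at': pd.to_datetime(['2024-03-01', '2024-01-15', '2024-02-01']),
    'plan': ['team', 'pro', 'free'],
})

# Without sorting, keep='last' keeps the row that happens to come last.
unsorted_last = events.drop_duplicates(subset=['email'], keep='last')

# Sorting by timestamp first makes "last" mean "most recent".
latest = (
    events.sort_values('updated_at')
          .drop_duplicates(subset=['email'], keep='last')
)

print(unsorted_last.set_index('email').loc['a@x.com', 'plan'])  # free
print(latest.set_index('email').loc['a@x.com', 'plan'])         # team
```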
