Find duplicate rows with duplicated, remove them with drop_duplicates, and use subset to name the columns that identify a logical record. Deduplicating is critical before merges and aggregations, where repeated keys multiply rows and inflate totals.
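To illustrate why duplicates matter before a merge, here is a minimal sketch. The users and orders tables, and the name and amount columns, are hypothetical and chosen only for the example.

```python
import pandas as pd

# Hypothetical data: 'users' repeats one email, 'orders' has one row per order.
users = pd.DataFrame({'email': ['a@x.com', 'b@x.com', 'a@x.com'],
                      'name': ['Ann', 'Bob', 'Ann']})
orders = pd.DataFrame({'email': ['a@x.com', 'b@x.com'],
                       'amount': [10, 20]})

# Merging on a duplicated key repeats the matching order row.
inflated = orders.merge(users, on='email', how='left')
print(len(inflated))             # 3 rows, although there are only 2 orders
print(inflated['amount'].sum())  # 40 instead of 30

# Deduplicating the lookup table first keeps one row per order.
unique_users = users.drop_duplicates(subset=['email'])
deduped = orders.merge(unique_users, on='email', how='left')
print(len(deduped))              # 2
print(deduped['amount'].sum())   # 30
```

Any aggregation run on the inflated frame would silently double-count the order for a@x.com.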
Finding duplicates
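import pandas as pd
# Three rows; the first and third share the same email.
df = pd.DataFrame({'email': ['a@x.com','b@x.com','a@x.com']})
# Boolean mask: True for each row that repeats an earlier row.
print(df.duplicated())
# Count rows whose email already appeared earlier.
print(df.duplicated(subset=['email']).sum())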
drop_duplicates
# Keep the first row for each email and drop later repeats.
clean = df.drop_duplicates(subset=['email'], keep='first')
print(clean)
keep parameter
- keep='first': retain the first occurrence in row order (default)
- keep='last': retain the last occurrence in row order
- keep=False: drop every row that has a duplicate, keeping none of them
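The three options can be compared side by side in a short sketch. The plan column is a hypothetical addition so the surviving rows are easy to tell apart.

```python
import pandas as pd

# 'plan' is a hypothetical column used only to distinguish the rows.
df = pd.DataFrame({'email': ['a@x.com', 'b@x.com', 'a@x.com'],
                   'plan': ['free', 'pro', 'team']})

first = df.drop_duplicates(subset=['email'], keep='first')
last = df.drop_duplicates(subset=['email'], keep='last')
none = df.drop_duplicates(subset=['email'], keep=False)

# The original index labels show which rows survived.
print(first.index.tolist())  # [0, 1]
print(last.index.tolist())   # [1, 2]
print(none.index.tolist())   # [1]
```

With keep=False only b@x.com remains, because both a@x.com rows belong to a duplicated group.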
Important interview questions and answers
- Q: duplicated vs drop_duplicates?
A: duplicated returns a boolean mask; drop_duplicates returns a filtered DataFrame.
- Q: subset?
A: Only consider listed columns when defining duplicate rows.
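Both answers can be checked with a small sketch. The city column is hypothetical; it makes the rows differ as whole rows while still sharing an email.

```python
import pandas as pd

# 'city' is a hypothetical column: rows 0 and 2 share an email but differ overall.
df = pd.DataFrame({'email': ['a@x.com', 'b@x.com', 'a@x.com'],
                   'city': ['Oslo', 'Rome', 'Bergen']})

full_row_mask = df.duplicated()               # compares every column
email_mask = df.duplicated(subset=['email'])  # compares only 'email'

print(full_row_mask.tolist())  # [False, False, False]
print(email_mask.tolist())     # [False, False, True]

# Filtering with the inverted mask gives the same rows as drop_duplicates.
filtered = df[~email_mask]
print(filtered.equals(df.drop_duplicates(subset=['email'])))  # True
```

Without subset, no row counts as a duplicate here, so the repeated email would go unnoticed.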
Self-check
- Count duplicate emails.
- Keep first occurrence per email.
Tip: Pass to subset= the columns that identify a logical record; these are not always all columns.
Interview prep
- subset?
Only specified columns define duplicate identity.
- keep='last'?
Retain the last occurrence in row order when deduplicating; it is the most recent record only if the rows are sorted chronologically.
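Because keep='last' follows row order rather than time, a common pattern is to sort first. The sketch below assumes a hypothetical updated_at timestamp column; it is not part of the original example data.

```python
import pandas as pd

# 'updated_at' is a hypothetical timestamp column; rows are not in time order.
df = pd.DataFrame({
    'email': ['a@x.com', 'b@x.com', 'a@x.com'],
    'updated_at': pd.to_datetime(['2024-03-01', '2024-01-15', '2024-02-01']),
})

# Without sorting, keep='last' keeps row 2, which is the older a@x.com record.
naive = df.drop_duplicates(subset=['email'], keep='last')
print(naive.index.tolist())   # [1, 2]

# Sorting by time first makes "last" mean "most recent".
latest = (df.sort_values('updated_at')
            .drop_duplicates(subset=['email'], keep='last')
            .sort_index())
print(latest.index.tolist())  # [0, 1]
```

The final sort_index call only restores the original row order; it can be dropped if order does not matter.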