AI quality ceilings are set by data quality: coverage, accuracy, timeliness, and representativeness. Models amplify patterns in data—including mistakes and historical bias.
Data sources
- Operational databases and event logs
- User-generated content (reviews, uploads)
- Third-party datasets and APIs
- Synthetic or augmented data (use with validation)
Quality dimensions
| Dimension | Question |
|---|---|
| Completeness | Are key fields missing? |
| Accuracy | Do values match reality? |
| Consistency | Same entity, same ID everywhere? |
| Timeliness | Fresh enough for the decision? |
| Representativeness | Does train data match production users? |
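Several of the questions in the table can be turned into rough numeric checks. The sketch below is a minimal illustration using only the Python standard library. The record fields (`user_id`, `email`, `updated_at`), the helper names, and the 30-day freshness window are all assumptions made for the example, not part of any real schema.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical sample records; field names are illustrative assumptions.
records = [
    {"user_id": "u1", "email": "a@example.com", "updated_at": datetime(2024, 1, 10)},
    {"user_id": "u2", "email": None, "updated_at": datetime(2024, 1, 9)},
    {"user_id": "u3", "email": "c@example.com", "updated_at": datetime(2023, 6, 1)},
    {"user_id": "u4", "email": "c@example.com", "updated_at": datetime(2024, 1, 8)},
]


def completeness(rows, field):
    """Completeness: fraction of rows where `field` is present and not None."""
    if not rows:
        return 0.0
    return sum(r.get(field) is not None for r in rows) / len(rows)


def conflicting_ids(rows, entity_field, id_field):
    """Consistency: entities that map to more than one distinct ID."""
    ids = defaultdict(set)
    for r in rows:
        if r.get(entity_field) is not None:
            ids[r[entity_field]].add(r[id_field])
    return {entity: sorted(v) for entity, v in ids.items() if len(v) > 1}


def stale_fraction(rows, ts_field, now, max_age):
    """Timeliness: fraction of rows older than `max_age` relative to `now`."""
    if not rows:
        return 0.0
    return sum(now - r[ts_field] > max_age for r in rows) / len(rows)


now = datetime(2024, 1, 11)  # fixed reference time so the example is repeatable
print("email completeness:", completeness(records, "email"))
print("conflicting IDs:", conflicting_ids(records, "email", "user_id"))
print("stale fraction:", stale_fraction(records, "updated_at", now, timedelta(days=30)))
```

Accuracy and representativeness are harder to score this way: accuracy usually needs a trusted reference to compare against, and representativeness needs a sample of production data.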
Inventory sketch
    # Document datasets before modeling
    datasets = [
        {"name": "clicks", "rows": 1_000_000, "pii": False},
        {"name": "support_tickets", "rows": 50_000, "pii": True},
    ]
    for d in datasets:
        print(d["name"], "PII:", d["pii"])

Practice: Optional snippets use pandas-style pseudocode; run them with pandas locally if you want hands-on practice.
Important interview questions and answers
- Q: Garbage in, garbage out?
A: Noisy labels and missing groups limit any algorithm's ceiling.
- Q: PII in training?
A: Requires a legal basis, data minimization, and secure storage; see the privacy lessons.
Self-check
- List three data quality dimensions.
- Why document PII before training?
Tip: Inventory datasets with PII flags before any modeling conversation.
Interview prep
- Q: Representativeness?
A: Training data should match production users and conditions.
- Q: PII before modeling?
A: Document lawful basis, minimization, and secure handling.
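One rough way to probe representativeness is to compare how often each category appears in training data versus production traffic. The sketch below assumes a single categorical field (device type) and a hypothetical 0.25 threshold for flagging a gap. The function names and sample values are made up for illustration.

```python
from collections import Counter


def shares(values):
    """Return each category's share of the total (empty input gives {})."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {k: n / total for k, n in counts.items()}


def share_gaps(train, prod):
    """Production share minus training share, per category."""
    t, p = shares(train), shares(prod)
    return {k: p.get(k, 0.0) - t.get(k, 0.0) for k in sorted(set(t) | set(p))}


# Hypothetical samples: training data skews desktop, production skews mobile.
train_devices = ["desktop"] * 8 + ["mobile"] * 2
prod_devices = ["desktop"] * 4 + ["mobile"] * 5 + ["tablet"] * 1

gaps = share_gaps(train_devices, prod_devices)
THRESHOLD = 0.25  # assumed cutoff; pick one that fits the use case
flagged = [k for k, gap in gaps.items() if abs(gap) > THRESHOLD]

for category, gap in gaps.items():
    print(f"{category}: {gap:+.2f}")
print("flagged:", flagged)
```

A category that appears only in production, such as `tablet` here, shows up with a positive gap even though the model never saw it in training, which is the "missing groups" problem in miniature.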