stats.pearsonr measures linear association; stats.spearmanr works on ranks, so it also captures monotonic trends that are not linear. Both return a correlation coefficient and a p-value computed under the null hypothesis of no association.
Pearson vs Spearman
- Pearson — linear relationship; sensitive to outliers
- Spearman — monotonic relationship via ranks; more robust
- Report r and p-value; visualize with scatter plot
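To make the contrast above concrete, here is a small sketch on made-up data (the arrays and variable names are illustrative assumptions, not from any real dataset). It compares both coefficients on a curved but strictly increasing relationship, and on linear data with a single extreme value.

```python
import numpy as np
from scipy import stats

x = np.arange(1, 11, dtype=float)

# Case 1: strictly increasing but strongly nonlinear relationship.
y_curved = np.exp(x)
r_curved, _ = stats.pearsonr(x, y_curved)
rho_curved, _ = stats.spearmanr(x, y_curved)
print('curved  pearson r:', r_curved, 'spearman rho:', rho_curved)

# Case 2: perfectly linear data, except the first point is replaced
# by one extreme value (an illustrative outlier).
y_outlier = x.copy()
y_outlier[0] = 50.0
r_outlier, _ = stats.pearsonr(x, y_outlier)
rho_outlier, _ = stats.spearmanr(x, y_outlier)
print('outlier pearson r:', r_outlier, 'spearman rho:', rho_outlier)
```

In the first case the ranks of x and y agree exactly, so Spearman reports a perfect monotonic relationship while Pearson falls well short of 1. In the second case the single outlier is enough to turn Pearson negative, while Spearman stays positive.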
Correlation ≠ causation
High correlation does not imply that one variable causes the other. Confounders and spurious time trends are common data science interview topics.
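A confounder can be simulated in a few lines. The sketch below is illustrative only: z, a, b, the coefficients, and the residuals helper are hypothetical choices. Two variables are each driven by a shared third variable and never by each other, yet they correlate strongly. Removing the linear effect of the confounder from both makes most of that correlation disappear.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 500

z = rng.normal(size=n)               # hypothetical confounder
a = 2.0 * z + rng.normal(size=n)     # depends on z, not on b
b = -1.5 * z + rng.normal(size=n)    # depends on z, not on a

# Raw correlation between a and b looks strong.
r_ab, p_ab = stats.pearsonr(a, b)

def residuals(v, z):
    """Return v with its least-squares linear fit on z subtracted."""
    slope, intercept = np.polyfit(z, v, 1)
    return v - (slope * z + intercept)

# Correlation after controlling for z (a simple partial correlation).
r_partial, p_partial = stats.pearsonr(residuals(a, z), residuals(b, z))

print('raw r:', r_ab)
print('r after removing z:', r_partial)
```

This only works here because the confounder is known and measured; in real data the harder problem is that it often is not.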
Example
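import numpy as np
from scipy import stats

# Five paired observations; y rises almost linearly with x.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# pearsonr returns the correlation coefficient and a two-sided p-value.
r, p = stats.pearsonr(x, y)
print('pearson r:', r, 'p:', p)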
Important interview questions and answers
- Q: What does r near 0 mean?
A: Little linear association; nonlinear patterns may still exist, so plot the data.
- Q: When to use Spearman?
A: Ordinal data, rank-based relationships, or heavy outliers.
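The first answer can be checked with a minimal sketch on synthetic data (the symmetric parabola is an illustrative assumption): y is fully determined by x, yet the relationship is neither linear nor monotonic, so both coefficients come out at essentially zero.

```python
import numpy as np
from scipy import stats

# Symmetric grid around zero; y is an exact function of x.
x = np.arange(-30, 31) / 10.0
y = x ** 2

r, p = stats.pearsonr(x, y)
rho, p_rho = stats.spearmanr(x, y)

print('pearson r:', r)
print('spearman rho:', rho)
```

Spearman does not rescue this case, because a U-shape is not monotonic. Only a plot (or a model that allows curvature) reveals the dependence.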
Self-check
- Difference between Pearson and Spearman?
- What two values does pearsonr return?
Pitfall: Pearson r measures only linear fit, so inspect a scatter plot before claiming an association or ruling one out.
Interview prep
- Pearson vs Spearman?
Pearson linear; Spearman rank/monotonic—more robust.
- Causation?
Correlation does not imply causation—confounders matter.