Unicity
Privacy metric measuring the fraction of individuals uniquely identifiable given a small number of data points
Unicity is a privacy risk metric that quantifies how easily individuals can be singled out in supposedly anonymous datasets. In simple terms, unicity measures how unique your data fingerprint is: if your behavioral patterns (where you go, what you buy, what you browse) are distinct from everyone else's in the dataset, you can be identified even without your name attached.
The metric is formally defined as εₚ: the expected fraction of records that are uniquely identifiable given p randomly sampled data points. For mobility data, ε₄ ≈ 0.95—meaning four location-time pairs uniquely identify 95% of individuals. For credit card transactions, four purchases identify 90% of people. These findings, from MIT researcher Yves-Alexandre de Montjoye's landmark studies, demonstrate that high-dimensional behavioral data creates fingerprints nearly as unique as biometrics.
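This expected-fraction definition lends itself to a direct Monte Carlo estimate: repeatedly draw p points from a randomly chosen record and check whether any other record also contains all of them. A minimal sketch in Python (the `traces` structure, mapping each person to a set of hashable points such as location-time pairs, is an assumed input format for illustration):

```python
import random

def estimate_unicity(traces, p, trials=1000, seed=0):
    """Monte Carlo estimate of epsilon_p: the fraction of records
    uniquely identified by p points sampled from their own trace."""
    rng = random.Random(seed)
    records = [(uid, frozenset(t)) for uid, t in traces.items()]
    # Only records with at least p points can be sampled from.
    eligible = [r for r in records if len(r[1]) >= p]
    unique = 0
    for _ in range(trials):
        uid, trace = rng.choice(eligible)
        sample = frozenset(rng.sample(sorted(trace), p))
        # The record is "unique" if its own trace is the only one
        # containing all p sampled points.
        matches = sum(1 for _, t in records if sample <= t)
        unique += (matches == 1)
    return unique / trials
```

As a sanity check: fully disjoint traces give an estimate of 1.0, and identical traces give 0.0, matching the intuition that unicity measures how distinguishable one record's points are from everyone else's.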
The most counterintuitive finding is that dataset size provides minimal privacy protection. In a dataset of 60 million people, 93% remain uniquely identifiable with just four data points. Unicity follows a convex decay function—it drops quickly at first as population increases, then flattens dramatically. Doubling the population from 30 million to 60 million might only reduce identifiability from 95% to 93%. The common assumption that individuals are "lost in the crowd" of big data is mathematically false.
Unicity exposes the fundamental weakness of k-anonymity for high-dimensional data. K-anonymity ensures each record shares quasi-identifier values with at least k-1 others, but it assumes a fixed, known set of identifying attributes. Unicity demonstrates that in behavioral datasets, almost any random sample of attributes forms a unique combination. A dataset might satisfy k=5 anonymity while having ε₄ > 0.9: the k-anonymity guarantee is false security, because an attacker holding four external data points can still identify most individuals.
Regulatory frameworks increasingly recognize unicity-based risks. GDPR's Recital 26 "singling out" test directly addresses whether individuals can be isolated from a dataset—precisely what unicity measures. The UK ICO's "motivated intruder" test asks whether someone with reasonable resources could identify individuals; given that auxiliary data for unicity attacks is readily available through social media and public records, high-unicity datasets fail this test. The FTC has levied significant fines ($16.5 million against Avast in 2024) for selling "anonymized" browsing data that remained re-identifiable—enforcement that validates unicity research findings.
For due diligence, unicity provides actionable thresholds. Datasets with estimated ε₄ > 0.5 (50%+ identifiable with four points) should be flagged as high re-identification risk regardless of claimed anonymization. Location traces, transaction histories, browsing logs, and app usage data are presumptively high-unicity. The number of quasi-identifier columns itself is a risk factor: research shows 99.98% of Americans can be correctly re-identified using 15 demographic attributes. Any acquisition target claiming "de-identified" data while retaining 10+ attribute columns warrants mandatory unicity assessment.
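These thresholds can be encoded as a simple screening rule. The sketch below is a hypothetical helper (names and categories are assumptions, not from the source); it flags a dataset when any of the three risk factors discussed above is present:

```python
def flag_reidentification_risk(epsilon_4, attribute_columns, data_category):
    """Hypothetical due-diligence screen using the thresholds above:
    estimated epsilon_4 > 0.5, presumptively high-unicity data types,
    or 10+ retained attribute columns."""
    HIGH_UNICITY_CATEGORIES = {"location", "transactions", "browsing", "app_usage"}
    reasons = []
    if epsilon_4 is not None and epsilon_4 > 0.5:
        reasons.append(f"estimated epsilon_4 = {epsilon_4:.2f} exceeds 0.50")
    if data_category in HIGH_UNICITY_CATEGORIES:
        reasons.append(f"presumptively high-unicity category: {data_category}")
    if attribute_columns >= 10:
        reasons.append(f"{attribute_columns} attribute columns retained")
    return (len(reasons) > 0, reasons)
```

Note that a `None` unicity estimate does not clear a dataset: a "de-identified" location dataset with 12 columns is still flagged on the other two factors, reflecting the mandatory-assessment recommendation above.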
Formula
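One way to formalize the definition of εₚ given above (this formalization is a reconstruction consistent with the text, not quoted from the cited papers): for N records, where Tᵢ is the set of data points in record i and S is a uniformly random p-point subset of Tᵢ,

```latex
\varepsilon_p \;=\; \frac{1}{N} \sum_{i=1}^{N}
  \Pr_{S \subseteq T_i,\; |S| = p}
  \bigl[\, \forall j \neq i :\; S \not\subseteq T_j \,\bigr]
```

That is, εₚ is the expected fraction of records for which p randomly sampled points match no other record in the dataset.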
Sources
- de Montjoye, Y-A., et al. (2013). Unique in the Crowd: The Privacy Bounds of Human Mobility. Nature Scientific Reports.
- de Montjoye, Y-A., et al. (2015). Unique in the Shopping Mall: On the Reidentifiability of Credit Card Metadata. Science.
- Rocher, L., Hendrickx, J.M., & de Montjoye, Y-A. (2019). Estimating the Success of Re-identifications in Incomplete Datasets Using Generative Models. Nature Communications.
- FTC. (2024). FTC Cracks Down on Mass Data Collectors: Avast, X-Mode, and InMarket.