k-Anonymity
Privacy model ensuring every combination of quasi-identifiers appears at least k times
k-Anonymity is a foundational privacy model that provides formal guarantees against re-identification attacks. A dataset satisfies k-anonymity if, for every combination of quasi-identifiers (attributes that could potentially identify an individual when combined), at least k records share that same combination. This means no individual can be distinguished from at least k-1 other individuals in the dataset on the basis of quasi-identifiers alone.
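The definition translates directly into a check: group records by their quasi-identifier values and take the size of the smallest group. A minimal sketch in Python, where the record fields and values are illustrative assumptions rather than anything from a real dataset:

```python
from collections import Counter

def k_anonymity_level(records, quasi_identifiers):
    """Return the k-anonymity level of a dataset: the size of the
    smallest equivalence class over the given quasi-identifiers."""
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(classes.values())

# Hypothetical toy records for illustration
records = [
    {"zip": "02138", "age": 34, "sex": "F", "diagnosis": "flu"},
    {"zip": "02138", "age": 34, "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "age": 51, "sex": "M", "diagnosis": "flu"},
    {"zip": "02139", "age": 51, "sex": "M", "diagnosis": "diabetes"},
]

print(k_anonymity_level(records, ["zip", "age", "sex"]))  # 2
```

Here each of the two quasi-identifier combinations appears twice, so the dataset is 2-anonymous: every individual is indistinguishable from at least one other record.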
The model emerged from landmark research demonstrating that 87% of the U.S. population could be uniquely identified using only three attributes: 5-digit ZIP code, birth date, and gender (a later replication by Golle put the figure closer to 63%, still high enough to motivate the model). k-anonymity addresses this by requiring that each such combination appear in multiple records: if k=5, every combination of demographic attributes must appear at least 5 times, ensuring individuals "hide in a crowd."
Two principal techniques achieve k-anonymity. Generalization replaces specific values with broader categories—a birth date becomes just a birth year, a ZIP code is truncated to the first three digits. Suppression removes or masks records with attribute combinations too rare to anonymize without excessive information loss. The trade-off is always between privacy protection and data utility: higher k-values provide stronger protection but reduce the precision of the data.
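A minimal sketch of the two techniques working together, using the same ZIP-truncation and age-bucketing generalizations described above; the records, field names, and k value are illustrative assumptions:

```python
from collections import Counter

# Hypothetical toy records; "zip" and "age" are the quasi-identifiers.
records = [
    {"zip": "02138", "age": 34, "diagnosis": "flu"},
    {"zip": "02139", "age": 36, "diagnosis": "asthma"},
    {"zip": "02141", "age": 33, "diagnosis": "flu"},
    {"zip": "90210", "age": 71, "diagnosis": "diabetes"},
]

def generalize(record):
    """Generalization: truncate ZIP to its first 3 digits, bucket age into decades."""
    decade = record["age"] // 10 * 10
    return {
        "zip": record["zip"][:3] + "**",
        "age": f"{decade}-{decade + 9}",
        "diagnosis": record["diagnosis"],
    }

def anonymize(records, k):
    """Generalize all records, then suppress (drop) any whose generalized
    quasi-identifier combination still appears fewer than k times."""
    generalized = [generalize(r) for r in records]
    counts = Counter((g["zip"], g["age"]) for g in generalized)
    return [g for g in generalized if counts[(g["zip"], g["age"])] >= k]

result = anonymize(records, k=2)
print(result)  # three records share ("021**", "30-39"); the 90210 outlier is suppressed
```

The trade-off is visible in miniature: generalization rescues three records at the cost of precision, while the record too rare to generalize safely is suppressed outright, costing utility.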
Regulatory frameworks reference k-anonymity principles even when not explicitly mandating the technique. HIPAA's Expert Determination method recognizes k-anonymity as a valid methodology for certifying that re-identification risk is "very small." The common practical threshold in healthcare contexts is k ≥ 5, though some researchers advocate k ≥ 10-20 for sensitive datasets. CCPA requires that de-identified data "cannot reasonably" be re-identified and mandates both technical and administrative safeguards. GDPR sets the highest bar under Recital 26, requiring consideration of "all means reasonably likely" for re-identification—a standard that k-anonymity alone often cannot satisfy.
While k-anonymity provides important protections, it has well-documented limitations. The homogeneity attack exploits equivalence classes where all records share the same sensitive attribute value—if everyone in a k=5 group has the same diagnosis, learning someone is in that group reveals their condition. Background knowledge attacks allow adversaries with external information to narrow possibilities within an equivalence class. Most critically, k-anonymity fails on high-dimensional data: research on the Netflix Prize dataset showed that behavioral data with many attributes cannot be meaningfully k-anonymized, as even 8 movie ratings per user were sufficient to re-identify subscribers.
Extensions like l-diversity (requiring diverse sensitive values within each equivalence class) and t-closeness (requiring attribute distributions to match the overall dataset) address some of these gaps. Differential privacy offers an alternative approach with stronger mathematical guarantees. For due diligence purposes, k-anonymity analysis provides a quantifiable, defensible metric—but must be supplemented with assessment of dimensionality, attribute distribution, and available external linkage datasets.
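The distinct-values form of l-diversity can be measured with a small extension of the k-anonymity check; the records below are illustrative assumptions constructed to show a dataset that is 2-anonymous yet vulnerable to the homogeneity attack described above:

```python
from collections import defaultdict

def l_diversity(records, quasi_identifiers, sensitive):
    """Distinct l-diversity: the smallest number of distinct sensitive
    values found in any equivalence class."""
    classes = defaultdict(set)
    for r in records:
        classes[tuple(r[q] for q in quasi_identifiers)].add(r[sensitive])
    return min(len(values) for values in classes.values())

# A 2-anonymous dataset that is only 1-diverse: both records in the
# ("021**", "30-39") class share the same diagnosis, so membership in
# that class alone leaks the condition (the homogeneity attack).
records = [
    {"zip": "021**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "021**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "021**", "age": "40-49", "diagnosis": "flu"},
    {"zip": "021**", "age": "40-49", "diagnosis": "asthma"},
]

print(l_diversity(records, ["zip", "age"], "diagnosis"))  # 1
```

An l-diversity level of 1 flags exactly the gap k-anonymity misses: the equivalence classes are large enough, but one of them is homogeneous in its sensitive attribute.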
Formula
A dataset satisfies k-anonymity if every equivalence class E over the quasi-identifiers (the set of records sharing one combination of quasi-identifier values) satisfies |E| ≥ k. In practice, k ≥ 5 is commonly recommended.
Sources
- Sweeney, L. (2002). k-Anonymity: A Model for Protecting Privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems.
- Sweeney, L. (2000). Simple Demographics Often Identify People Uniquely. Carnegie Mellon University Data Privacy Lab.
- Narayanan, A. & Shmatikov, V. (2008). Robust De-anonymization of Large Sparse Datasets. IEEE Symposium on Security and Privacy.
- Golle, P. (2006). Revisiting the Uniqueness of Simple Demographics in the US Population. ACM Workshop on Privacy in Electronic Society.