Linkage Attack
Technique combining quasi-identifiers across datasets to re-identify individuals
A linkage attack is a privacy attack that exploits quasi-identifiers to connect records across different datasets, ultimately re-identifying individuals in supposedly anonymous data. The attack works by finding common attributes between a target dataset (with direct identifiers removed) and auxiliary information (voter rolls, social media, commercial databases, public records). When attribute combinations are sufficiently unique, an adversary can match anonymous records to real-world identities.
The mechanism is straightforward but powerful. If an anonymized health database contains records with ZIP code, gender, and birth date, and a public voter registration list contains these same attributes along with names, the attacker performs a database join on the common fields. Any record whose attribute combination is unique in both datasets becomes linked to an identity. Sweeney (2000) estimated from 1990 census data that 87% of the U.S. population can be uniquely identified by 5-digit ZIP code, gender, and full date of birth alone, so most records in a dataset that carries these three fields are exposed to linkage.
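The join described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the field names (`zip`, `gender`, `birth_date`), the function `link_records`, and all records are made up for the example, and a match is only reported when the quasi-identifier combination is unique on both sides.

```python
from collections import defaultdict

# Assumed field names for this sketch; real datasets will differ.
QUASI_IDENTIFIERS = ("zip", "gender", "birth_date")


def link_records(anonymized, auxiliary, keys=QUASI_IDENTIFIERS):
    """Join two record lists on shared quasi-identifiers.

    Returns (anonymized_record, auxiliary_record) pairs only where the
    quasi-identifier combination occurs exactly once in each list,
    i.e. where the match is unambiguous.
    """
    def index(records):
        buckets = defaultdict(list)
        for rec in records:
            buckets[tuple(rec[k] for k in keys)].append(rec)
        return buckets

    anon_index = index(anonymized)
    aux_index = index(auxiliary)

    matches = []
    for combo, anon_recs in anon_index.items():
        aux_recs = aux_index.get(combo, [])
        if len(anon_recs) == 1 and len(aux_recs) == 1:
            matches.append((anon_recs[0], aux_recs[0]))
    return matches


# Fictional data for illustration only.
health = [
    {"zip": "02138", "gender": "M", "birth_date": "1950-01-15", "diagnosis": "asthma"},
    {"zip": "02139", "gender": "F", "birth_date": "1962-03-02", "diagnosis": "diabetes"},
    {"zip": "02139", "gender": "F", "birth_date": "1962-03-02", "diagnosis": "migraine"},
]
voters = [
    {"name": "Alex Example", "zip": "02138", "gender": "M", "birth_date": "1950-01-15"},
    {"name": "Blake Example", "zip": "02139", "gender": "F", "birth_date": "1962-03-02"},
]

linked = link_records(health, voters)
for anon, aux in linked:
    print(aux["name"], "->", anon["diagnosis"])
```

Only the first health record is linked. The other two share the same quasi-identifier combination, so the attacker cannot tell which row belongs to the second voter.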
It is important to distinguish two outcomes of linkage attacks. Record linkage (re-identification) occurs when an attacker correctly matches a specific row in the anonymized dataset to an individual's identity—the classic "Row #456 is William Weld" scenario. Attribute linkage (inference) occurs when the attacker cannot identify the exact record but can infer sensitive values with high confidence because linked records share those values. If all individuals in a demographic group have the same diagnosis, knowing someone belongs to that group reveals their medical condition. Both outcomes create liability: record linkage breaches confidentiality, while attribute linkage breaches privacy.
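Attribute linkage can be illustrated with a similar sketch. The function `infer_sensitive`, the field names, and the records below are hypothetical. The code assumes the attacker knows the target's generalized quasi-identifiers and checks whether every matching row carries the same sensitive value.

```python
def infer_sensitive(anonymized, target, keys, sensitive):
    """Return the sensitive value an attacker can infer for `target`, or None.

    Inference succeeds when all anonymized records that match the target's
    quasi-identifiers carry the same sensitive value, even though the
    attacker cannot tell which row is the target's.
    """
    combo = tuple(target[k] for k in keys)
    values = {
        rec[sensitive]
        for rec in anonymized
        if tuple(rec[k] for k in keys) == combo
    }
    return next(iter(values)) if len(values) == 1 else None


# Fictional, already-generalized records.
released = [
    {"zip": "021**", "age_band": "40-49", "diagnosis": "hepatitis"},
    {"zip": "021**", "age_band": "40-49", "diagnosis": "hepatitis"},
    {"zip": "021**", "age_band": "40-49", "diagnosis": "hepatitis"},
    {"zip": "021**", "age_band": "50-59", "diagnosis": "asthma"},
    {"zip": "021**", "age_band": "50-59", "diagnosis": "diabetes"},
]
keys = ("zip", "age_band")

homogeneous_target = {"zip": "021**", "age_band": "40-49"}
mixed_target = {"zip": "021**", "age_band": "50-59"}

inferred = infer_sensitive(released, homogeneous_target, keys, "diagnosis")
not_inferred = infer_sensitive(released, mixed_target, keys, "diagnosis")
```

In this example the first target's diagnosis is revealed although three rows match, while the second target's group contains two different diagnoses and the function returns `None`.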
The canonical demonstration occurred in 1997, when Latanya Sweeney linked Massachusetts hospital records to voter registration data. The state's Group Insurance Commission had removed names and Social Security numbers from state-employee hospital records, believing the data was anonymous. Sweeney purchased the Cambridge voter rolls for $20, joined the two datasets on ZIP code, birth date, and gender, and identified Governor William Weld's medical records. In 2008, Narayanan and Shmatikov de-anonymized records in the Netflix Prize dataset by linking them to public IMDb ratings, showing that a handful of movie ratings with approximate dates can be enough to single out a subscriber.
Linkage attacks are the engine driving the mosaic effect. A single dataset might carry acceptable risk in isolation, but if an organization releases multiple datasets with overlapping attributes, attackers can combine them to create composite profiles. Liability is therefore cumulative: the risk score must account for all data an entity has released, not just individual datasets.
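The cumulative nature of this risk can be shown with a small sketch. It assumes, purely for illustration, that two releases share a persistent pseudonymous ID (the hypothetical `pid` field). The helper names and all records are made up. Each release on its own has no unique attribute combination, but the merged profile does.

```python
from collections import Counter


def smallest_group(records, keys):
    """Size of the smallest set of records sharing the same values for `keys`."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return min(counts.values())


def merge_on(left, right, key):
    """Combine records from two releases that share the same `key` value."""
    right_by_key = {r[key]: r for r in right}
    return [{**l, **right_by_key[l[key]]} for l in left if l[key] in right_by_key]


# Two fictional releases about the same four people.
release_a = [
    {"pid": 1, "zip": "02138", "birth_year": 1950},
    {"pid": 2, "zip": "02138", "birth_year": 1950},
    {"pid": 3, "zip": "02139", "birth_year": 1962},
    {"pid": 4, "zip": "02139", "birth_year": 1962},
]
release_b = [
    {"pid": 1, "gender": "M", "occupation": "teacher"},
    {"pid": 2, "gender": "F", "occupation": "nurse"},
    {"pid": 3, "gender": "M", "occupation": "teacher"},
    {"pid": 4, "gender": "F", "occupation": "nurse"},
]

size_a = smallest_group(release_a, ("zip", "birth_year"))
size_b = smallest_group(release_b, ("gender", "occupation"))

combined = merge_on(release_a, release_b, "pid")
size_combined = smallest_group(
    combined, ("zip", "birth_year", "gender", "occupation")
)
print(size_a, size_b, size_combined)
```

Every person hides in a group of two within each release, yet the combined profile is unique for all four. A per-dataset risk assessment would miss this.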
Defenses include generalization (replacing specific values with broader categories), suppression (removing records with unique combinations), and noise addition. K-anonymity formalizes protection by requiring that every combination of quasi-identifier values present in the released data is shared by at least k records. These defenses trade privacy against data utility, and high-dimensional datasets with many attributes often cannot be adequately protected without destroying analytical value.
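A k-anonymity check and a simple generalization step might look like the following sketch. The generalization rules (truncating ZIP codes to three digits, keeping only the birth year) are arbitrary choices made for the example, and the names `is_k_anonymous` and `generalize` are hypothetical.

```python
from collections import Counter


def is_k_anonymous(records, quasi_ids, k):
    """True if every quasi-identifier combination present occurs at least k times."""
    counts = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return all(n >= k for n in counts.values())


def generalize(record):
    """Coarsen ZIP code to a 3-digit prefix and birth date to the year only."""
    out = dict(record)
    out["zip"] = record["zip"][:3] + "**"
    out["birth_date"] = record["birth_date"][:4]
    return out


# Fictional records.
raw = [
    {"zip": "02138", "birth_date": "1950-01-15", "gender": "M"},
    {"zip": "02139", "birth_date": "1950-07-30", "gender": "M"},
    {"zip": "02141", "birth_date": "1962-03-02", "gender": "F"},
    {"zip": "02142", "birth_date": "1962-11-19", "gender": "F"},
]
quasi_ids = ("zip", "birth_date", "gender")

coarse = [generalize(r) for r in raw]

raw_ok = is_k_anonymous(raw, quasi_ids, k=2)
coarse_ok = is_k_anonymous(coarse, quasi_ids, k=2)
coarse_ok_3 = is_k_anonymous(coarse, quasi_ids, k=3)
```

The raw records fail the check for k = 2 because every row is unique. After generalization they pass for k = 2 but not for k = 3, at the cost of losing exact ZIP codes and birth dates, which is the utility trade-off noted above.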
Sources
- Sweeney, L. (1997). Weaving Technology and Policy Together to Maintain Confidentiality. Journal of Law, Medicine & Ethics.
- Narayanan, A. & Shmatikov, V. (2008). Robust De-anonymization of Large Sparse Datasets. IEEE Symposium on Security and Privacy.
- Sweeney, L. (2000). Simple Demographics Often Identify People Uniquely. Carnegie Mellon University Data Privacy Lab.