Re-identification Probability

Re-identification probability quantifies the risk that a specific individual can be identified from a dataset that has been de-identified or anonymized. This metric is crucial for assessing whether data protection measures are adequate, for calculating potential regulatory exposure, and for informing M&A due diligence valuations.

The probability depends on several factors: the uniqueness of quasi-identifier combinations, the availability of auxiliary datasets for linkage, the population size, dataset dimensionality, and the adversary's resources and motivation. Research has quantified these risks. The foundational statistic, from Latanya Sweeney's analysis of 1990 census data, is that 87% of the U.S. population can be uniquely identified using only three data points: five-digit ZIP code, date of birth, and gender. For behavioral data, the numbers are even more sobering: in a study of credit card records, four transactions were enough to uniquely identify 90% of individuals, and in the Netflix Prize dataset, eight movie ratings (with dates known only to within 14 days) were enough to identify 99% of records.
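The uniqueness of a quasi-identifier combination can be measured directly by counting how many records share each combination. The following is a minimal sketch, assuming each record is a tuple of (ZIP code, date of birth, gender); the function name `unique_fraction` and the toy records are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical toy records: (zip_code, date_of_birth, gender).
records = [
    ("02138", "1961-07-31", "F"),
    ("02138", "1961-07-31", "F"),
    ("02138", "1975-03-12", "M"),
    ("02139", "1975-03-12", "M"),
    ("02139", "1988-11-02", "F"),
]

def unique_fraction(rows):
    """Share of rows whose quasi-identifier combination appears exactly once."""
    if not rows:
        return 0.0
    counts = Counter(rows)
    unique_rows = sum(1 for row in rows if counts[row] == 1)
    return unique_rows / len(rows)

print(unique_fraction(records))  # 3 of 5 rows are unique -> 0.6
```

A record that is unique within the dataset is not automatically unique within the wider population, so this figure is best read as an upper bound on uniqueness-driven risk for a sample.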

The concept of "unicity" formalizes this measurement: it is the share of individuals in a dataset who are singled out by a small number of known data points (for example, four transactions). Higher unicity means higher re-identification probability. High-dimensional datasets with many attributes (transaction logs, browsing histories, location traces) typically have very high unicity, making them inherently high-risk regardless of whether direct identifiers have been removed.
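One way to estimate this empirically is to pick a person at random, reveal a fixed number p of their data points, and check whether anyone else in the dataset also matches those points. The sketch below does this by Monte Carlo sampling. The function name `estimate_unicity`, the `traces` layout (person id mapped to a set of points), and the toy data are all assumptions made for illustration.

```python
import random

def estimate_unicity(traces, p, trials=200, seed=0):
    """Estimate the share of individuals singled out by p of their own points.

    traces: dict mapping a person id to a set of hashable, sortable data points.
    """
    rng = random.Random(seed)
    # Only people with at least p points can have p points revealed.
    eligible = [pid for pid, pts in traces.items() if len(pts) >= p]
    if not eligible:
        return 0.0
    singled_out = 0
    for _ in range(trials):
        target = rng.choice(eligible)
        known = set(rng.sample(sorted(traces[target]), p))
        # Everyone whose trace contains all of the revealed points.
        matches = [pid for pid, pts in traces.items() if known <= pts]
        if len(matches) == 1:
            singled_out += 1
    return singled_out / trials

# Hypothetical toy traces: (day, place) points per person.
traces = {
    "a": {("mon", "cafe"), ("tue", "gym"), ("wed", "shop")},
    "b": {("mon", "cafe"), ("tue", "gym"), ("thu", "bar")},
    "c": {("mon", "cafe"), ("fri", "park"), ("sat", "shop")},
}

print(estimate_unicity(traces, p=1))
print(estimate_unicity(traces, p=3))  # every full trace is distinct -> 1.0
```

In this toy data, revealing all three points always singles the person out, while a single point such as ("mon", "cafe") matches everyone, which is why the estimate rises with p.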

Regulatory frameworks increasingly demand quantified re-identification assessment. HIPAA's Expert Determination method requires a qualified expert to certify that re-identification risk is "very small." The rule sets no numeric threshold, but k-anonymity analysis with k ≥ 5 is commonly used to meet this standard. GDPR sets a higher bar under Recital 26, requiring consideration of "all means reasonably likely to be used" for identification, including available technology and its expected development. CCPA requires that de-identified data "cannot reasonably" be linked to consumers and mandates both technical and administrative safeguards.
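A k-anonymity check of the kind mentioned above amounts to grouping records by their quasi-identifier values and finding the smallest group. The sketch below assumes records are dictionaries and that the quasi-identifiers have already been generalized (for example, three-digit ZIP and age bands). The column names, the helper `k_anonymity`, and the threshold constant are illustrative assumptions, not a regulatory prescription.

```python
from collections import Counter

K_THRESHOLD = 5  # commonly used threshold; an assumption, not a legal requirement

def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest equivalence class (the dataset's k)."""
    classes = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(classes.values()) if classes else 0

# Hypothetical toy data: one class of five records and one class of two.
rows = (
    [{"zip3": "021", "age_band": "30-39", "sex": "F"}] * 5
    + [{"zip3": "021", "age_band": "40-49", "sex": "M"}] * 2
)

k = k_anonymity(rows, ["zip3", "age_band", "sex"])
print(k)                 # smallest class has 2 records -> 2
print(k >= K_THRESHOLD)  # -> False
```

Here the two-record class pulls k down to 2, so the dataset would need further generalization or suppression before it met a k ≥ 5 target.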

Even datasets that comply with Safe Harbor standards face residual risk. Research on hospital discharge data redacted to the HIPAA Safe Harbor standard demonstrated re-identification rates of 3.2% (Maine) to 10.6% (Vermont) using only name matching against newspaper stories. High-profile patients with newsworthy conditions proved especially vulnerable. This illustrates that regulatory compliance sets a floor, not a ceiling, for re-identification protection.

The financial stakes of underestimating re-identification probability are substantial. Even a 1% probability translates to an expected 10,000 identifiable individuals in a million-record dataset, each representing potential regulatory fines, litigation exposure, and reputational damage. Liability Quant's scoring methodology incorporates re-identification probability as a key component of the Asset Toxicity Score, weighting quasi-identifier density, dataset dimensionality, and available linkage datasets to produce a quantified risk metric.
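The arithmetic behind the 10,000 figure is simply the record count multiplied by the per-record probability. The sketch below makes that explicit, assuming the same probability applies to every record; the function name `expected_identifiable` is hypothetical, and it does not represent how the Asset Toxicity Score is computed.

```python
def expected_identifiable(record_count, probability):
    """Expected number of identifiable individuals, assuming a uniform per-record probability."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return round(record_count * probability)

print(expected_identifiable(1_000_000, 0.01))  # -> 10000
```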

Formula

Related Terms

Related Regulations

Sources
