L-Diversity
Privacy model requiring at least l distinct sensitive attribute values per equivalence class
L-diversity is a privacy model that extends k-anonymity to protect against attribute disclosure—the revelation of sensitive information (such as a medical diagnosis or salary) even when an individual's identity remains hidden. While k-anonymity asks "can you figure out WHO someone is?", l-diversity asks "even if you can't identify them, can you figure out WHAT they have?"
The core insight: a dataset may technically satisfy k-anonymity while still exposing sensitive attributes for every record. Consider a k=5 anonymous medical dataset where every equivalence class (group sharing the same quasi-identifier values) contains only "cancer" patients. An attacker who narrows someone to that group learns their diagnosis with certainty, even without identifying which specific record is theirs. This is the homogeneity attack that l-diversity prevents.
L-diversity requires that each equivalence class contains at least l distinct values for the sensitive attribute. If l=3, any group of records sharing quasi-identifier values must have at least 3 different diagnoses, salaries, or other sensitive values. This directly prevents the homogeneity attack by ensuring multiple possible sensitive values exist within each group.
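The distinct l-diversity check described above can be sketched in a few lines. This is a minimal illustration, not a production anonymization tool; the record layout, field names (`zip`, `age`, `diagnosis`), and helper name are hypothetical.

```python
from collections import defaultdict

def satisfies_distinct_l_diversity(records, quasi_ids, sensitive, l=3):
    """Check distinct l-diversity: every equivalence class (records sharing
    the same quasi-identifier values) must contain at least l distinct
    values of the sensitive attribute."""
    classes = defaultdict(set)
    for r in records:
        key = tuple(r[q] for q in quasi_ids)  # quasi-identifier signature
        classes[key].add(r[sensitive])
    return all(len(values) >= l for values in classes.values())

# Toy dataset: one equivalence class with only 2 distinct diagnoses
records = [
    {"zip": "02139", "age": "30-40", "diagnosis": "flu"},
    {"zip": "02139", "age": "30-40", "diagnosis": "flu"},
    {"zip": "02139", "age": "30-40", "diagnosis": "cancer"},
]
print(satisfies_distinct_l_diversity(records, ["zip", "age"], "diagnosis", l=3))  # False
print(satisfies_distinct_l_diversity(records, ["zip", "age"], "diagnosis", l=2))  # True
```

Note that a dataset can pass this check for l=2 while remaining vulnerable to the attacks discussed below, which is what motivates the stronger variants.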
Three variants of l-diversity address different attack scenarios. Distinct l-diversity (simplest) requires at least l distinct sensitive values per class. Entropy l-diversity (strongest) uses information-theoretic entropy to ensure no single value dominates the distribution, requiring the entropy of the sensitive-value distribution in each class to be at least log(l). Recursive (c,l)-diversity bounds skew directly: after sorting value counts in descending order, the most frequent value's count must be less than c times the combined count of the l-th through least-frequent values.
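The entropy variant can be illustrated concretely. The sketch below (function name and sample data are hypothetical) computes the Shannon entropy of one equivalence class's sensitive values and compares it against log(l); a class dominated by a single value fails even when it has l distinct values.

```python
import math
from collections import Counter

def entropy_l_diversity(sensitive_values, l):
    """Entropy l-diversity for one equivalence class: the Shannon entropy
    of the sensitive-value distribution must be >= log(l).
    Natural log is used; any consistent base works."""
    counts = Counter(sensitive_values)
    n = len(sensitive_values)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy >= math.log(l)

# Skewed class: 3 distinct values, but one dominates (80% "cancer")
skewed = ["cancer"] * 8 + ["flu", "asthma"]
# Balanced class: 4 distinct values, uniformly distributed
balanced = ["cancer", "flu", "asthma", "diabetes"] * 2

print(entropy_l_diversity(skewed, 3))    # False: entropy below log(3)
print(entropy_l_diversity(balanced, 3))  # True
```

The skewed class satisfies distinct 3-diversity yet fails the entropy test, showing why entropy l-diversity is the stricter requirement.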
L-diversity has limitations. The skewness attack exploits cases where an equivalence class has a dramatically different attribute distribution than the overall population: if a group is 50% heart disease patients while the population rate is 1%, an attacker learns valuable information even when l=2 is satisfied. The similarity attack exploits semantic similarity: if the l distinct values are, say, "lung cancer," "liver cancer," and "stomach cancer," an attacker still learns that the victim has cancer.
T-closeness addresses these limitations by requiring that sensitive attribute distributions within each equivalence class be "close" to the overall dataset distribution, measured by Earth Mover's Distance. Together, k-anonymity, l-diversity, and t-closeness form a hierarchy of syntactic privacy protections.
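The t-closeness distance can be sketched for the categorical case. When every pair of categories is assigned equal ground distance, Earth Mover's Distance reduces to total variation distance (half the L1 difference between the two distributions); ordered or numeric attributes need the full EMD formulation. The function name and sample data below are illustrative.

```python
from collections import Counter

def t_closeness_distance(class_values, population_values):
    """Distance between one equivalence class's sensitive-value distribution
    and the overall population's, for a categorical attribute with equal
    ground distance between categories (EMD == total variation distance)."""
    p = Counter(class_values)
    q = Counter(population_values)
    categories = set(p) | set(q)
    n_class, n_pop = len(class_values), len(population_values)
    return 0.5 * sum(abs(p[c] / n_class - q[c] / n_pop) for c in categories)

population = ["heart disease"] * 1 + ["healthy"] * 99   # 1% heart disease overall
skewed_class = ["heart disease"] * 5 + ["healthy"] * 5  # 50% in this class

# A large distance flags the skewness attack scenario from above
print(t_closeness_distance(skewed_class, population))
```

A dataset satisfies t-closeness when this distance is at most t for every equivalence class; the skewed class above is far from the population distribution, exactly the case l-diversity alone misses.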
For liability quantification, datasets satisfying k-anonymity but failing l-diversity tests carry latent attribute disclosure liability. Due diligence red flags include claimed anonymization without l-diversity analysis, medical/financial data with l < 3, and no documentation of anonymization methodology.
Formula
For every equivalence class E: |{distinct sensitive values in E}| ≥ l, with l ≥ 3 commonly recommended for sensitive data.