Part 4: Standard Ways to Process Datasets with QI Values
K-anonymity: This approach is quite different from the one described earlier. With k-anonymity, the aim is not to hide data outright but to softly mask the QI values. The two most popular techniques used in k-anonymity are purging and generalization. Purging replaces a QI value with a fixed placeholder such as ‘-’ (similar to suppression). Generalization does not remove a QI value completely but replaces an exact value with a broader one, such as a range instead of a set number (e.g. 20-30 years old). The main goal of k-anonymity is to guarantee that no query on the QI values of a dataset can narrow a group down to fewer than ‘k’ individuals. Strictly speaking, k-anonymity ensures that every equivalence group of a dataset has at least ‘k’ records, where an equivalence group is the subset of records that share the same combination of QI values. For instance, in a 3-anonymous dataset, any query a potential attacker performs on the QI values matches at least 3 individuals, who cannot be distinguished from one another based on those values.
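To make the definition concrete, here is a minimal sketch in Python. The column names (age, zip, diagnosis), the sample records and the helper names generalize_age and k_of are all hypothetical and chosen only for illustration. It generalizes two QI columns and then reports the size of the smallest equivalence group, which is the ‘k’ the dataset actually achieves.

```python
from collections import Counter

def generalize_age(age, width=10):
    """Replace an exact age with a range such as '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def k_of(records, qi_columns):
    """Return the size of the smallest equivalence group over the QI columns."""
    groups = Counter(tuple(r[c] for c in qi_columns) for r in records)
    return min(groups.values())

# Hypothetical sample data: age and zip are the QIs, diagnosis is sensitive.
records = [
    {"age": 23, "zip": "10001", "diagnosis": "flu"},
    {"age": 27, "zip": "10002", "diagnosis": "asthma"},
    {"age": 29, "zip": "10003", "diagnosis": "flu"},
    {"age": 34, "zip": "10001", "diagnosis": "diabetes"},
    {"age": 36, "zip": "10004", "diagnosis": "flu"},
    {"age": 38, "zip": "10002", "diagnosis": "asthma"},
]

# Generalize age into 10-year ranges and mask the last two zip digits.
masked = [
    {**r, "age": generalize_age(r["age"]), "zip": r["zip"][:3] + "**"}
    for r in records
]

print(k_of(records, ["age", "zip"]))  # 1: every raw record is unique
print(k_of(masked, ["age", "zip"]))   # 3: the masked data is 3-anonymous
```

In this toy example the raw records are all unique on (age, zip), while the masked version falls into two equivalence groups of three records each.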
l-diversity: Unfortunately, a k-anonymous dataset may still be subject to attacks, usually because the sensitive values inside an equivalence group lack diversity. The extreme case is when all records of an equivalence group carry the same sensitive value: an attacker who can place a person in that group learns the value without having to single out the exact record. l-diversity makes sure that each of the possible equivalence groups contains at least ‘l’ sufficiently different values of the sensitive attribute.
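The check can be sketched as follows. This is the simplest variant, often called distinct l-diversity, which only counts how many different sensitive values each equivalence group contains. The records, column names and the helper distinct_l are hypothetical.

```python
from collections import defaultdict

def distinct_l(records, qi_columns, sensitive):
    """Return the smallest number of distinct sensitive values in any equivalence group."""
    groups = defaultdict(set)
    for r in records:
        groups[tuple(r[c] for c in qi_columns)].add(r[sensitive])
    return min(len(values) for values in groups.values())

# Hypothetical 3-anonymous data: two equivalence groups of three records each.
records = [
    {"age": "20-29", "zip": "100**", "diagnosis": "flu"},
    {"age": "20-29", "zip": "100**", "diagnosis": "flu"},
    {"age": "20-29", "zip": "100**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "100**", "diagnosis": "diabetes"},
    {"age": "30-39", "zip": "100**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "100**", "diagnosis": "asthma"},
]

print(distinct_l(records, ["age", "zip"], "diagnosis"))  # 1
```

The dataset is 3-anonymous, yet the first group is only 1-diverse: anyone known to be aged 20-29 in that zip area must have the flu.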
T-closeness: Purging and generalization leave each equivalence group with its own distribution of sensitive values. t-closeness requires that the distribution in every equivalence group stay close to the distribution in the whole dataset. Specifically, the distance between the two should not be bigger than the pre-specified value ‘t’. Earth mover’s distance is used to measure the distance between the distributions.
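As a rough illustration, the sketch below computes the largest distance between any equivalence group and the whole dataset. It assumes a categorical sensitive attribute in which all values are equally far apart; under that assumption the Earth mover’s distance reduces to half the sum of absolute differences between the two distributions. Numerical or hierarchical attributes need a different ground distance. The data and the helper names are hypothetical.

```python
from collections import Counter, defaultdict

def distribution(values, domain):
    """Relative frequency of each domain value in the given list."""
    counts = Counter(values)
    return [counts[v] / len(values) for v in domain]

def emd_equal_distance(p, q):
    """EMD between two categorical distributions when all values are equally distant."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def max_group_distance(records, qi_columns, sensitive):
    """Largest distance between a group's distribution and the overall one."""
    domain = sorted({r[sensitive] for r in records})
    overall = distribution([r[sensitive] for r in records], domain)
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[c] for c in qi_columns)].append(r[sensitive])
    return max(
        emd_equal_distance(distribution(values, domain), overall)
        for values in groups.values()
    )

# Hypothetical generalized data with two equivalence groups.
records = [
    {"age": "20-29", "zip": "100**", "diagnosis": "flu"},
    {"age": "20-29", "zip": "100**", "diagnosis": "flu"},
    {"age": "20-29", "zip": "100**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "100**", "diagnosis": "diabetes"},
    {"age": "30-39", "zip": "100**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "100**", "diagnosis": "asthma"},
]

t = 0.2  # illustrative threshold, not a recommended value
distance = max_group_distance(records, ["age", "zip"], "diagnosis")
print(round(distance, 3), distance <= t)
```

Here the worst group is about 0.333 away from the overall distribution, so the dataset does not satisfy t-closeness for t = 0.2.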
Enforcing the rules defined by k-anonymity, l-diversity and t-closeness together can lead to complex combinatorial problems. This is where machine learning techniques have become quite useful, because they can operate on data in separate hyperplanes and perform the computations there, which is a very complex task with the approaches described earlier.
This is the fourth part of a five-part series about machine learning methodologies for de-identifying and securing personal data by 1touch.io.