There is no such thing as safe, rather just “safe-enough”. We can only make it increasingly harder for an attacker to be able to re-identify the dataset.

For instance, let’s assume that we have a bicycle we want to “secure”. We could buy a lock and lock it to a lamp post outside our door. Is that safe? Maybe. Where that post is, how long the bike is going to be locked there, how expensive-looking the bike is, and the lock’s strength, will all affect how “safe” we can say our bike is. If we are locking the bike outside of a coffee shop for a couple of hours, then we might not need a massive chain. But if we are leaving the bike overnight in a city like New York, we might need multiple massive chains to lock each wheel and the frame. And even so, an attacker with sufficient time and equipment may be able to steal it.

Similarly, with datasets, how safe the data is, depends on multiple factors, from the data itself to the methods we can use to secure it.

Why publish a dataset?

Given the above, it’s easy to assume that the only way to truly defend a dataset is to not publish it. However, sometimes we do need to publish or share a dataset.

Threat Modeling^{1}

Once the decision to share the data is made, it’s important to consider why we are doing it—i.e. how it is going to be legitimately used— and how we think attackers are going to attempt to exploit certain information in the dataset. With this information, we can focus our own efforts on making illegitimate uses more cumbersome, while trying to keep legitimate uses of the data as productive and fruitful as possible.

What do we mean by defending a dataset?

There are different threats that people use de-identification to defend against. However, in general, we consider two things to guard against; specifically, an attacker should not be able to respond to these questions:

- Which person (or persons) does each specific part (entry, group of rows, column, etc.) represent in our dataset?
- Is a given person, X, in the dataset?
^{2}(regardless of which row they may correspond to) - In some cases, we consider the converse as well: Is a given person X not in the dataset?
^{3}

How do we defend a dataset?

There are two main approaches we can use:

- Defend the data itself
- Create specific “safe” methods to use on the data, and only access the data with those methods (rather than the full data)

Defense #1: Defend the data itself

Traditional^{4} approaches try to defend the data as much as
possible by removing, changing, or grouping information that could be used to identify individuals.

Removing Data

Removing information entails redacting or censoring specific information that is considered sensitive or potentially risky for re-identification; this approach can vary in sophistication. Consider the three following examples (from simple to more sophisticated):

- Removing all mentions of a specific word
- A rule-based approach of removing all names and addresses
- Removing a subset of different categories of information, either randomly or following some count-metrics

Changing Data

The complete removal of data may hinder the legitimate use of the dataset too much. Sometimes, it’s better to have somewhat inaccurate data than no data at all; another approach is to slightly modify these pieces of information when possible, e.g. move dates, round numeric values, pseudonymize names, etc. The challenge in this approach is finding the way to make the data inaccurate-enough as to deter illegitimate uses, but accurate enough that legitimate uses are still useful.

Grouping Data

In some cases, maintaining the dataset’s original level of granularity may not be necessary for the original legitimate purpose. In these cases, it is almost always better to group and aggregate the data, reducing the level of granularity of the data; and with it, reducing the amount of information that could be used for illegitimate purposes.

In a grouped dataset, each entry will represent a group of entries from the original dataset that share some common characteristics: e.g. ages between 10-20, living within the zip code 35254 (or in a zip code that starts with 352**), etc. Information in that bucket (e.g. survey responses) can be then aggregated into averages (e.g. mean, median), ranges (e.g. max, min, percentiles), counts, sums, and many other aggregations that may prove useful for legitimate purposes.

As with any other approach, grouped data is also nuanced: some aggregations may remove too much information, while others might not protect it enough. We must find which information to aggregate and how much we should aggregate it. The following questions are the main points we should ask ourselves when considering grouping:

- Are we aggregating enough?
- How many candidates for a group are there in the general population?
- Are the people within each group different enough?

Let’s go over each of these questions one at a time:

A. Are we aggregating enough?

If there is a relatively small number of people in a group, if an attacker knows which group someone belongs to they might be able to identify them.

For example, let us consider a dataset in which we have flight information aggregated by origin and destination in a single day (e.g. all the Boston-Chicago flights are aggregated together). If there are only one or two trips from Salt Lake City to Birmingham a day, then an attacker might gain too much information about these trips.

This concept is known as k-anonymity^{5}. A 5-anonymous dataset
is one where in each group there are at least 5
entries of the original dataset.

B. How many candidates for a group are there in the general population?

In some cases, there may be a relatively small number of people in the “universe”^{6} (not just within our dataset) that could belong to a specific group.
This might prove dangerous in some cases
and we may need to guard against this scenario.

For example, consider a voting registry for the Democratic primaries in which we aggregate by highest education level and age group. If we know there are only 2 people with a master’s degree in a particular zip code, if we look at the aggregated dataset for that particular zip code and age group we can get a good idea whether these people are registered Democrats or not.

Defending against the number of potential candidates in the outside universe to each of our groups is
a concept known as k-map^{7}. Similarly, delta-presence^{8} is when both k-map and k-anonymity are combined, and we
control for each of our groups the proportion of people from the total universe in our dataset.

C. Are the people within the bucket different enough?

If all the people in a specific bucket are too similar in a specific category, then an attacker who knows that an individual is in that bucket might gain too much information.

For example, let’s say we ask each of the students in a class to give their teacher a 1-10 score, and we aggregate the dataset by giving the average of each row of students in the class. If any of the rows has an average value of 10, then we know how all the students in that row feel about the teacher.

This concept is known as l-diversity. A dataset is 3-diverse if in each group there are at least 3 distinct values from the original dataset.

In a more nuanced example, imagine we also asked those students if there are any diabetics in their family, and among the entire class 30% have diabetics in their family. If we observe that 80% of students in row 5 have diabetics in their family, then an insurance company may want to charge students in this row a higher premium.

Although this is a very contrived example, a party might make use of uneven proportions in voting to
redefine the district boundaries and “gerrymander”^{9} a
district.

This concept is known as t-closeness. It is a more nuanced concept than l-diversity because we also care about the difference between the distributions of particular groups and the overall dataset, i.e. is this group significantly different from the general population?

Side Note: De-identification as Reducing the Dimensions
of the Data

If we take a moment to pause and think about what we are doing with the data, we could conclude that what we are doing is reducing the dimensions of the data. It may sound odd or too complex, but consider the following:

Our original dataset is rich with different types of information about each person (or transaction, etc.). If we take a typical demographic dataset as our example, we might be providing information about where the person lives, their age, workplace, income, the car they drive, etc. The more of these different “dimensions”, the more distinct each person is going to be; there might be not too many blond 34-year olds living in Plymouth, MA, driving a Toyota Prius, but I’m sure there are a lot more 34-year olds living in Plymouth.

The same happens geometrically. If we randomly pick four numbers between 1-8, we would consider it very unlikely to find a point to be at least 3 in distance from (greater or smaller than) its closest neighbor. But if we add more dimensions, and we pick 16 (because it’s 4 squared) random squares in a chessboard (8 by 8), it’s much more likely that we find at least one of these squares at least 3 squares away from its closest neighbor. If we add yet another dimension, say an office building with 8 floors and each floor with an 8 by 8 cubicles grid, and we distribute randomly our 64 (4 cubed) hires among the building, it will be even more unlikely to find two of them sitting close enough to whisper to each other… and so on.

This means the more dimensions we have, the easier it is to find “unique points” (e.g. points that are not similar to any of the others) and in fact, every point tends to become more “unique”; making our attacker’s task much much easier.

So, our task of defending data is to geometrically make the points closer to each other i.e. making them less distinguishable from each other. The strategies we just talked about also have their geometric analogs:

- Removing data is almost like projecting them into a specific plane and removing one of their dimensions, like using a bird's shadows on the ground rather than their position in the sky: it seems like the two birds are really close to each other, but they might just be flying at very different altitudes.
- Changing the data is like moving a point; the strategy would be to move points closer together, so they form a tighter group.
- Finally, aggregating data is like creating opaque rectangles, prisms, or forms that group a bunch of points together. We can’t see inside the rectangles and we don’t know how many points are inside.

Defense #2: Creating safe processes to use the data: Differential Privacy

As we’ve seen, defending the data itself is hard, because it’s very hard (or impossible) to come up with all the illegitimate ways an attacker may try to use the data for, and we never know what attackers may already know or what information they might gain in the future; think of it as trying to defend against the unknown.

There is a different approach to de-identification: Differential Privacy.

An Introduction to Differential Privacy

In Differential Privacy, we forget about trying to defend
the data and focus on creating “safe processes” for people to make use of data, and we only let
them use these processes. A process^{10} can be
any way of making use of the data: a query, statistic (like the sum, mean, 95th-percentile, etc), or
the training of a model on such data (outputting only the model parameters).

A process will be considered to be differentially-private if, after using it, an attacker doesn’t have a significantly better idea of whether a subject is within the dataset or not.

The interesting thing about DP is that we can guarantee the DP-ness by looking at the process itself, and remove some of these hypothetical unknowns. How do we do this you say? Let’s dive in.

We will construct two datasets: one -\( D \)- with all our subjects, and one -\( D^* \)- without one
of our subjects, x. Now, if the result of applying the process, \( f \), on one of the datasets - \(
f(D) \) - is sufficiently similar to that of applying the
mechanism on the other dataset -\( f(D*) \)-, then we can guarantee^{11} DP-ness!
In other words, since the results of \( f(D) \) and \( f(D*) \)
are too similar, given the output obtained an attacker can’t be sure if it comes from \( D \) or
from \( D^* \), and therefore they can’t be sure if x is in the dataset.

This means that by simply running a few computations of this process, we can determine if it is “safe-enough” or not.

Differential Privacy in Practice

We have only vaguely mentioned some key concepts, like “sufficiently similar”, but before we go into those let’s give a less abstract example. In practice, Differential Privacy is usually obtained by adding a particular kind of noise to the result so as to never give out the real value, but give something close-enough instead. The particularity of this random noise is that it has a mean of zero, so that in the long run, the errors balance out. This means that although each of our answers is slightly off, in the aggregate we are still spot on (in other words, we have not introduced any bias, just some variance). This allows our users to still make accurate predictions and useful inferences by using the processes we have defined.

One weakness to this approach is that an attacker could run the same process multiple times and average the results out, thereby obtaining a good estimation of the “true answer”. For this reason, the notion of a privacy-budget was introduced; each user of the process is assigned a specific budget, and as they make use of the process, their budget is reduced. The process could potentially add more noise based on the budget left, and at some point restrict the access entirely when the budget has been depleted. However, managing this privacy-budget is complicated, and can also be defeated by coordinated attackers using multiple budgets.

The Devil Is In The Details^{12}

A. Not Having “A Sufficiently Better Idea” Of Whether A Subject Is Within A Dataset

By this we mean that the results have not greatly influenced the attacker’s initial hypothesis of whether or not a specific subject is, or isn’t, within the dataset.

We will consider that knowing a subject is within a dataset is 100%, knowing the subject is not
within the dataset is 0%, and not having any clue whether the subject is or isn’t is 50%. Let’s
assume prior to running the process an attacker
has a slight belief that subject x is within the dataset, i.e. the attacker starts with 55%
confidence of subject x being in the dataset. If the process is \( \epsilon
\)-differentially-private, then
the attacker must, at most, be \( e^\epsilon \)
* 0.55 confident after observing the result of the process. As such, \( \epsilon \) limits how much
the
attacker’s belief can improve after seeing the results^{13}.

B. Sufficiently Similar

One of the first concepts we mentioned is the definition of \( f(D) \) and \( f(D^*) \) being “sufficiently similar”. The first consequence of this is that f is now forced to be non-deterministic, i.e. if I run the process two times I might get two different answers, like rolling a die. If the process did not have some randomness to it, i.e. if it was completely deterministic, then it would be easy to determine whether \( f(?) \) came from \( D \) or \( D^* \) .

How do we compare two non-deterministic processes? By looking at the probabilities of the possible outcomes and comparing these. Let’s say we want to compare two coins: one is a normal heads-tails coin, the other has heads on both sides. Now let’s say I have one of these two coins and you have to figure out which one it is, and I can only tell you what the result of the flips are. If I flip the coin and get tails, then you know it’s the normal coin, but if I were to get heads, then you can’t be sure which one it is. However, if I flip the coin 3 times and they all come heads, then you are slightly more certain that I am using the fake coin. The more flips I tell you, the more probabilistically different the coins will be.

Now imagine I have two 100-sided dice: one has numbers 0-99 and the other has numbers 1-100. Even if
I roll one die 10 times and tell you the results, you would not be able to say which of the two I
used unless I rolled a 0 or 100, which
is quite unlikely (less than 10%^{14}). If we decide 10% is
sufficient, then we can say that these two dice are sufficiently similar.

In practice, we can compute the overlap between the probability distribution of each, and determine whether they are sufficiently close or not. However, this comes with its own nuances as well. Determining the probability distribution for a well-known, simple, mechanical process such as a die roll or a coin flip is straightforward, but when we try to come up with the distribution of obtaining a mean sentiment of a series of comments for a product, this can prove to be a bit harder and more subjective. For such cases, we use specific probability function families, based on our knowledge of the process and its context, and try to adjust the parameters as much as possible using data from other populations or even our own observed subsets. This is not impractical or wrong, but it showcases the nuances, and how even a mathematical approach can include subjective decision-making.

C. Running A "Few" Computations

In order to determine if a specific process, f, is differentially private, we need to make sure that given a dataset D, for any single-record modification (adding or removing) of this dataset (which we will call \( D^* \) ), \( f(D) \) and \( f(D^*) \) have to be sufficiently similar (which we already determined in the above point B to be nuanced)

Even when the probability distribution is fairly straightforward, coming up with all the potential single-record modifications that can be performed for the given dataset is impractical, if not downright infeasible. In order to tackle this issue, we can also make statistical assumptions about the type of data we could add, model the additions and subtractions, and calculate the bounds of impact from these changes. However again, this introduces potential biases and the subjective interpretations of the analyst into the mathematical formulation.

D. Managing the Privacy Budget

The privacy-budget is a quantity that helps limit how much information a user can learn by using the differentially-private process, without compromising the guarantees of differential privacy.

This privacy-budget can be implemented in multiple ways, such as a fixed-price for each process run, or even dynamically priced based on the specific query or results, depending on how potentially dangerous they could be. Maintaining and controlling the privacy-budget, so that attackers that share information or coordinate queries don’t side-channel the safeguard, is usually one of the biggest challenges of the main implementations of differential privacy.

Footnotes

1. This concept is well known and practiced in the security space, for more information see: An introduction to approachable threat modeling – Increment: Security ↩

2. The fact of a person being within a particular might already reveal private information about them, e.g. Democratic Primary voters, clinical trial for a HIV treatment, etc. ↩

3. If a given person is not in the Democratice Primary voters it might indicate they are Republican (or Independent). Not belonging to a health-related datasets might indicate more sensitive information of an individual not included within the dataset. ↩

4. Although traditional, these are in no means outdated. Some of these techniques are currently in use and improved/researched on. ↩

5. For a formal definition of k-anonymity refer to: Latanya Sweeney's article ↩

6. Being able to calculate, or even estimate this total universe is one of the major drawbacks of the concepts of k-map and delta-presence.

7. For a formal definition of k-map refer to this article by Khaled El Emam and Fida Kamla Dankar ↩

8. For a formal definition of delta-presence refer to this article by Mehmet Ercan Nergiz, Maurizio Atzori, and Chris Clifton ↩

9. https://en.wikipedia.org/wiki/Gerrymandering ↩

10. Technical literature usually denotes a “process” as a “mechanism”. ↩

11. Thanks to “some math”. For a good explanation of the math refer to the proof within this post: https://desfontain.es/privacy/differential-privacy-in-more-detail.html ↩

12. For a great series of accessible write-ups on Differential Privacy refer to: https://desfontain.es/privacy/index.html ↩

13. This can be re-stated more precisely in Bayesian terms: a process is \( \epsilon \)-differentially-private if the support of the result for the hypothesis (i.e. the ratio of the posterior and the prior) is bounded by \( e^\epsilon \) . ↩

14. \( 1 - (99/100)^10 = 0.09562 \) ↩