There is no such thing as safe, only “safe-enough”. We can only make it increasingly harder
for an attacker to re-identify individuals in the dataset.
For instance, let’s assume that we have a bicycle we want to “secure”. We could buy a lock and lock
it to a lamp post outside our door. Is that safe? Maybe. Where that post is, how long the bike will be locked there, how expensive the bike looks, and how strong the lock is will all affect how “safe” we can say our bike is. If we are locking
the bike outside of a coffee shop for a couple of hours, then we might not need a massive chain. But
if we are leaving the bike overnight
in a city like New York, we might need multiple massive chains to lock each wheel and the frame. And
even so, an attacker with sufficient time and equipment may be able to steal it.
Similarly, with datasets, how safe the data is depends on multiple factors, from the data itself to the methods we can use to secure it.
Given the above, it’s easy to assume that the only way to truly defend a dataset is to not publish
it. However, sometimes we do need to publish or share a dataset.
Once the decision to share the data is made, it’s important to consider why we are doing it (i.e. how it is going to be legitimately used) and how we think attackers will attempt to exploit certain information in the dataset. With this information,
we can focus our own efforts on making illegitimate uses more cumbersome, while trying to keep
legitimate uses of the data as productive and fruitful as possible.
There are different threats that de-identification is used to defend against. In general, however, we guard against two things; specifically, an attacker should not be able to answer these questions:
- Which person (or persons) does each specific part (entry, group of rows, column, etc.)
represent in our dataset?
- Is a given person, X, in the dataset?2 (regardless
of which row they may correspond to)
- In some cases, we consider the converse as well: Is a given person X not in the dataset?
There are two main approaches we can use:
- Defend the data itself
- Create specific “safe” methods to use on the data, and only access the data with those
methods (rather than the full data)
Traditional4 approaches try to defend the data as much as
possible by removing, changing, or grouping information that could be used to identify individuals.
Removing information entails redacting or censoring specific information that is considered
sensitive or potentially risky for re-identification; this approach can vary in sophistication.
Consider the following three examples (from simple to more sophisticated):
- Removing all mentions of a specific word
- A rule-based approach of removing all names and addresses
- Removing a subset of different categories of information, either randomly or following some specific criteria
The complete removal of data may hinder the legitimate use of the dataset too much. Sometimes, it’s
better to have somewhat inaccurate data than no data at all; another approach is to slightly modify
these pieces of information when possible, e.g. move
dates, round numeric values, pseudonymize names, etc. The challenge in this approach is finding a way to make the data inaccurate enough to deter illegitimate uses, but accurate enough that legitimate uses remain useful.
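As a minimal sketch of these modifications (the record fields, salt value, and helper names below are illustrative, not from the original text):

```python
import hashlib
import random
from datetime import date, timedelta

SECRET_SALT = "replace-with-a-secret"  # hypothetical salt, never published with the data

def pseudonymize(name: str) -> str:
    """Replace a name with a salted hash so the same person always maps to the same token."""
    return hashlib.sha256((SECRET_SALT + name).encode()).hexdigest()[:8]

def shift_date(d: date, max_days: int = 15) -> date:
    """Move a date by a random offset to blur exact timestamps."""
    return d + timedelta(days=random.randint(-max_days, max_days))

def round_income(income: float, nearest: int = 5000) -> int:
    """Round a numeric value to reduce its precision."""
    return int(round(income / nearest) * nearest)

record = {"name": "Jane Doe", "visit": date(2021, 3, 14), "income": 74321}
safe_record = {
    "name": pseudonymize(record["name"]),
    "visit": shift_date(record["visit"]),
    "income": round_income(record["income"]),
}
print(safe_record)
```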
In some cases, maintaining the dataset’s original level of granularity may not be necessary for the original legitimate purpose. In these cases, it is almost always better to group and aggregate the data, reducing its granularity and, with it, the amount of information that could be used for illegitimate purposes.
In a grouped dataset, each entry will represent a group of entries from the original dataset that
share some common characteristics: e.g. ages between 10-20, living within the zip code 35254 (or in
a zip code that starts with 352**), etc.
Information in that bucket (e.g. survey responses) can then be aggregated into averages (e.g. mean,
median), ranges (e.g. max, min, percentiles), counts, sums, and many other aggregations that may
prove useful for legitimate purposes.
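A grouping and aggregation along these lines might look like the following sketch (the column names and bucket sizes are made up for illustration):

```python
import pandas as pd

# Hypothetical survey data; the column names are illustrative.
df = pd.DataFrame({
    "age": [14, 19, 23, 31, 36, 18],
    "zip": ["35254", "35213", "35254", "35242", "35213", "35254"],
    "score": [7, 9, 6, 8, 5, 10],
})

# Reduce granularity: 10-year age buckets and 3-digit zip prefixes.
decade = df["age"] // 10 * 10
df["age_group"] = decade.astype(str) + "-" + (decade + 9).astype(str)
df["zip_prefix"] = df["zip"].str[:3] + "**"

# Aggregate the sensitive value within each bucket.
grouped = df.groupby(["age_group", "zip_prefix"])["score"].agg(["count", "mean", "min", "max"])
print(grouped)
```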
As with any other approach, grouped data is also nuanced: some aggregations may remove too much
information, while others might not protect it enough. We must find which information to aggregate
and how much we should aggregate it. The
following questions are the main points we should ask ourselves when considering grouping:
- Are we aggregating enough?
- How many candidates for a group are there in the general population?
- Are the people within each group different enough?
Let’s go over each of these questions one at a time:
If there is a relatively small number of people in a group, an attacker who knows which group someone belongs to might be able to identify them.
For example, let us consider a dataset in which we have flight information aggregated by origin and
destination in a single day (e.g. all the Boston-Chicago flights are aggregated together). If there
are only one or two trips from Salt
Lake City to Birmingham a day, then an attacker might gain too much information about these trips.
This concept is known as k-anonymity5. A 5-anonymous dataset is one in which each group contains at least 5 entries from the original dataset.
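Checking the k of a proposed grouping is straightforward: it is simply the size of the smallest group. A sketch with made-up flight records mirroring the example above:

```python
import pandas as pd

# Hypothetical flight-level records; columns are illustrative.
flights = pd.DataFrame({
    "origin": ["BOS", "BOS", "SLC", "BOS", "SLC"],
    "destination": ["ORD", "ORD", "BHM", "ORD", "BHM"],
    "fare": [220, 180, 410, 250, 395],
})

# k is the number of records in the smallest (origin, destination) group.
group_sizes = flights.groupby(["origin", "destination"]).size()
k = int(group_sizes.min())
print(f"This grouping is {k}-anonymous")  # here k = 2, driven by the SLC-BHM group
```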
In some cases, there may be a relatively small number of people in the “universe”6 (not just within our dataset) that could belong to a specific group. This can also be dangerous, and we may need to guard against this scenario.
For example, consider a voting registry for the Democratic primaries in which we aggregate by zip code, highest education level, and age group. If we know there are only 2 people with a master’s degree in a particular zip code, then by looking at the aggregated data for that zip code and age group we can get a good idea of whether those people are registered Democrats or not.
Defending against a small number of potential candidates in the outside universe for each of our groups is a concept known as k-map7. Similarly, delta-presence8 combines k-map and k-anonymity: for each of our groups, we control the proportion of people from the total universe that appear in our dataset.
If all the people in a specific bucket are too similar in a specific category, then an attacker who
knows that an individual is in that bucket might gain too much information.
For example, let’s say we ask each of the students in a class to give their teacher a 1-10 score, and
we aggregate the dataset by giving the average of each row of students in the class. If any of the
rows has an average value of 10, then
we know how all the students in that row feel about the teacher.
This concept is known as l-diversity. A dataset is 3-diverse if each group contains at least 3 distinct values of the sensitive attribute.
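As with k-anonymity, this can be checked directly on a table; a sketch with made-up classroom data:

```python
import pandas as pd

# Hypothetical classroom data: seating row and each student's 1-10 score.
scores = pd.DataFrame({
    "row": [1, 1, 1, 2, 2, 2],
    "score": [10, 10, 10, 7, 9, 4],
})

# l is the smallest number of distinct sensitive values within any group.
l = int(scores.groupby("row")["score"].nunique().min())
print(f"This grouping is {l}-diverse")  # row 1 has a single distinct value, so l = 1
```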
In a more nuanced example, imagine we also asked those students if there are any diabetics in their
family, and among the entire class 30% have diabetics in their family. If we observe that 80% of
students in row 5 have diabetics in their
family, then an insurance company may want to charge students in this row a higher premium.
Although this is a very contrived example, a party might similarly make use of uneven proportions in voting data to redraw district boundaries and “gerrymander”9 a district.
This concept is known as t-closeness. It is a more nuanced concept than l-diversity because we also
care about the difference between the distributions of particular groups and the overall dataset,
i.e. is this group significantly different
from the general population?
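As a rough sketch of that comparison, using the diabetes question from the example above (and total variation distance as a simple stand-in for the Earth Mover’s Distance that t-closeness is usually defined with):

```python
import pandas as pd

# Hypothetical data: whether each student has diabetics in their family, by seating row.
df = pd.DataFrame({
    "row": [5, 5, 5, 5, 5, 1, 1, 1, 1, 1],
    "diabetes_in_family": [1, 1, 1, 1, 0, 0, 0, 1, 0, 0],
})

overall = df["diabetes_in_family"].value_counts(normalize=True)

# For each group, measure how far its distribution is from the overall one.
for row, grp in df.groupby("row"):
    dist = grp["diabetes_in_family"].value_counts(normalize=True)
    tvd = (dist.reindex(overall.index, fill_value=0) - overall).abs().sum() / 2
    print(f"row {row}: distance from overall distribution = {tvd:.2f}")
```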
If we take a moment to pause and think about what we are doing with the data, we could conclude
that what we are doing is reducing the dimensions of the data. It may sound odd or too complex,
but consider the following:
Our original dataset is rich with different types of information about each person (or
transaction, etc.). If we take a typical demographic dataset as our example, we might be
providing information about where the person lives, their
age, workplace, income, the car they drive, etc. The more of these different “dimensions” there are, the more distinct each person is going to be; there might not be many blond 34-year-olds living in Plymouth, MA, driving a Toyota Prius, but I’m sure there are a lot more 34-year-olds living in Plymouth.
The same happens geometrically. If we randomly pick four numbers between 1 and 8, we would consider it very unlikely for any of them to be at least 3 away (greater or smaller) from its closest neighbor. But if we add more dimensions,
and we pick 16 (because it’s 4 squared) random squares in a chessboard (8 by 8), it’s much more
likely that we find at least one of these squares at least 3 squares away from its closest
neighbor. If we add yet another dimension,
say an office building with 8 floors, each floor an 8 by 8 grid of cubicles, and we randomly distribute our 64 (4 cubed) hires throughout the building, it will be even more unlikely to find two of them sitting close enough to whisper
to each other… and so on.
This means the more dimensions we have, the easier it is to find “unique points” (i.e. points that are not similar to any of the others); in fact, every point tends to become more “unique”, making our attacker’s task much, much easier.
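A quick simulation illustrates the effect; the grid size and point counts here echo the examples above but are otherwise arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Same number of points, increasing dimensions: the typical distance to the
# nearest neighbour grows, i.e. points become more "unique".
n_points = 64
for dims in (1, 2, 3, 6):
    points = rng.uniform(0, 8, size=(n_points, dims))
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # ignore each point's distance to itself
    print(f"{dims} dimension(s): mean nearest-neighbour distance = {dists.min(axis=1).mean():.2f}")
```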
So, geometrically, our task when defending data is to bring the points closer to each other, i.e. to make them less distinguishable from one another. The strategies we just talked about also have their geometric analogs:
- Removing data is almost like projecting the points onto a specific plane, removing one of their dimensions; like using two birds’ shadows on the ground rather than their positions in the sky: the birds may seem really close to each other, but they might just be flying at very different altitudes.
- Changing the data is like moving a point; the strategy would be to move points closer
together, so they form a tighter group.
- Finally, aggregating data is like creating opaque rectangles, prisms, or forms that group
a bunch of points together. We can’t see inside the rectangles and we don’t know how
many points are inside.
As we’ve seen, defending the data itself is hard, because it’s very hard (or impossible) to come up with all the illegitimate ways an attacker may try to use the data, and we never know what
attackers may already know or what information they might
gain in the future; think of it as trying to defend against the unknown.
There is a different approach to de-identification: Differential Privacy.
In Differential Privacy, we forget about trying to defend
the data and focus on creating “safe processes” for people to make use of data, and we only let
them use these processes. A process10 can be
any way of making use of the data: a query, statistic (like the sum, mean, 95th-percentile, etc), or
the training of a model on such data (outputting only the model parameters).
A process will be considered to be differentially-private
if, after using it, an attacker doesn’t have a significantly better idea of whether a subject is
within the dataset or not.
The interesting thing about DP is that we can guarantee the
DP-ness by looking at the process itself, and remove some of these hypothetical unknowns.
How do we do this, you ask? Let’s dive in.
We will construct two datasets: one, \( D \), with all our subjects, and one, \( D^* \), without one of our subjects, x. Now, if the result of applying the process \( f \) on one of the datasets, \( f(D) \), is sufficiently similar to that of applying the mechanism on the other dataset, \( f(D^*) \), then we can guarantee11 DP-ness!
In other words, since the results of \( f(D) \) and \( f(D^*) \) are sufficiently similar, an attacker who observes an output can’t be sure whether it comes from \( D \) or from \( D^* \), and therefore they can’t be sure whether x is in the dataset.
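The standard way to write this requirement formally (not spelled out in the text above) is that, for every possible set of outcomes \( S \),

\[ \Pr[f(D) \in S] \;\le\; e^{\epsilon}\,\Pr[f(D^*) \in S] \]

(and with \( D \) and \( D^* \) swapped), where smaller values of \( \epsilon \) mean the two output distributions are harder to tell apart.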
This means that by simply running a few computations of this
process, we can determine if it is “safe-enough” or not.
We have only vaguely mentioned some key concepts, like “sufficiently similar”, but before we go into
those let’s give a less abstract example. In practice, Differential Privacy is usually obtained by
adding a particular kind of noise to
the result, so that we never give out the real value but something close enough instead. The key property of this random noise is that it has a mean of zero, so that in the long run the errors balance out. This means that although
each of our answers is slightly off, in the aggregate we are still spot on (in other words, we have
not introduced any bias, just some variance). This allows our users to still make accurate
predictions and useful inferences by using
the processes we have defined.
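A common way to do this is the Laplace mechanism; here is a minimal sketch for a differentially private mean, assuming values can be clipped to a known range (the bounds, \( \epsilon \) value, and function name are illustrative):

```python
import numpy as np

rng = np.random.default_rng()

def dp_mean(values, lower, upper, epsilon):
    """A differentially private mean of bounded values (a minimal sketch).

    Values are clipped to [lower, upper], so replacing one record can change the
    clipped mean by at most (upper - lower) / n; that sensitivity sets the scale
    of the Laplace noise.
    """
    values = np.clip(np.asarray(values, dtype=float), lower, upper)
    sensitivity = (upper - lower) / len(values)
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return values.mean() + noise

ages = [23, 31, 36, 18, 44, 29]
print(dp_mean(ages, lower=0, upper=100, epsilon=0.5))  # noisy, but unbiased on average
```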
One weakness of this approach is that an attacker could run the same process multiple times and
average the results out, thereby obtaining a good estimation of the “true answer”. For this reason,
the notion of a privacy-budget was introduced;
each user of the process is assigned a specific budget, and as they make use of the process, their
budget is reduced. The process could potentially add more noise based on the budget left, and at
some point restrict the access entirely
when the budget has been depleted. However, managing this privacy-budget is complicated, and can
also be defeated by coordinated attackers using multiple budgets.
By this we mean that the results have not greatly shifted the attacker’s initial hypothesis of whether a specific subject is, or isn’t, within the dataset.
We will say that knowing a subject is within the dataset corresponds to 100% confidence, knowing the subject is not within the dataset to 0%, and having no clue whether the subject is or isn’t to 50%. Let’s assume that, prior to running the process, an attacker
has a slight belief that subject x is within the dataset, i.e. the attacker starts with 55%
confidence of subject x being in the dataset. If the process is \( \epsilon \)-differentially private, then the attacker can, at most, be \( e^\epsilon \times 0.55 \) confident after observing the result of the process. As such, \( \epsilon \) limits how much the attacker’s belief can improve after seeing the results13.
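For instance, with \( \epsilon = 0.1 \) and the 55% prior above,

\[ e^{0.1} \times 0.55 \approx 1.105 \times 0.55 \approx 0.61 \]

so even after seeing the result, the attacker’s confidence can rise from 55% to at most about 61%.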
One of the first concepts we mentioned is the definition of \( f(D) \) and \( f(D^*) \) being “sufficiently similar”. The first consequence of this is that \( f \) is now forced to be non-deterministic, i.e. if I run the process two times I might get two different answers, like rolling a die. If the process did not have some randomness to it, i.e. if it was completely deterministic, then it would be easy to determine whether a given output came from \( D \) or \( D^* \).
How do we compare two non-deterministic processes? By looking at the probabilities of the possible
outcomes and comparing these. Let’s say we want to compare two coins: one is a normal heads-tails
coin, the other has heads on both sides.
Now let’s say I have one of these two coins and you have to figure out which one it is, and I can
only tell you what the results of the flips are. If I flip the coin and get tails, then you know it’s the normal coin, but if I were to get heads, then you can’t be sure which one it is. However, if I flip the coin 3 times and they all come up heads, then you are slightly more certain that I am using the fake coin. The more flips I tell you, the more probabilistically
you, the more probabilistically
different the coins will be.
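To put a number on “slightly more certain”: after three heads in a row,

\[ \frac{\Pr[\text{three heads} \mid \text{two-headed}]}{\Pr[\text{three heads} \mid \text{normal}]} = \frac{1}{(1/2)^3} = 8 \]

so the observation makes the two-headed coin eight times more likely, relative to whatever odds you started with.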
Now imagine I have two 100-sided dice: one has numbers 0-99 and the other has numbers 1-100. Even if
I roll one die 10 times and tell you the results, you would not be able to say which of the two I
used unless I rolled a 0 or 100, which
is quite unlikely (less than 10%14). If we decide a 10% chance of telling them apart is acceptable, then we can say that these two dice are sufficiently similar.
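That “less than 10%” can be checked directly: the 0-99 die shows a 0 on any given roll with probability \( 1/100 \), so the chance of seeing at least one 0 in 10 rolls is

\[ 1 - \left(\tfrac{99}{100}\right)^{10} \approx 0.096 \]

and the same bound holds for rolling a 100 with the other die.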
In practice, we can compute the overlap between the probability distribution of each, and determine
whether they are sufficiently close or not. However, this comes with its own nuances as well.
Determining the probability distribution
for a well-known, simple, mechanical process such as a die roll or a coin flip is straightforward,
but when we try to come up with the distribution of obtaining a mean sentiment of a series of
comments for a product, this can prove
to be a bit harder and more subjective. For such cases, we use specific probability function
families, based on our knowledge of the process and its context, and try to adjust the parameters as
much as possible using data from other
populations or even our own observed subsets. This is not impractical or wrong, but it showcases the
nuances, and how even a mathematical approach can include subjective decision-making.
In order to determine if a specific process, \( f \), is differentially private, we need to make sure that, given a dataset \( D \), for any single-record modification (adding or removing a record) of this dataset (which we will call \( D^* \)), \( f(D) \) and \( f(D^*) \) are sufficiently similar (which we already determined in the above point B to be nuanced).
Even when the probability distribution is fairly straightforward, coming up with all the potential
single-record modifications that can be performed for the given dataset is impractical, if not
downright infeasible. In order to tackle this issue, we can also make statistical assumptions about
the type of data we could add, model the additions and subtractions, and calculate the bounds of
impact from these changes. However, again, this introduces potential biases and the subjective interpretations of the analyst into the analysis.
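As a sketch of the “bound it instead of enumerating it” idea, assuming values can be clipped to a known range (the range and names are illustrative):

```python
import numpy as np

def sum_sensitivity(lower: float, upper: float) -> float:
    """Worst-case change of a clipped sum when one record is added or removed.

    Rather than enumerating every possible single-record modification, we bound it:
    any added or removed value lies in [lower, upper], so the sum can change by at
    most max(|lower|, |upper|).
    """
    return max(abs(lower), abs(upper))

values = np.clip([12, 30, 7, 55], 0, 50)
print(values.sum(), "+/-", sum_sensitivity(0, 50))  # an analytic bound, not an enumeration
```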
The privacy-budget is a quantity that helps limit how much information a user can learn by using the
differentially-private process, without compromising the guarantees of differential privacy.
This privacy-budget can be implemented in multiple ways, such as a fixed-price for each process run,
or even dynamically priced based on the specific query or results, depending on how potentially
dangerous they could be. Maintaining and controlling the privacy-budget, so that attackers who share information or coordinate queries can’t sidestep the safeguard, is usually one of the biggest challenges of the main implementations of differential privacy.
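To make the bookkeeping concrete, here is a minimal sketch of a fixed-price budget; the class, prices, and accounting policy are illustrative only:

```python
class PrivacyBudget:
    """Track a per-user epsilon budget with a fixed price per query (a sketch)."""

    def __init__(self, total_epsilon: float):
        self.remaining = total_epsilon

    def spend(self, epsilon: float) -> bool:
        """Deduct the query's epsilon cost; refuse the query once the budget is gone."""
        if epsilon > self.remaining:
            return False
        self.remaining -= epsilon
        return True

budget = PrivacyBudget(total_epsilon=1.0)
for _ in range(5):
    if budget.spend(0.3):
        print("query allowed, remaining budget:", round(budget.remaining, 2))
    else:
        print("budget depleted: query refused")
```

Note the caveat from the text: a single per-user budget does nothing against colluding users who pool their separately budgeted answers.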