Hello, I'm Eric
Thanks to a scientific calculator from my uncle and a copy of The Student’s Companion from my mother, I
fell in love with math, science, and education. As a young man, I wondered: how could I do all three?
I am from Zimbabwe where we have a shortage of STEM teachers. Schools needed smart, interested high
school grads to help in classrooms. So after high school, I taught for three terms at three schools
before leaving for university at UC Berkeley - my first time in the US! While there, like many
undergrads before me, I met a professor who changed my life: he introduced me to computer science.
For the next three summers I interned at Twitter, which was an excellent learning experience, but I
missed teaching. I took a semester off and started
Emzini weCode to help young people from Zimbabwe discover the joys and power of learning
computer science. Initially I wanted to teach coding basics and help with internships, jobs, and college
applications. But my vision expanded to democratizing access to computer science education. I still
focus on teaching and mentoring, but my audience has grown to more than 1,500 students, teachers, and
mentors from across the globe and the number of courses offered keeps growing.
Meanwhile, after graduating, I took a job as a privacy engineer at Good Research. Like other software
engineers, I write, test, debug, and ship code. What’s unique about my job is that I am essentially
implementing “privacy by design”: I am responsible for proactively tackling privacy concerns early in
the development process rather than patching them in afterwards.
At Emzini weCode, I want to put my privacy engineering skills to work to assure that we collect, process,
analyze, and store student data in a way that’s legal, safe, and respectful. To do so, I am using
de-identification, the process of removing or altering personally identifiable information. The goal
is to transform a dataset so that re-identifying any individual becomes practically impossible, and
there are many techniques for getting there. I wanted to significantly reduce the chance that anyone
who accessed the data, maliciously or not, could identify a student. I am learning a lot!
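One common building block for “removing or altering” identifiers is pseudonymization: replacing a value like an email address with a keyed hash so records can still be linked internally without exposing the original value. Here is a minimal sketch using only Python’s standard library; the field names and key are hypothetical, not the actual Emzini weCode schema:

```python
import hmac
import hashlib

# Secret key kept separate from the data (hypothetical value for illustration).
SECRET_KEY = b"replace-with-a-key-from-a-secrets-manager"

def pseudonymize(value: str) -> str:
    """Replace an identifier with a keyed hash (HMAC-SHA256).

    Unlike a plain hash, an attacker cannot reverse this by hashing
    guessed values unless they also hold the secret key.
    """
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

record = {"email": "student@example.com", "math_background": "A-level"}
safe_record = {**record, "email": pseudonymize(record["email"])}
```

The keyed hash is deterministic, so the same student maps to the same token across datasets you control, while the raw email never leaves the original sheet.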
What data do I need?
What data do I absolutely need, and what is nice to have? How do I know? Excellent questions for any
privacy engineer and often the first step to working with data!
The students apply via a Google Form. The data is stored on a Google Sheet, and I am the only person
with access. There’s a range of questions about demographics, background, interests, and knowledge. I
need basic personal data like name, phone number, and email to communicate with students, and facilitate
remote group discussions. I also need data like access to a laptop, math background, and exposure to
command line and other tools or concepts to adapt teaching styles and enable teaching assistants to
support the students. Importantly, funders want data to better understand impact, so I ask for age,
educational background, and location.
De-identifying all of the personal data limits my ability to communicate with and support the students.
As you can see, I need different data for different audiences. Each use-case should only get the data
needed, so I create separate pipelines that are populated with only the minimal data necessary. Each
pipeline can only be accessed by the people who need it, and the data from the pipelines is deliberately
hard to merge, to prevent somebody from joining them and increasing the risk of re-identification.
Therefore, “What data?” is an incomplete question. In practice I ask, “What data do I need, for whom,
and why?”
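The per-audience pipelines described above can be sketched as a simple projection: each audience gets only the minimal fields it needs. The audience names and column names below are hypothetical stand-ins for the real application form:

```python
# Hypothetical audiences and column names; each pipeline carries only the
# minimal fields its audience needs (data minimization).
PIPELINES = {
    "instructors": ["name", "email", "math_background", "has_laptop"],
    "funders": ["age_range", "education_level", "country"],
}

def build_pipeline(records, audience):
    """Project each record down to the minimal fields for one audience."""
    fields = PIPELINES[audience]
    return [{k: r[k] for k in fields if k in r} for r in records]

applicants = [
    {"name": "T. Moyo", "email": "t@example.com", "age_range": "18-21",
     "country": "Zimbabwe", "education_level": "secondary",
     "math_background": "A-level", "has_laptop": True},
]
funder_view = build_pipeline(applicants, "funders")
# funder_view contains no names, emails, or phone numbers
```

Because each view is generated from the source sheet rather than shared wholesale, adding a new audience later just means adding a new field list, not re-exporting everything.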
De-identification and comfort levels
There are several approaches to preventing re-identification. At Good Research, we use a risk-based
approach. What are the risks of identifying a student? How can you reduce the risks to a comfortable
level? What even is a “comfortable level”?
Comfort level is not universal. It’s not even standardized within a company or organization. Determining
the risks can be handled per use-case, but use-cases are not static. In privacy engineering, there are
rarely straightforward answers; instead, answers are highly context-dependent. Thus, the more
conversations you have with more stakeholders, the better. These conversations help you to figure out what
comfort level is appropriate at any given time.
Here are a few examples of the risks. There are often many, many more, and they can be challenging to uncover.
- Identifying a full name from a phone number
- Partially re-identifying someone through a phone number’s area code
- Inferring location from a partial name
- Using student success stories to re-identify the specific students who made it to “flagship” programs
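One way to make a “comfortable level” of risk concrete is to measure how unique each combination of quasi-identifiers is: the k-anonymity of the dataset. If any combination appears only once, that student is uniquely identifiable from those fields alone. A minimal sketch, with hypothetical field names:

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return k: the size of the smallest group of records sharing the same
    quasi-identifier values. k == 1 means someone is uniquely identifiable."""
    groups = Counter(tuple(r.get(q) for q in quasi_identifiers) for r in records)
    return min(groups.values())

students = [
    {"age_range": "18-21", "country": "Zimbabwe", "education": "secondary"},
    {"age_range": "18-21", "country": "Zimbabwe", "education": "secondary"},
    {"age_range": "22-25", "country": "Zimbabwe", "education": "university"},
]
k_anonymity(students, ["age_range", "country"])  # the 22-25 student is alone in their group, so k = 1
```

A stakeholder conversation can then be anchored on a number: “are we comfortable releasing this view at k = 1, or should we coarsen the age ranges until k rises?”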
Once I determine the various pipelines of data and establish a comfortable level of risk for each one, I
then look at the types of identifiers.
Direct identifiers include not only personal information, but also identifiers of context (e.g.
hardware identifiers of WiFi access points or IP addresses) and device identifiers (e.g. hardware or
other persistent identifiers of devices used).
Examples: email address, first name, middle name, surname, phone number.

Quasi-identifiers (in some cases, defined as a specific type of indirect identifier) can include
demographic information, along with other group-level information. Quasi-identifiers pose risks
since a combination of them can render a unique footprint, and can be used to match with other
external information in enhancement or linkage attacks.
Examples: education, job history, location, timezone, operating system, age range, responses to open questions.

Indirect identifiers are everything else. In most cases they are considered to be lower risk for
re-identification, but, to be honest, they can still be combined together to form unique or
semi-unique fingerprints of someone.
Examples: access to a computer, proficiency with Python, exposure to command line tools, math prerequisite.
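Once fields are grouped this way, each group gets a different treatment: direct identifiers are dropped (or pseudonymized), quasi-identifiers are generalized, and indirect identifiers pass through. A minimal sketch of that per-group policy, with hypothetical field names and a simple age-bucketing rule of my own choosing:

```python
# Hypothetical field classification following the three identifier types above.
DIRECT = {"email", "first_name", "surname", "phone"}
QUASI = {"education", "location", "timezone", "age"}
# Everything else is treated as an indirect identifier and kept as-is.

def generalize_age(age):
    """Coarsen an exact age into a 5-year range (one common quasi-identifier treatment)."""
    low = (age // 5) * 5
    return f"{low}-{low + 4}"

def deidentify(record):
    out = {}
    for field, value in record.items():
        if field in DIRECT:
            continue  # drop direct identifiers entirely
        if field == "age":
            out["age_range"] = generalize_age(value)
        else:
            # Remaining quasi-identifiers are candidates for further
            # generalization or suppression; indirect ones pass through.
            out[field] = value
    return out

deidentify({"email": "s@example.com", "age": 19, "has_laptop": True})
# → {"age_range": "15-19", "has_laptop": True}
```

The real decision, of course, is not the code but the policy table: which fields land in which set, and how coarse the generalization must be to reach the comfort level agreed with stakeholders.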
With pipelines defined and data fields grouped by identifiers, then comes the work of de-identifying,
which we will cover in more detail later this year. However, it all started with asking the right
questions. As my colleague Will Monge says, “There is no such thing as safe, rather just ‘safe-enough.’
With datasets, how safe the data is depends on multiple factors, from the data itself to the methods we
can use to secure it.”
In a way de-identification is an optimization problem, albeit a complicated one because people’s privacy
is at stake. There’s actually more than “just” privacy; there’s trust, respect, and safety. Unlike other
optimization challenges, you can’t just code a solution. I think this is something that I was not taught
at school and am quickly learning at Good Research: privacy is not just a technical problem. It’s
nuanced and contextual and ever changing.
As a privacy engineer, it is my responsibility to keep asking questions and to keep people safe. As a
teacher, my students learn about privacy, trust, respect, and safety, in addition to programming and
other technical “hard skills.” The more engineers build with privacy in mind, the less of a burden
falls on individuals to protect themselves, and the closer we get to Good Research’s vision: for
everyone to have the knowledge and agency to thrive in a digital world.
Thanks to Will Monge and Jessica Traynor.