Eric Khumalo, Data Scientist & Privacy Engineer
September 2022
Privacy Engineering in Action

Hello, I'm Eric

Thanks to a scientific calculator from my uncle and a copy of The Student’s Companion from my mother, I fell in love with math, science, and education. As a young man, I wondered, how could I do all three?

I am from Zimbabwe, where we have a shortage of STEM teachers. Schools needed smart, interested high school grads to help in classrooms. So after high school, I taught for three terms at three schools before leaving for university at UC Berkeley, my first time in the US! While there, like many undergrads before me, I met a professor who changed my life: he introduced me to computer science.

Emzini weCode

For the next three summers I interned at Twitter, which was an excellent learning experience, but I missed teaching. I took a semester off and started Emzini weCode to help young people from Zimbabwe discover the joys and power of learning computer science. Initially I wanted to teach coding basics and help with internships, jobs, and college applications. But my vision expanded to democratizing access to computer science education. I still focus on teaching and mentoring, but my audience has grown to more than 1,500 students, teachers, and mentors from across the globe and the number of courses offered keeps growing.

Meanwhile, after graduating, I took a job as a privacy engineer at Good Research. Like other software engineers, I write, test, debug, and ship code. What’s unique about my job is that I am essentially implementing “privacy by design.” I am responsible for proactively tackling privacy concerns early and often.

At Emzini weCode, I want to put my privacy engineering skills to work to ensure that we collect, process, analyze, and store student data in a way that’s legal, safe, and respectful. To do so, I am using de-identification, the process of removing or altering personally identifiable information. The goal is to transform a dataset so that no one can be re-identified, and there are many techniques for getting there. I wanted to significantly reduce the chance of identifying students for anyone who accessed the data, maliciously or not. I am learning a lot!

What data?

What data do I need? What data do I absolutely need, and what is nice to have? How do I know? Excellent questions for any privacy engineer and often the first step to working with data!

The students apply via a Google Form. The data is stored in a Google Sheet, and I am the only person with access. There’s a range of questions about demographics, background, interests, and knowledge. I need basic personal data like name, phone number, and email to communicate with students and facilitate remote group discussions. I also need data like access to a laptop, math background, and exposure to the command line and other tools or concepts to adapt teaching styles and enable teaching assistants to support the students. Importantly, funders want data to better understand impact, so I ask for age, educational background, and location.

De-identifying all of the personal data limits my ability to communicate with and support the students. As you can see, I need different data for different audiences. Each use-case should only get the data needed, so I create separate pipelines that are populated with only the minimal data necessary. Each pipeline can only be accessed by the people who need it, and the data from the pipelines is deliberately hard to merge, so that no one can cross-reference them and increase the risk of re-identification.

Therefore, “What data?” is an incomplete question. In practice I ask, “What data do I need, for whom, and why?”
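The pipeline idea above can be sketched in code. This is a minimal illustration, not the actual Emzini weCode setup: the field names, pipeline names, and records are my own assumptions. The key trick is giving each pipeline its own random salt when deriving pseudonymous IDs, so rows from two pipelines cannot be joined back into a full profile.

```python
import hashlib
import secrets

# Hypothetical application records; field names are illustrative assumptions.
applicants = [
    {"name": "Jane Doe", "email": "jane@example.com", "phone": "+263 77 123 4567",
     "age_range": "18-22", "location": "Harare", "has_laptop": True,
     "math_background": "A-level", "cli_exposure": False},
]

# Each pipeline declares the minimal fields its audience needs.
PIPELINES = {
    "communication": ["name", "email", "phone"],                        # instructor only
    "teaching":      ["has_laptop", "math_background", "cli_exposure"], # teaching assistants
    "funders":       ["age_range", "location"],                         # impact reporting
}

# A different random salt per pipeline means the pseudonymous IDs
# cannot be matched across pipelines to rebuild a full profile.
SALTS = {name: secrets.token_hex(16) for name in PIPELINES}

def build_pipeline(pipeline, records):
    """Return records reduced to the pipeline's minimal fields,
    keyed by a pipeline-specific pseudonymous ID."""
    fields, salt = PIPELINES[pipeline], SALTS[pipeline]
    out = []
    for r in records:
        pid = hashlib.sha256((salt + r["email"]).encode()).hexdigest()[:12]
        row = {"id": pid}
        row.update({f: r[f] for f in fields})
        out.append(row)
    return out

teaching_view = build_pipeline("teaching", applicants)
# Each teaching row contains only: id, has_laptop, math_background, cli_exposure.
```

Because the teaching and funder views use different salts, the same student gets unrelated IDs in each view, which is one simple way to make the pipelines hard to merge.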

De-identification and comfort levels

There are several approaches to preventing re-identification. At Good Research, we use a risk-based approach. What are the risks of identifying a student? How can you reduce the risks to a comfortable level? What even is a “comfortable level”?

Comfort level is not universal. It’s not even standardized within a company or organization. Determining the risks can be handled per use-case, but use-cases are not static. In privacy engineering, there are rarely straightforward answers. Instead, answers are highly context-dependent. Thus, the more conversations with the more stakeholders, the better. These conversations help you figure out what comfort level is appropriate at any given time.
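One common way to make a “comfortable level” of risk concrete is k-anonymity: the size of the smallest group of records that share the same combination of quasi-identifier values. The source doesn’t say Emzini weCode uses this exact metric, so treat the sketch below, and its field names, as an illustrative assumption. A low k means some student is nearly unique on those fields and therefore easier to re-identify.

```python
from collections import Counter

def smallest_group_size(records, quasi_identifiers):
    """Return k: the size of the smallest group of records sharing the
    same combination of quasi-identifier values. Low k means high risk."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

# Hypothetical, already-generalized records.
records = [
    {"age_range": "18-22", "location": "Harare",   "os": "Windows"},
    {"age_range": "18-22", "location": "Harare",   "os": "Windows"},
    {"age_range": "18-22", "location": "Bulawayo", "os": "Linux"},
]

k = smallest_group_size(records, ["age_range", "location"])
# k == 1 here: the Bulawayo record is unique on these two fields, so anyone
# who knows one student from Bulawayo applied can single that row out.
```

Raising k, by coarsening locations or widening age ranges, is one lever for moving a dataset toward whatever comfort level the stakeholders agreed on.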

Here are a few examples of the risks. There are often many, many more, and they can be challenging to uncover.

  • Identifying a full name from a phone number
  • Partially re-identifying a student through a phone number’s area code
  • Inferring location from a partial name
  • Re-identifying specific students from success stories about those who made it to “flagship” universities or employers

Once I determine the various pipelines of data and establish a comfortable level of risk for each one, I then look at the types of identifiers.

Direct identifiers include not only personal information, but also contextual identifiers (e.g. hardware identifiers of WiFi access points or IP addresses) and device identifiers (e.g. hardware or other persistent identifiers of devices used).
Examples: email address, first name, middle name, surname, phone number

Quasi-identifiers (in some cases, defined as a specific type of indirect identifier) can include demographic information, along with other group-level information. Quasi-identifiers pose risks because a combination of them can render a unique footprint, and can be matched with other external information in enhancement or linkage attacks.
Examples: education, job history, location, timezone, operating system, age range, responses to open questions

Indirect identifiers are everything else. In most cases they are considered lower risk for re-identification, but they can still be combined to form unique or semi-unique fingerprints of someone.
Examples: access to a computer, proficiency with Python, exposure to command line tools, math prerequisite
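The three classes above suggest different treatments: drop or tokenize direct identifiers, generalize quasi-identifiers, and keep indirect ones largely as-is. Here is a minimal sketch of that idea; the field names and generalization rules (5-year age buckets, city-to-country coarsening) are my own illustrative assumptions, not the actual Emzini weCode rules.

```python
DIRECT = {"email", "first_name", "surname", "phone"}
QUASI  = {"age", "location", "education"}

def generalize(field, value):
    # Coarsen quasi-identifiers; these rules are illustrative assumptions.
    if field == "age":
        low = (value // 5) * 5
        return f"{low}-{low + 4}"            # exact age -> 5-year bucket
    if field == "location":
        return value.split(",")[-1].strip()  # "city, country" -> country only
    return value

def deidentify(record):
    """Drop direct identifiers, generalize quasi-identifiers,
    and pass indirect identifiers through unchanged."""
    out = {}
    for field, value in record.items():
        if field in DIRECT:
            continue                         # drop direct identifiers entirely
        if field in QUASI:
            out[field] = generalize(field, value)
        else:
            out[field] = value               # indirect identifiers kept as-is
    return out

row = {"first_name": "Jane", "email": "jane@example.com", "age": 19,
       "location": "Harare, Zimbabwe", "has_laptop": True}
deidentify(row)
# → {"age": "15-19", "location": "Zimbabwe", "has_laptop": True}
```

In a real pipeline the generalization rules would be tuned per use-case, and revisited as the use-cases and the comfort level change.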

With pipelines defined and data fields grouped by identifiers, then comes the work of de-identifying, which we will cover in more detail later this year. However, it all started with asking the right questions. As my colleague Will Monge says, “There is no such thing as safe, rather just ‘safe-enough.’ With datasets, how safe the data is depends on multiple factors, from the data itself to the methods we can use to secure it.”

Optimizing privacy

In a way, de-identification is an optimization problem, albeit a complicated one because people’s privacy is at stake. There’s actually more than “just” privacy; there’s trust, respect, and safety. Unlike other optimization challenges, you can’t just code a solution. This is something I was not taught at school and am quickly learning at Good Research: privacy is not just a technical problem. It’s nuanced, contextual, and ever-changing.

As a privacy engineer, it is my responsibility to keep asking questions and to keep people safe. As a teacher, I make sure my students learn about privacy, trust, respect, and safety, in addition to programming and other technical “hard skills.” The more engineers build with privacy in mind, the less of a burden falls on individuals to protect themselves, and the closer we get to Good Research’s vision for everyone to have the knowledge and agency to thrive in a digital world.

Thanks to Will Monge and Jessica Traynor.