How is your data being used? Who is it being shared with? Data privacy is a major ethical issue facing data scientists. Access to and analysis of data can be a great benefit—personalized treatment is a remarkable convenience (like Netflix recommendations) and potential societal benefit (such as analyzing health records to develop cutting-edge medical techniques)—but it also carries a heavy responsibility. Most people might not be terribly worried if their Netflix history were made public (save for a bit of shame over a binge-watching habit), but they would take great issue if their medical or GPS records along with personally identifying information were released.
This issue came to a head in August when a group of hackers released a 9.7 gigabyte data dump of illegally-obtained private data from the website Ashley Madison. The files contained log-in information, personal details, and payment transaction history of close to 32 million users on the “premier site for married individuals seeking partners for affairs.”
Journalists immediately took to the data, using it to reveal activity on the site from people such as Josh Duggar, star of the TLC reality show 19 Kids and Counting, a GOP executive director, and around 15,000 .mil or .gov email addresses. The Columbia Journalism Review has posted a lengthy essay on whether or not reporting on the data dump is ethical—the conclusion was split: some say that publishing the data is an ethical breach under any circumstances; others argue that the data is already out there—you can’t unring the bell, so to speak—and publishing it does no further harm.
But we’re not journalists, we’re data scientists. Is there a way to use the Ashley Madison data ethically?
Intentionally Adding Noise to Data
First and foremost, using any data that contains personally identifying information—PII for short—is an ethical violation. To get around this, a baseline solution is simply redacting personal information, referring to users by randomly assigned “identification keys” instead of their full names. But redacting PII isn’t always enough. Take ride-hailing companies like Uber and Lyft—redacting someone’s name doesn’t do much good when a user who frequently travels between a particular residential address and a particular commercial address is likely one of the few people who live at the first and work at the second.
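As a concrete sketch of that baseline step, here is one way redaction with random identification keys might look (the records, field names, and `redact_pii` helper are all hypothetical, invented for illustration):

```python
import uuid

# Hypothetical toy records; a real pipeline would read these from storage.
records = [
    {"name": "Alice Example", "signup_year": 2013, "city": "Austin"},
    {"name": "Bob Example", "signup_year": 2014, "city": "Denver"},
]

def redact_pii(rows, pii_fields=("name",)):
    """Drop PII fields and assign each row a random identification key."""
    redacted = []
    for row in rows:
        clean = {k: v for k, v in row.items() if k not in pii_fields}
        clean["id_key"] = uuid.uuid4().hex  # random, not derived from the PII
        redacted.append(clean)
    return redacted

safe = redact_pii(records)
```

Note that this only removes direct identifiers; as the Uber/Lyft example shows, quasi-identifiers like the remaining location fields can still re-identify someone, which is exactly the gap differential privacy aims to close.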
One rising method of handling data while respecting individuals’ privacy is a technique known as differential privacy. At its most basic level, differential privacy involves adding a bit of noise to your data—fuzzing it up a bit—in order to add a layer of privacy between you, the data scientist, and the individuals within. It blurs the data points (i.e., information about individuals) just enough to let the data scientist study the patterns in the data without being able to detect details specific to any one user.
The original form of differential privacy is what is called “ɛ-differential privacy.” The mathematical definition of ɛ-differential privacy is a bit technical, but the basic idea is that the presence or absence of any one individual’s (e.g. your own) data within a data set changes the results of analysis only slightly—and how slightly is limited by how small we keep the value of ɛ. Because the data yields nearly the same answers with or without you, the risk of an “attacker” learning information specific to you stays low: the near-indistinguishability of the data with or without you keeps your information private.
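For the mathematically inclined, the standard statement of the definition (paraphrased here, not quoted from any one source) says a randomized algorithm $M$ is ɛ-differentially private if, for any two data sets $D$ and $D'$ differing in a single individual’s record, and any set of possible outputs $S$:

```latex
\Pr\!\left[M(D) \in S\right] \;\le\; e^{\varepsilon} \cdot \Pr\!\left[M(D') \in S\right]
```

The smaller ɛ is, the closer these two probabilities must be, and the less any one person’s presence or absence can affect what an attacker observes.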
How it Works to Protect Users
To understand why we might care about this, imagine we are looking at a data set containing the information of all employees at your place of business. Suppose also that an attacker knows the date you joined the company. Then, by querying the data set for the average age, income, medical insurance claims, etc. of employees before and after your start date, the attacker could deduce a lot of private information about you from the differences.
ɛ-differential privacy comes in by adding a little bit of “noise” to the answer returned for each query into the data set. With this pinch of noise, queries cannot pinpoint exact differences—like the precise change in average age or income—so it’s not possible to make a precise deduction about your personal data by comparing queries made to the data set with and without you.
In recent years, there has been a surge of research into what kind of noise to add, how to add it for various use cases (data set queries, training machine learning classifiers, etc.), and how much “noisiness” is needed for a desired level of protection. For continuous variables like age or income, a popular (but no longer state-of-the-art) technique is the “Laplace mechanism,” which randomly perturbs the answer to a query according to certain rules (namely, the random additions or subtractions are drawn from the Laplace distribution). In this way one never knows whether the answer returned was exact, but by controlling the amount of noise (by controlling the shape of the Laplace distribution), we can be sure that, statistically, the total picture is not too far off from the original.
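To make the Laplace mechanism concrete, here is a minimal sketch of a differentially private average query, assuming values are clipped to a known range so the query’s sensitivity can be bounded (the data, range, and ɛ value are made up for illustration):

```python
import math
import random

def laplace_noise(scale):
    """Sample from a Laplace(0, scale) distribution via inverse-CDF sampling."""
    u = random.uniform(-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_average(values, lo, hi, epsilon):
    """Answer an average query with Laplace noise calibrated to epsilon.

    Clipping each value to [lo, hi] bounds the query's sensitivity at
    (hi - lo) / n: changing one person's record moves the true average by
    at most that amount, so noise with scale sensitivity / epsilon gives
    epsilon-differential privacy for this single query.
    """
    n = len(values)
    clipped = [min(max(v, lo), hi) for v in values]
    true_average = sum(clipped) / n
    sensitivity = (hi - lo) / n
    return true_average + laplace_noise(sensitivity / epsilon)

ages = [29, 34, 41, 38, 52, 45, 31, 60]
noisy_answer = private_average(ages, lo=18, hi=90, epsilon=1.0)
```

Each repeated query consumes more of the privacy budget, so in practice ɛ has to cover the whole sequence of queries an analyst makes, not just one.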
Machine Learning and Regularization
As data scientists, a natural reaction to hearing that we are adding noise to the data is to worry that it would make the insights drawn from the data less precise. Remarkably, the opposite is often true in machine learning: adding noise to data in many cases has what data scientists call a “regularizing effect.”
Regularization is a tool that data scientists use all the time when trying to draw inferences about general patterns in their data with machine learning algorithms. It is very important in the practice of data science to avoid overfitting to the data—letting the precise values of any one point influence the overall pattern the scientist is searching for. By “noising up” the data just enough (e.g. with a Laplace distribution that is not too noisy), the precise values of individual data points have less influence while the overall pattern is still preserved, so the data scientist can actually do a better job of finding the real signal in the data set.
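A quick empirical sketch of that regularizing effect (entirely synthetic: the true signal, noise levels, and polynomial degree are assumptions chosen for illustration, and numpy is assumed to be available):

```python
import numpy as np

rng = np.random.default_rng(0)

# Ten samples of a simple true signal, y = 2x, with measurement noise.
x_train = np.linspace(0.0, 1.0, 10)
y_train = 2.0 * x_train + rng.normal(0.0, 0.1, size=10)

# A degree-9 polynomial has enough freedom to chase every training point.
overfit = np.poly1d(np.polyfit(x_train, y_train, 9))

# "Noise up" the inputs: replicate each point with small random jitter,
# which discourages the fit from relying on any one point's exact value.
reps = 30
x_jittered = np.repeat(x_train, reps) + rng.normal(0.0, 0.05, size=10 * reps)
y_jittered = np.repeat(y_train, reps)
noised = np.poly1d(np.polyfit(x_jittered, y_jittered, 9))

# Compare both fits against the true signal on held-out inputs.
x_test = np.linspace(0.05, 0.95, 50)
err_overfit = float(np.mean((overfit(x_test) - 2.0 * x_test) ** 2))
err_noised = float(np.mean((noised(x_test) - 2.0 * x_test) ** 2))
```

On runs like this, the jittered fit typically tracks the underlying line more closely than the interpolating fit, though the exact numbers depend on the random draw.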
So what does this mean for the Ashley Madison data? The data is out there, and let’s assume, for the sake of argument, that there are worthwhile insights potentially held within—is it ethical to access and analyze the data? What you need is a gatekeeper of sorts—someone who can access the data and apply a layer of differential privacy before it gets into your hands. Even so, this remains an ethical grey area. The safest route, of course, is to simply not touch it.
This blog post was written with contributions from Mike Tamir, Chief Scientist and Learning Officer at Galvanize.