How to Protect Sensitive Machine-Learning Training Data Without Borking It
Previous columns in this series have presented the problem of privacy in machine learning (ML) and highlighted the real challenge that operational query data poses. That is, if you use an ML system in operation, you most likely face greater data-exposure risk from the queries you feed it than from the training data itself.
By my rough estimate, data accounts for at least 60% of the known machine learning vulnerabilities identified by the Berryville Institute of Machine Learning (BIML). That portion of the risk (the 60%) further splits roughly nine to one between operational data exposure and training data exposure. Training data components make up a minority of data risk in ML, but they are an important minority. The upshot is that we need to expend some real energy mitigating the operational data risks discussed previously, and we also need to consider training data exposure.
Interestingly, everyone in this space seems to be talking only about protecting training data. So why all the fuss there? Don’t forget the ultimate fact about ML: the algorithm that does all the learning is really just an instantiation of the training data in machine-runnable form!
So if your training set contains sensitive data, the machine you construct from that data (using ML) contains sensitive information by definition. If your training set contains biased or regulated data, the machine you construct from it contains biased or regulated information by definition. And if your training set contains confidential company data, the machine you construct from it contains confidential company data by definition. And so on.
The algorithm is the data, and it becomes the data through training.
Clearly, the big focus the ML field puts on protecting training data has some merit. Not surprisingly, one of the main ideas for approaching the training data problem is to adjust the training set so that it no longer directly contains sensitive, biased, regulated, or confidential data. The radical version is simply deleting those data items from your training set. A little less radical, but no less problematic, is the idea of modifying the training data to mask or obfuscate the sensitive, biased, regulated, or confidential items.
Let’s spend some time looking at that.
Data Owners vs. Data Scientists
One of the hardest things to sort out in this new machine learning paradigm is who is taking on what risk. That makes deciding where trust boundaries should be set and enforced a bit difficult. For example, not only do we need to separate and understand operational data and training data as described above, we also need to determine who has (and should have) access to the training data in the first place.
Worse still, whether any of the training data is biased, involves protected classes, or is proprietary, regulated, or otherwise confidential is an even more sensitive question.
First things first. Someone generated the potentially worrying data in the first place and owns it. That party, the data owner, can end up holding a pile of data it is tasked with protecting, such as race information, social security numbers, or pictures of people’s faces.
Most of the time, the data owner is not the same entity as the data scientist who is supposed to use the data to train a machine to do something interesting. This means security personnel must recognize a significant trust boundary between the data owner and the data scientist who trains the ML system.
In many cases, the data scientist needs to be kept away from the “radioactive” training data that the data owner controls. How would that work?
Differential privacy
Let’s start with the worst approach to protecting confidential training data – doing nothing at all. Or maybe even worse, intentionally doing nothing while pretending to do something. To illustrate, consider the facial recognition data Facebook (now Meta) siphoned up from its users over the years. Facebook built a facial recognition system using many images of its users’ faces. Many people think this is a massive privacy issue. (There is also a great deal of genuine concern about how racially biased facial recognition systems can be, but that’s for another article.)
Faced with privacy pressure over its facial recognition system, Facebook built a data transformation system that converts raw facial data (images) into vectors. This system is called Face2Vec, and each face has a unique Face2Vec representation. Facebook then said it deleted all the faces, even as it kept the huge Face2Vec dataset. Note that, mathematically speaking, Facebook did nothing to protect user privacy. Rather, it retained a unique representation of every face.
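To see why that offers no real protection, here is a minimal sketch (not Facebook’s actual system; the `identify` helper, the cosine-similarity matching, and the 0.9 threshold are all hypothetical) of how a retained per-face vector still works as an identifier:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(query_vec, stored_vectors, threshold=0.9):
    """Match a fresh face embedding against the retained vector database.

    `stored_vectors` maps a person's identifier to the embedding kept on file.
    Returns the best-matching identifier, or None if nothing clears the threshold.
    """
    best_name, best_score = None, threshold
    for name, vec in stored_vectors.items():
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```

If a new photo of the same person embeds to a nearby vector, the lookup re-identifies them. The raw images were never needed; the vectors alone do the job.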
One of the most common forms of actually taking action is differential privacy. Simply put, differential privacy aims to protect specific data points by statistically “shrinking” the data so that individual sensitive points are no longer discernible in the data set, yet the ML system still works. The trick is to maintain the performance of the resulting ML system even after the training data has been run through an aggregation and “fuzzification” process. If the data is processed too aggressively, the ML system can no longer do its job.
But if a user of the ML system can determine whether data from a specific individual was included in the original training set (an attack called membership inference), the data wasn’t fuzzed enough. Note that differential privacy of this sort works by editing the sensitive data set itself before training.
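As a concrete illustration of the “fuzzification” idea, here is a minimal sketch of the Laplace mechanism, the textbook building block of differential privacy for simple statistics. The epsilon value, the income data, and the query are all illustrative and not drawn from any particular system.

```python
import numpy as np

def dp_count(values, predicate, epsilon=0.5):
    """Release a noisy count of the records matching `predicate`.

    One person's record changes the true count by at most 1 (sensitivity = 1),
    so Laplace noise with scale 1/epsilon makes the released count
    epsilon-differentially private.
    """
    true_count = float(np.sum(predicate(values)))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Illustrative query: how many training records report an income above 100k?
np.random.seed(0)
incomes = np.random.normal(70_000, 20_000, size=10_000)
print(dp_count(incomes, lambda v: v > 100_000, epsilon=0.5))
```

The noise hides any single individual’s contribution to the released statistic, which is the property that bounds what membership inference can learn from it.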
One studied – and commercialized – system involves adapting the training process itself to mask sensitivities in a training dataset. The core of the approach is to use the same kind of mathematical transformation at training time and inference time to protect against disclosure of sensitive data (including membership inference).
Based on the mathematical idea of mutual information, this approach adds Gaussian noise only to non-beneficial features, leaving the data set obfuscated but its inferential power intact. The core of the idea is to create an internal representation that is obfuscated at the level of the sensitive features.
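Here is a toy, dataset-level sketch of that principle; it is not the commercialized system (which operates on internal representations at both training and inference time), and the mutual-information threshold and noise scale are illustrative knobs. The point it shows is simply “add noise only where it doesn’t hurt the task.”

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def obfuscate_low_value_features(X, y, mi_threshold=0.01, noise_scale=1.0):
    """Return a copy of X with Gaussian noise added to non-beneficial features.

    Features whose estimated mutual information with the label falls below
    `mi_threshold` are treated as non-beneficial and drowned in noise,
    leaving the task-relevant features (and hence model accuracy) intact.
    """
    mi = mutual_info_classif(X, y, random_state=0)
    X_obf = X.astype(float).copy()
    low_value = mi < mi_threshold
    noise = np.random.normal(0.0, noise_scale, size=X_obf[:, low_value].shape)
    X_obf[:, low_value] += noise
    return X_obf
```

A data owner could run something like this before handing the training set across the trust boundary, rather than relying on the data scientist to handle the raw data carefully.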
One cool thing about purposeful feature obfuscation is that it can help protect a data owner from the data scientists by maintaining the trust boundary that often exists between them.
Build security in
Does this solve the problem of sensitive training data? Not at all. The challenge of every new field remains: the people who design and deploy ML systems must build security in. In this case, that means recognizing and mitigating the risks of training data sensitivity as they build their systems.
The time is now. If we construct a whole bunch of ML systems with huge data disclosure risks built in, then we get what we asked for: another security disaster.