Data privacy is one of the hottest and most actively researched topics in machine learning. This is not surprising: almost every month we hear about data breaches at the largest tech companies and enterprises. Recent examples include Facebook’s leak of 540 million records and the exposure of 7.5 million Adobe customers’ data.
In this series of blog posts, we will offer some practical tips that data scientists can use to protect the privacy of their users.
In this post, we suggest using data access protocols. These are similar to the protocols used in clinical research trials, which govern who has access to patient data and how they are going to use it.
A data access protocol describes the series of steps you will take to access and handle data, with justification for each step.
Every step in your protocol should come with code and be reproducible. Bonus points if you can put each of the steps in the protocol into a continuous integration pipeline so that each step of the protocol triggers once its predecessors have completed.
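As a minimal sketch of this idea, assuming Python 3.9 or later, a protocol can be modelled as a set of steps, each carrying a justification and a list of predecessors, and executed in dependency order. The `Step` class, the `run_protocol` function, and the step names are all hypothetical and chosen for illustration; in practice a CI pipeline would take over the scheduling that the loop does here.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Callable


@dataclass
class Step:
    """One step of a data access protocol (hypothetical structure)."""
    name: str
    justification: str               # why this step is needed
    action: Callable[[], None]       # the reproducible code for the step
    depends_on: list[str] = field(default_factory=list)


def run_protocol(steps: list[Step]) -> list[str]:
    """Run each step only after its predecessors; return the order used."""
    by_name = {s.name: s for s in steps}
    for s in steps:
        if not s.justification.strip():
            raise ValueError(f"step {s.name!r} has no justification")
        unknown = set(s.depends_on) - set(by_name)
        if unknown:
            raise ValueError(f"step {s.name!r} depends on unknown steps: {sorted(unknown)}")
    # Map each step to the set of steps that must finish before it.
    graph = {s.name: set(s.depends_on) for s in steps}
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        by_name[name].action()
    return order


# Illustrative steps; the lambdas stand in for real extraction and training code.
steps = [
    Step("train_model", "The model is the deliverable", lambda: None, ["deidentify"]),
    Step("extract", "Only the columns named in the protocol are needed", lambda: None),
    Step("deidentify", "PII must be removed before training", lambda: None, ["extract"]),
]
print(run_protocol(steps))  # ['extract', 'deidentify', 'train_model']
```

Refusing to run a step that lacks a justification is a design choice of this sketch: it makes the written rationale a precondition of execution rather than an afterthought.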
Creating a data access protocol might seem like unnecessary red tape, but it is invaluable when you’re working with sensitive data that contains PII (personally identifiable information) or PHI (protected health information). The data access protocol should make explicit:
- What you’re trying to achieve
- What data you’re going to use to achieve it
- How that data is going to be handled
- What information you’re going to extract from the data that you access
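One way to keep these four points explicit is to record them in a small structured object that can be checked before any data is touched. The sketch below assumes Python 3.9 or later; the `DataAccessProtocol` class, its field names, and the example values are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DataAccessProtocol:
    """The four things a protocol should make explicit (hypothetical layout)."""
    objective: str               # what you're trying to achieve
    datasets: tuple[str, ...]    # what data you're going to use
    handling: str                # how that data is going to be handled
    outputs: tuple[str, ...]     # what information you're going to extract

    def missing_fields(self) -> list[str]:
        """Return the names of fields that were left empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Illustrative values only; the outputs field is deliberately left empty.
protocol = DataAccessProtocol(
    objective="Predict 30-day readmission risk",
    datasets=("admissions table",),
    handling="Processed only on the project VM; no local copies",
    outputs=(),
)
print(protocol.missing_fields())  # ['outputs']
```

A check like `missing_fields()` could run as the first step of the protocol, so that an incomplete description stops the work before it starts.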
Always make it easy to perform a security review of the steps executed under a given protocol. If one step in your protocol involves training a machine learning model on a VM, make that VM or an image of it available for security review for a limited amount of time. If another step involves publishing that model for use in an API, keep the published model available for evaluation even after it’s been superseded by a better one.
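To make the review window concrete, each artifact a step produces can be registered with a content digest and the date until which it stays available. This is only a sketch under stated assumptions: `ReviewRecord`, `register_artifact`, `is_reviewable`, the artifact name, and the 30-day window are all hypothetical, and a real setup would hash the actual VM image or model file rather than an in-memory byte string.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ReviewRecord:
    """What a reviewer needs to locate and verify an artifact (hypothetical)."""
    step: str
    artifact: str                # e.g. a VM image ID or model file name
    sha256: str                  # digest of the artifact contents
    available_until: datetime    # end of the security review window


def register_artifact(step: str, artifact: str, content: bytes,
                      created: datetime, review_days: int) -> ReviewRecord:
    """Record an artifact's digest and how long it stays available for review."""
    digest = hashlib.sha256(content).hexdigest()
    return ReviewRecord(step, artifact, digest, created + timedelta(days=review_days))


def is_reviewable(record: ReviewRecord, now: datetime) -> bool:
    """True while the artifact is still inside its review window."""
    return now <= record.available_until


created = datetime(2024, 1, 1, tzinfo=timezone.utc)
record = register_artifact("train_model", "vm-image-001", b"image bytes",
                           created, review_days=30)
print(record.available_until.date())  # 2024-01-31
print(is_reviewable(record, datetime(2024, 1, 15, tzinfo=timezone.utc)))  # True
```

Storing the digest alongside the deadline lets a reviewer confirm that the artifact they are inspecting is the one the step actually produced.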
Modern project management and continuous integration tools are a great aid when automating such protocols. Project management tools like Jira and CI/CD platforms like GitHub Actions, CircleCI, and Travis CI provide the features you need to automate the steps in most data access protocols. Learning how to use these tools is time well spent.
In upcoming articles, we will cover how to set up secure communication with your execution environments and how to choose data formats with an eye to security audits. Stay tuned!