We are continuing our series of articles with practical suggestions on how to build machine learning models without compromising user privacy. In the first article we described how to set up a data access protocol, and one of our tips was to automate every step of that protocol.
Today, we’d like to offer some practical tips to achieve this type of automation. Specifically, we focus on how to automate machine learning training.
In a privacy-aware setting, the data scientist does not have direct access to the data used to train and evaluate their machine learning models. At best, they have access to aggregate statistics over the training and evaluation sets. At a minimum, though, they can specify the criteria by which samples are admitted into the training dataset and the evaluation dataset.
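To make this concrete, here is a minimal sketch of what such admission criteria might look like. The field names (`region`, `consented`) and the exact-match criteria syntax are illustrative assumptions, not a real specification format:

```python
# Sketch: a dataset specification as a set of admission criteria.
# Field names and the exact-match semantics are hypothetical.

def admit(sample, criteria):
    """Return True if a sample satisfies every admission criterion."""
    return all(sample.get(field) == value for field, value in criteria.items())

def build_split(samples, criteria):
    """Select the samples admitted into a dataset split."""
    return [s for s in samples if admit(s, criteria)]

samples = [
    {"region": "eu", "consented": True, "x": 0.1},
    {"region": "us", "consented": True, "x": 0.7},
    {"region": "eu", "consented": False, "x": 0.3},
]

# The data scientist submits criteria like these, never the data itself.
train_spec = {"region": "eu", "consented": True}
train_set = build_split(samples, train_spec)
```

The key point is that the data scientist only ever authors the `train_spec` dictionary; the filtering runs where the data lives.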
In such a setup, having specified how the relevant datasets should be constructed from the available data sources, the data scientist delivers a base model to a remote execution environment. That environment stages the specified datasets, updates the parameters of the base model using the training set, and evaluates the updated model using the evaluation set.
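The remote environment's job can be sketched as a single function: take base model parameters, run the update, and return only the updated parameters and an aggregate metric. The one-dimensional linear model and plain SGD below are stand-ins for whatever real training loop you use:

```python
# Sketch: what the remote execution environment does with a job.
# A 1-D linear model y = w * x and one SGD pass stand in for real training.

def sgd_update(weights, train_set, lr=0.1):
    """One pass of SGD on squared error for a 1-D linear model."""
    w = weights
    for x, y in train_set:
        grad = 2 * (w * x - y) * x
        w -= lr * grad
    return w

def evaluate(weights, eval_set):
    """Mean squared error over the evaluation set (an aggregate statistic)."""
    return sum((weights * x - y) ** 2 for x, y in eval_set) / len(eval_set)

def run_job(base_weights, train_set, eval_set):
    """Return only updated parameters and a metric, never raw samples."""
    updated = sgd_update(base_weights, train_set)
    return updated, evaluate(updated, eval_set)

train = [(1.0, 2.0), (2.0, 4.0)]
holdout = [(3.0, 6.0)]
new_w, mse = run_job(0.0, train, holdout)
```

Notice that `run_job` is the privacy boundary: everything the data scientist gets back is derived, aggregate output.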
The problem you have to solve to achieve such a setup is communication between the data scientist and the remote environment. Bare-bones automation requires the remote execution environment to expose a limited API that the data scientist can call to submit dataset specifications and base models, and to trigger training and evaluation jobs.
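The shape of that limited API might look something like the sketch below. The method names, payload shapes, and return values are all hypothetical; in practice these would be HTTP endpoints rather than an in-process class:

```python
# Sketch: a hypothetical minimal API surface for the remote environment.
# Method names and payload shapes are illustrative assumptions.

class RemoteTrainingAPI:
    def __init__(self):
        self._specs = {}
        self._models = {}
        self._jobs = []

    def submit_dataset_spec(self, name, spec):
        """Register a named dataset specification."""
        self._specs[name] = spec
        return {"status": "accepted", "dataset": name}

    def submit_base_model(self, name, artifact):
        """Register a serialized base model."""
        self._models[name] = artifact
        return {"status": "accepted", "model": name}

    def trigger_job(self, model_name, train_spec, eval_spec):
        """Queue a training-and-evaluation job over registered artifacts."""
        if model_name not in self._models:
            return {"status": "error", "reason": "unknown model"}
        if train_spec not in self._specs or eval_spec not in self._specs:
            return {"status": "error", "reason": "unknown dataset spec"}
        job_id = len(self._jobs)
        self._jobs.append((model_name, train_spec, eval_spec))
        return {"status": "queued", "job_id": job_id}

api = RemoteTrainingAPI()
api.submit_dataset_spec("train-v1", {"consented": True})
api.submit_dataset_spec("eval-v1", {"consented": True, "holdout": True})
api.submit_base_model("churn-base", b"...serialized weights...")
receipt = api.trigger_job("churn-base", "train-v1", "eval-v1")
```

Keeping the surface this small is deliberate: three verbs are enough for the data scientist to do their job, and nothing in the API ever returns raw data.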
Currently, your best bet is to implement such an API yourself. Some open-source tools can help: MLflow is quite popular (it was recently accepted into LFAI) and can help you manage your models, and, depending on where your data is stored, database and cloud storage APIs can help you manage and stage your datasets.
However, you are currently on your own if you want to trigger execution based on updates to these APIs. MLflow is, for the most part, just a model registry, and every data source has its own query semantics and execution mechanisms. This situation will hopefully change in the coming months. On the model side, Databricks has openly stated that webhook support is on the roadmap for the MLflow model registry. On the data side, standards like GraphQL provide a clean, transparent means of specifying datasets, and frameworks like Apollo make it easy to expose a GraphQL API over an existing data source. It's only a matter of time before somebody creates a batteries-included framework for remote training and evaluation.
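To illustrate why GraphQL is attractive here, a dataset specification could be written as a query like the one below. The schema (the `samples` field and its arguments) is entirely invented for illustration; a real schema would be whatever your Apollo (or other GraphQL) server exposes over the data source:

```python
# Sketch: a dataset specified as a GraphQL query.
# The schema (fields, argument names) is a hypothetical illustration.

TRAIN_SET_QUERY = """
query TrainingSet {
  samples(consented: true, region: "eu", before: "2020-01-01") {
    features
    label
  }
}
"""

# In practice this query would be POSTed to the GraphQL endpoint of the
# remote environment, e.g. as {"query": TRAIN_SET_QUERY}, and the server
# would stage the matching samples without ever returning them.
```

The appeal is that the query is both human-readable and machine-checkable against the schema, so the remote environment can validate a dataset specification before running anything.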
If you know of a tool that makes it easy for data scientists to run training jobs in remote environments, please leave a comment below.