How to train Machine Learning models in the cloud using Cloud ML Engine

Chris Rawles
Towards Data Science
5 min read · May 23, 2018


And how to artfully write a task.py using the docopt package

Training ML models in the cloud makes a lot of sense. Why? Among many reasons, it allows you to train on large amounts of data with plentiful compute and perhaps train many models in parallel. Plus it’s not hard to do! On Google Cloud Platform, you can use Cloud ML Engine to train machine learning models in TensorFlow and other Python ML libraries (such as scikit-learn) without having to manage any infrastructure. In order to do this, you will need to put your code into a Python package (i.e. add setup.py and __init__.py files). In addition, it is a best practice to organize your code into a model.py and task.py. In this blog post, I will step you through what this involves.

Submitting a job to ML Engine. The file task.py is the file actually executed by ML Engine and it references the model logic located in model.py.

The task.py file

As a teacher, one of the first things I see students, particularly those newer to Python, get hung up on is creating a task.py file. Although it’s technically optional (see below), it’s highly recommended because it allows you to separate hyperparameters from the model logic (located in model.py). It’s usually the actual file called by ML Engine, and its purpose is twofold:

  1. Reads and parses model parameters, such as the location of the training data, the output location for the trained model, the number of hidden layers, the batch size, etc.
  2. Calls the model training logic located in model.py with said parameters
An example task.py file that parses command line arguments for training data location, batch size, hidden units, and the output location of the final model.
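A minimal sketch of such a file, using argparse (the flag names and the model.train_and_evaluate call are illustrative assumptions, not the article's original gist):

```python
import argparse

import trainer.model as model  # the actual training logic lives in model.py

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--train_data_paths', required=True,
                        help='Location of the training data')
    parser.add_argument('--output_dir', required=True,
                        help='Where to write checkpoints and the exported model')
    parser.add_argument('--batch_size', type=int, default=128)
    parser.add_argument('--hidden_units', default='32,16',
                        help='Comma-separated hidden layer sizes')
    args = parser.parse_args()

    # Hand the parsed hyperparameters to the model logic and start training.
    model.train_and_evaluate(
        train_data_paths=args.train_data_paths,
        output_dir=args.output_dir,
        batch_size=args.batch_size,
        hidden_units=[int(n) for n in args.hidden_units.split(',')],
    )
```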

There are many different ways you can write a task.py file — there are even different names you can give it. In fact the task.py and model.py convention is merely that — a convention. We could have called task.py aReallyCoolArgument_parser.py and model.py very_deeeep_model.py.

We could even combine these two entities into a single file that does argument parsing and trains a model. ML Engine doesn’t care, as long as you arrange your code into a Python package (i.e. it must contain setup.py and __init__.py). But stick with the convention of two files named task.py and model.py inside a trainer folder (more details below) that house your argument parsing and model logic, respectively.

Check out the Cloud ML samples and Cloud ML training repos for full examples of using Cloud ML Engine and examples of model.py and task.py files.

Writing clean task.py files using docopt

Although many people use argparse, the standard Python library for parsing command-line arguments, I prefer to write my task.py files using the docopt package. Why? Because it’s the most concise way to write a documented task.py. In fact, pretty much the only thing you write is your program’s usage message (i.e. the help message) and docopt takes care of the rest. You write the usage message in the module’s doc string (Python stores this as __doc__) and then call docopt(__doc__), which generates an argument parser from the format you specified. Here is the above example using docopt:
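A sketch of what that docopt-based task.py can look like (the flag names, defaults, and the model.train_and_evaluate call are illustrative assumptions, not the article's original gist):

```python
"""Trainer entry point for Cloud ML Engine.

Usage:
  task.py --train_data_paths=<paths> --output_dir=<dir> [--batch_size=<n>] [--hidden_units=<units>]

Options:
  -h --help                   Show this screen.
  --train_data_paths=<paths>  Location of the training data.
  --output_dir=<dir>          Where to write checkpoints and the exported model.
  --batch_size=<n>            Training batch size [default: 128].
  --hidden_units=<units>      Comma-separated hidden layer sizes [default: 32,16].
"""
from docopt import docopt

import trainer.model as model  # the model logic lives in model.py

if __name__ == '__main__':
    arguments = docopt(__doc__)  # parse sys.argv against the usage string above

    # Assign the parsed parameters to model.py variables...
    model.TRAIN_DATA_PATHS = arguments['--train_data_paths']
    model.OUTPUT_DIR = arguments['--output_dir']
    model.BATCH_SIZE = int(arguments['--batch_size'])
    model.HIDDEN_UNITS = [int(n) for n in arguments['--hidden_units'].split(',')]

    # ...then execute training and evaluation (train_and_evaluate is assumed
    # to be defined in model.py).
    model.train_and_evaluate()
```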

Pretty nice, right? Let me break it down. The first block of code is the usage message for your task.py. If you call task.py with no arguments or with invalid arguments, this message is displayed to the user.

The line arguments = docopt(__doc__) parses the usage pattern (“Usage: …”) and option descriptions (lines starting with dash “-”) from the help string and ensures that the program invocation matches the usage pattern.

The final section assigns these parameters to model.py variables and then executes the training and evaluation.

Let’s run a job. Remember that task.py is part of a family of files called a Python package. In practice you will spend the bulk of your time writing the model.py file, a little time creating the task.py file, and the rest is basically boilerplate.

training_example/            # root directory
    setup.py                 # says how to install your package
    trainer/                 # package name, “trainer” is convention
        model.py
        task.py
        __init__.py          # Python convention indicating this is a package

Because we are using docopt, which is not part of the standard library, we must declare it as a dependency by inserting one additional line into setup.py:
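A sketch of a minimal setup.py with that line added (the package name and version are placeholders):

```python
from setuptools import find_packages, setup

setup(
    name='trainer',
    version='0.1',
    packages=find_packages(),
    # The extra line: docopt is not in the standard library, so Cloud ML Engine
    # must install it before running the training job.
    install_requires=['docopt'],
)
```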

This will tell Cloud ML Engine to install docopt by running pip install docopt when we submit a job to the cloud.

Finally, once we have our files in the above structure, we can submit a job to ML Engine. Before we do that, let’s first test our package locally using python -m and then gcloud ml-engine local train. These two steps, while optional, can help you debug and test your package’s functionality before submitting to the cloud. You typically do this on a tiny data set or with a very limited number of training steps.
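For example, a local smoke test might look like this (the data paths and flag values are illustrative placeholders):

```bash
# From the training_example/ root: run the task module directly on a tiny sample.
python -m trainer.task \
    --train_data_paths=./sample/train.csv \
    --output_dir=./trained_model_test \
    --batch_size=32

# Or let gcloud invoke the package the same way ML Engine will.
gcloud ml-engine local train \
    --package-path=$(pwd)/trainer \
    --module-name=trainer.task \
    -- \
    --train_data_paths=./sample/train.csv \
    --output_dir=./trained_model_test
```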

Before you train on the cloud, test your package locally to make sure there are no syntactic or semantic errors.

Once we have tested our model locally, we will submit our job to ML Engine using gcloud ml-engine jobs submit training:
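A sketch of what that submission can look like (the job name, bucket, region, runtime version, and trailing user arguments are illustrative placeholders):

```bash
JOBNAME=trainer_task_$(date -u +%y%m%d_%H%M%S)
OUTDIR=gs://my-bucket/trained_model

gcloud ml-engine jobs submit training $JOBNAME \
    --package-path=$(pwd)/my_model_package/trainer \
    --module-name=trainer.task \
    --region=us-central1 \
    --runtime-version=1.8 \
    --job-dir=$OUTDIR \
    --staging-bucket=gs://my-bucket \
    -- \
    --train_data_paths=gs://my-bucket/data/train.csv \
    --output_dir=$OUTDIR \
    --batch_size=128
```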

These two lines are relevant to our discussion:

--package-path=$(pwd)/my_model_package/trainer \
--module-name=trainer.task

The first line indicates the location of our package, which we always call trainer (a convention). The second line indicates that, within the trainer package, ML Engine should run the task module (task.py).

Conclusion

By building a task.py we can process hyperparameters as command-line arguments, which allows us to decouple our model logic from the hyperparameters. A key benefit is that this lets us easily fire off multiple jobs in parallel with different parameters to determine an optimal hyperparameter set (we can even use the built-in hyperparameter tuning service!). Finally, the docopt package automatically generates a parser for our task.py file based on the usage string that we write.

That’s it! I hope this makes it clear how to submit an ML Engine job and build a task.py. Please leave a clap if you found this helpful so others can find it.
