Deploying ML workloads with Azure and Kubernetes
05 Mar 2018
Today we’ll be exploring how to deploy Machine Learning workloads to a Kubernetes cluster running on the Microsoft Azure cloud. The focus here is not a deep dive into the ML methods used, but rather the infrastructure considerations when deploying these types of workloads.
I’m currently looking for my next DevOps / cloud infrastructure automation gig - this time focusing on Machine Learning - so it seemed like a good time to try out the technologies used by many ML shops. For a video on how ML companies are using Kubernetes for their workloads, check out Building the Infrastructure that Powers the Future of AI.
This post will cover the basics of pushing ML jobs onto a Kubernetes cluster, but won’t cover more advanced topics like using GPU instances or autoscaling based on workload. It’s a long and complex process, so I’ve kept the writing terse to keep the post from becoming much too long.
My environment is an Ubuntu Xenial VM with Docker, X Windows, and a web browser. If you don’t have one of these you can spin up an Ubuntu Xenial VM in Azure. The steps are the same for other flavors of Linux, but the command lines will differ.
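If Docker isn’t already installed on your VM, the stock Xenial package is enough for this walk-through (docker.io is the Ubuntu-packaged Docker; substitute Docker’s own repository if you want a newer release):
sudo apt-get update
sudo apt-get install -y docker.io
sudo usermod -aG docker $USER   # log out and back in for the group change to apply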
All of my Dockerfile and Kubernetes configurations are available here: https://github.com/gregretkowski/kube-es-starter .
Intro to Evolution Strategies and Roboschool
The ML task we’ll be tackling today is to use Evolution Strategies to teach a bipedal robot to walk. A bit about Evolution Strategies from OpenAI:
We’ve discovered that evolution strategies (ES), an optimization technique that’s been known for decades, rivals the performance of standard reinforcement learning (RL) techniques on modern RL benchmarks (e.g. Atari/MuJoCo), while overcoming many of RL’s inconveniences.
My main interest is that ES is a learning strategy well suited to being deployed as a distributed system, which makes it a good showcase for deploying onto a Kubernetes cluster. It was straightforward to take the Evolution Strategies Starter Code and, with a few modifications, adapt it into the experiment running on our cluster.
The environment our ES will train on is the Roboschool Humanoid task. Roboschool is a set of environments from OpenAI which replicate the MuJoCo environments but use an open source physics engine, Bullet3. Because of the cost of MuJoCo licenses - the cheapest license that lets anyone run it within a container is $2000 - the availability of Roboschool opens up experimentation with physics environments to a much wider audience.
What our implementation will look like
We will set up a Kubernetes cluster running in Azure. Our Kubernetes job will have a single master and many workers. Master and workers will run from the same Docker container image, differentiated by running a different startup script. The master will provide a Redis service for communicating to the workers. Result snapshots will be written out by the master to an Azure file share.
Working with a local container
The first step is to get things working locally. Creating and debugging our docker container locally allows us to quickly iterate. Once we have it working locally we can move onto deploying it at scale in a cloud service.
Creating the container
The workflow I’ve found most effective is to simply start with a base container in an interactive shell, and then run each command to set up your desired environment. I copy/paste the working command lines into my notes and use those notes to create the Dockerfile.
docker run -i -t ubuntu:xenial /bin/bash
> root@5fd5a12008ab:/#
apt-get update
apt-get dist-upgrade
apt-get install python3 python3-pip git build-essential
...
I ended up with a lengthy Dockerfile. You can read along with the Dockerfile on GitHub. The Dockerfile contains the following:
- Start with an Ubuntu Xenial container
- Update the OS and install build tools
- Install the Roboschool dependencies
- Install the Bullet3 physics engine
- Finish Roboschool installation
- Install an X11 virtual framebuffer
- Install evolution-strategies-starter & dependencies
- Install and configure redis, needed for the ES implementation
- Add our task definition JSON file, humanoid.json
- Add startup scripts for master and worker nodes
Here’s a small part of the Dockerfile, setting up the ES starter:
# Install prereqs & evolution-strategies-starter
RUN apt-get -y install tmux screen vim
RUN apt-get -y install python3-click python3-redis python3-h5py
RUN pip3 install tensorflow==0.12.0
RUN git clone https://github.com/openai/evolution-strategies-starter.git
# Make it Roboschool compatible.
RUN sed -i "s|^ import gym$| import gym, roboschool|g" /evolution-strategies-starter/es_distributed/es.py
RUN sed -i "s|^ import gym$| import gym, roboschool|g" /evolution-strategies-starter/scripts/viz.py
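The tail end of the Dockerfile adds the task definition and the two startup scripts. Roughly sketched below - the repo’s Dockerfile is authoritative and its exact lines may differ:
# Add the experiment definition and the startup scripts (sketch)
COPY humanoid.json /humanoid.json
COPY master-start.sh /master-start.sh
COPY worker-start.sh /worker-start.sh
RUN chmod +x /master-start.sh /worker-start.sh
# Workers will reach the master's redis on this port
EXPOSE 6379
WORKDIR /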
Once we’ve finished editing our Dockerfile, we’ll build our docker image like so:
docker build -t gretkowski/kubeevo:v1 -f Dockerfile .
Testing and debugging the docker image locally
To test locally we’ll use docker to set up a test network, and then launch one master and one worker container. Chris Sainty has a good rundown of how to connect docker containers. We’ll use those instructions to create a docker network and connect our two containers over that network.
# start the master.
mkdir -p logdir
docker network create --driver=bridge testnet
docker run --net=testnet --name redis -i -t \
-v `pwd`/logdir:/logdir gretkowski/kubeevo:v1 ./master-start.sh
# In a separate terminal window
docker run --net=testnet --name worker -i -t \
gretkowski/kubeevo:v1 ./worker-start.sh
With the containers running we should be able to see the worker connect to the redis server on the master, pick up tasks from the master, and push results back to it.
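A few quick sanity checks can confirm this (container names as used above; the redis-cli check assumes the redis tools are present in the image):
docker ps                          # both 'redis' and 'worker' should show as Up
docker logs worker | tail -n 20    # worker output should reference the master
docker exec redis redis-cli ping   # should answer PONG if redis is listening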
Showing results
We can even use the container we created to show the results of our experiments. The docker container can display visualizations based on the latest checkpoint file, as long as we allow the container to open a window on our X11 display.
# NOTE - this opens up your X display to the world!
xhost +
docker run --net=testnet -i -t -v `pwd`/logdir:/logdir --rm \
-e DISPLAY=$DISPLAY -v /tmp/.X11-unix:/tmp/.X11-unix \
gretkowski/kubeevo:v1 /bin/bash
# Should now be in the container
ls /logdir/es_master*
# Note the name of the newest snapshot file for the next command...
python3 -m scripts.viz RoboschoolHumanoid-v1 \
/logdir/es_master_201803051820/snapshot_iter00020_rew100.h5 \
--stochastic
Once we are satisfied that everything is working as we expect, we’ll clean up our docker environment.
docker rm redis
docker rm worker
docker network rm testnet
Setting up the Azure Kubernetes cluster
The first step is to get an Azure account. It’s easy to do, but you will need a credit card.
- Get the trial account; right now you get $200 in free credit.
- Upgrade it to the pay-as-you-go account unless you only want to use a few very basic VMs. You need to upgrade to get access to better VMs.
Once you have your account established, it’s a good time to familiarize yourself with the Microsoft Azure Portal.
Getting your Kubernetes cluster set up
The next step is to set up our Kubernetes cluster within Azure. For more verbose instructions check out the Azure walk-through for setting up Kubernetes.
First we’ll install the Azure CLI tool az, and then install the Kubernetes CLI tool kubectl. If you are running something other than Xenial, check out Installing the Azure CLI for instructions for your platform.
AZ_REPO=$(lsb_release -cs)
echo "deb [arch=amd64] https://packages.microsoft.com/repos/azure-cli/ $AZ_REPO main" | \
tee /etc/apt/sources.list.d/azure-cli.list
apt-key adv --keyserver packages.microsoft.com --recv-keys \
52E16F86FEE04B979B07E28DB02C46DF417A0893
apt-get install apt-transport-https
apt-get update && apt-get install azure-cli
We’ll verify the tools are working by first logging in to Azure (requires a web browser) and then running an az acr list command:
az login
az acr list
We need to create a resource group to contain all of our Kubernetes resources. This also makes it easier later to delete everything in one go, rather than having orphaned resources racking up your bill.
az group create --name demoKubeGroup --location westus2
az acs create --orchestrator-type kubernetes \
  --resource-group demoKubeGroup --name demoKubeCluster \
--agent-count 1 --generate-ssh-keys
Next we’ll get kubectl installed and configured to talk to our cluster:
az acs kubernetes install-cli
az acs kubernetes get-credentials --resource-group demoKubeGroup \
--name demoKubeCluster --file `pwd`/kubeconfig
export KUBECONFIG=`pwd`/kubeconfig
kubectl top node
Setting up a private container registry
We need a place to put our docker images from which Kubernetes can retrieve them when deploying. To do this we’ll deploy a private container registry within Azure.
az acr create --resource-group demoKubeGroup \
--name demoKubeRegistry --sku Basic
az acr login --name demoKubeRegistry
We’ll try to push an image into the registry, just to verify everything is working.
docker pull microsoft/aci-helloworld
docker tag microsoft/aci-helloworld demokuberegistry.azurecr.io/aci-helloworld:v1
docker push demokuberegistry.azurecr.io/aci-helloworld:v1
az acr repository list --name demoKubeRegistry --output table
Pushing your container image into the registry
Now is a good time to push our container to the registry. To do that we will tag the container with the remote registry hostname, and then push the container.
docker tag gretkowski/kubeevo:v1 demokuberegistry.azurecr.io/kubeevo:v1
docker push demokuberegistry.azurecr.io/kubeevo:v1
az acr repository list --name demoKubeRegistry --output table
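One caveat: depending on how the cluster’s service principal was created, Kubernetes may not automatically be allowed to pull from the private registry. If pods later fail with image pull authorization errors, one workaround is to enable the registry’s admin user and hand its credentials to Kubernetes as an image pull secret (the secret name acr-auth below is arbitrary, and the deployments would then need to reference it via imagePullSecrets):
# Only needed if image pulls from the private registry are denied
az acr update --name demoKubeRegistry --admin-enabled true
az acr credential show --name demoKubeRegistry
kubectl create secret docker-registry acr-auth \
  --docker-server demokuberegistry.azurecr.io \
  --docker-username demoKubeRegistry \
  --docker-password <PASSWORD_FROM_PREVIOUS_COMMAND> \
  --docker-email you@example.com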
Azure file service setup
To get results out of our experiment we need our master node to write out checkpoints to a place which is easy for us to access. To accomplish this we will have our master node attach a volume which maps to an Azure File Share. We can then download checkpoint files by browsing through the Azure console.
The implementation is a mix of the one provided in Francisco Beltrao’s post and Karim Vaes’ post.
az group list --output table
# Note the long one associated with your Kube cluster
az storage account create \
--resource-group demoKubeGroup_demoKubeCluster_westus2 \
--name kubeevologdir --location westus2 --sku Standard_LRS
az storage account show-connection-string \
--resource-group demoKubeGroup_demoKubeCluster_westus2 \
--name kubeevologdir
az storage share create --name kubeevologdir \
--connection-string "THE_CONNECTION_STRING_FROM_THE_LAST_CMD"
az storage account keys list \
--resource-group demoKubeGroup_demoKubeCluster_westus2 \
--account-name kubeevologdir --query "[0].value" -o tsv
# Note the 'storage key' for the next commands.
# These values go into secrets/kubeevologdir.yaml
echo -n kubeevologdir | base64
echo -n {enter the storage key} | base64
# Now create our Kubernetes resources
kubectl create -f secrets/kubeevologdir.yaml
kubectl create -f pv/kubeevologdir.yaml
kubectl create -f pvc/kubeevologdir.yaml
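For reference, here is roughly what the secret and persistent volume we just created look like. This is a sketch with placeholder values - the real files live in the repo. The azurestorageaccountname/azurestorageaccountkey keys are what the Kubernetes azureFile volume type expects, and the PV ties that secret to the file share by name:
# secrets/kubeevologdir.yaml (sketch)
apiVersion: v1
kind: Secret
metadata:
  name: kubeevologdir
type: Opaque
data:
  azurestorageaccountname: <BASE64_ACCOUNT_NAME_FROM_ABOVE>
  azurestorageaccountkey: <BASE64_STORAGE_KEY_FROM_ABOVE>
---
# pv/kubeevologdir.yaml (sketch)
apiVersion: v1
kind: PersistentVolume
metadata:
  name: kubeevologdir
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteMany
  azureFile:
    secretName: kubeevologdir
    shareName: kubeevologdir
    readOnly: false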
Now kubectl get persistentvolumes and kubectl get persistentvolumeclaims should indicate the volume is ready to use.
Starting up our ‘master’ container
Read through the definition of our master, master.yaml (a rough sketch appears after this list). Some of the highlights:
- It uses the container image we uploaded to our private registry
- It opens port 6379 for Redis
- It mounts the kubeevologdir volume on /logdir
- The startup command is the ./master-start.sh script
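Here’s that sketch - the real deployments/master.yaml in the repo is authoritative, and the apps/v1beta1 API version simply reflects the Kubernetes releases ACS shipped at the time:
# deployments/master.yaml (sketch)
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: master
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: master
    spec:
      containers:
      - name: master
        image: demokuberegistry.azurecr.io/kubeevo:v1
        command: ["./master-start.sh"]
        ports:
        - containerPort: 6379
        volumeMounts:
        - name: logdir
          mountPath: /logdir
      volumes:
      - name: logdir
        persistentVolumeClaim:
          claimName: kubeevologdir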
We’ll start up our master, and monitor its progress:
kubectl create -f deployments/master.yaml
kubectl get pods
kubectl describe pods master
Hopefully you end up with a pod in the Running state. If not, the output of the describe argument will often give some indication of why the pod failed to start.
You can check the logs from the running container using the logs argument and the pod name, e.g. kubectl logs master-1285880246-1x227. To log in to the pod and debug something locally, use kubectl exec -it master-1285880246-1x227 -- /bin/bash.
Starting our workers
The master is running Redis, but to make that available to workers we need to tell Kubernetes that it’s a ‘service’ - which creates a DNS name the workers can reference.
kubectl create -f services/redis.yaml
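A sketch of what that service definition contains (the repo’s services/redis.yaml is authoritative; the selector has to match the labels on the master pod):
# services/redis.yaml (sketch)
apiVersion: v1
kind: Service
metadata:
  name: redis
spec:
  selector:
    app: master
  ports:
  - port: 6379
    targetPort: 6379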
Next we can start up a couple of workers:
kubectl create -f deployments/workers-2.yaml
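The worker deployment is nearly identical to the master’s - same image, the worker startup script instead, more replicas, and no volume. A sketch (again, the repo’s deployments/workers-2.yaml is the real definition):
# deployments/workers-2.yaml (sketch)
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: worker
spec:
  replicas: 2
  template:
    metadata:
      labels:
        app: worker
    spec:
      containers:
      - name: worker
        image: demokuberegistry.azurecr.io/kubeevo:v1
        command: ["./worker-start.sh"]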
Using kubectl logs should show the workers connecting to the master and pulling down tasks:
kubectl logs worker-3207115605-v0lwh
...
[2018-03-06 23:06:43,344 pid=34] [relay] Received task b'0'
Scaling up the cluster & adding more workers
Next lets scale up the number of nodes in our Kubernetes cluster, and then add several more workers. When we created the cluster we only added a single agent. We then only launched 2 worker containers. Here we’ll scale up to have a total of 4 agents, and 8 workers.
az acs show --name demoKubeCluster --resource-group demoKubeGroup
az acs scale --name demoKubeCluster --resource-group demoKubeGroup --new-agent-count 4
kubectl replace -f deployments/workers-8.yaml
kubectl get pods
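As an aside, if you’d rather not keep a separate YAML file per worker count, the same change can be made imperatively - assuming the deployment is named worker as in the sketch above:
kubectl scale deployment worker --replicas=8
kubectl get pods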
Checking out progress, and grabbing results
If you run kubectl logs for your master you should see the experiment chugging along…
********** Iteration 2 **********
[2018-03-06 23:12:18,352 pid=29] Skipped 25 out of date results (1.15%)
----------------------------------
| EpRewMean | 32.7 |
| EpRewStd | 13.1 |
| EpLenMean | 32.7 |
| EvalEpRewMean | 44.2 |
| EvalEpRewStd | 14.6 |
| EvalEpLenMean | 39.3 |
| EvalPopRank | 0.746 |
| EvalEpCount | 6 |
| Norm | 501 |
| GradNorm | 0.701 |
| UpdateRatio | 0.0737 |
| EpisodesThisIter | 1e+04 |
| EpisodesSoFar | 3.04e+04 |
| TimestepsThisIter | 3.27e+05 |
| TimestepsSoFar | 8.39e+05 |
| UniqueWorkers | 8 |
| UniqueWorkersFrac | 0.00368 |
| ResultsSkippedFrac | 0.0115 |
| ObCount | 3.69e+03 |
| TimeElapsedThisIter | 83.7 |
| TimeElapsed | 765 |
----------------------------------
********** Iteration 3 **********
The experiment will write out checkpoint files every 20 iterations. The most expedient way to access the files is to browse through the Azure Portal: go to ‘Storage Accounts’, then drill into kubeevologdir, ‘files’, kubeevologdir.
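If you prefer the command line, the checkpoint files can also be listed and downloaded with the storage CLI, reusing the connection string from earlier (the snapshot path shown is just an example - yours will differ):
az storage file list --share-name kubeevologdir \
  --connection-string "THE_CONNECTION_STRING_FROM_EARLIER" --output table
az storage file download --share-name kubeevologdir \
  --path es_master_XXXXXXXXXXXX/snapshot_iterYYYYY_rewZZZ.h5 --dest ./logdir/ \
  --connection-string "THE_CONNECTION_STRING_FROM_EARLIER"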
You can then take those h5 checkpoint files and use the scripts.viz command we used previously to see how well the experiment is progressing toward learning a solution.
And finally, saving our bank balance
After we have finished with our Kubernetes cluster, we’ll be sure to clean it up so it doesn’t continue to rack up big charges for VMs and other Azure resources. A big plus is that Azure keeps all our resources in a resource group, so to clean everything up we can simply delete that group. (If az group list still shows the auto-generated group from earlier, e.g. demoKubeGroup_demoKubeCluster_westus2, check it for leftover resources such as our storage account and delete it too.)
az group delete --name demoKubeGroup