Azure GPUs, Jupyter and Machine Learning

I’m a big advocate of the cloud and its ability to provide just enough resources ad hoc. You can use whatever you want and pay for it only while you are using it. In machine learning there are services such as Google’s ML Engine or Azure’s upcoming Batch AI, but during development, data preprocessing and so on, you sometimes want an immediate, iterative workflow. In these cases you can’t go past a Jupyter notebook, here running on a VM. In this post I’ll outline how I’ve set up such an environment in Azure, focusing on the ability to build it up and tear it down via the CLI, using a cheaper VM during development, and making an easy jump to a GPU machine once things are running smoothly.

Azure VMs

For data science/ML there can be a lot of dependencies. Once you move to a GPU machine there are also CUDA and the other GPU installs to take care of. Luckily Microsoft publishes a Data Science Virtual Machine image with all of this preinstalled, and we will use it as our image. If you don’t have the Azure CLI, you can follow the install guide to get it onto your local machine. With that in place, we will create a few other resources: a resource group, a storage account and a file share.

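RG_NAME=batch-rg
# Exported so that the az storage commands below can read the account name from the environment
export AZURE_STORAGE_ACCOUNT=myaccount

az group create -l westus2 -n $RG_NAME

az storage account create -g $RG_NAME -n $AZURE_STORAGE_ACCOUNT -l westus2 --sku Standard_LRS

# Take the first key; --query with -o tsv does not depend on the configured default output format
export AZURE_STORAGE_KEY=$(az storage account keys list --account-name ${AZURE_STORAGE_ACCOUNT} --resource-group ${RG_NAME} --query "[0].value" -o tsv)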

az storage share create \
    --name machinelearning

Next we will create our VM, open port 8888 for Jupyter and schedule it to shut down at 7pm local time. One thing to note is the size parameter: we are using Standard_NC6 here for our GPU machine, but I changed this to Standard_DS1_v2 for the other machine I use for development/preprocessing.

VM_NAME=my-name
az vm create \
    --name "${VM_NAME}-gpu-vm" \
    --admin-username $USER \
    --resource-group $RG_NAME \
    --image microsoft-ads:linux-data-science-vm-ubuntu:linuxdsvmubuntu:latest \
    --size Standard_NC6 \
    --ssh-key-value ~/.ssh/id_rsa.pub \
    --storage-sku Standard_LRS \
    --public-ip-address-dns-name "${VM_NAME}-gpu-vm" \
    --nsg "${VM_NAME}-gpu-nsg" \
    --vnet-name "${RG_NAME}-gpu-vnet" \
    --os-disk-name "${VM_NAME}-gpu-osdisk"

az vm open-port \
    --name "${VM_NAME}-gpu-vm" \
    --resource-group $RG_NAME \
    --port 8888
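
Since the GPU and development machines differ only in their size (and name), one way to capture the switch is a small helper. This is a minimal sketch, not the exact script used for this post: the function name vm_size and the gpu/dev mode names are made up for illustration, while the two sizes are the ones mentioned above.

```shell
# Hypothetical helper: map a mode name to the VM size used in this post.
vm_size() {
    case "$1" in
        gpu) echo "Standard_NC6" ;;      # GPU machine for training
        dev) echo "Standard_DS1_v2" ;;   # cheaper machine for development/preprocessing
        *)   echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

MODE=dev   # or gpu
VM_SIZE=$(vm_size "$MODE")
echo "Creating ${MODE} VM with size ${VM_SIZE}"
# Then pass --size "$VM_SIZE" (and a matching name suffix) to az vm create.
```

Keeping the mapping in one place means the rest of the az vm create call stays identical for both machines.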

schedule-shutdown()
{
    # Look up the VM's full resource ID; --query id -o tsv avoids parsing the default output format
    VM_RESOURCE_ID=$(az resource show -g "${RG_NAME}" --resource-type Microsoft.Compute/virtualMachines -n "${1}" --query id -o tsv)
    SHUTDOWN_PROPERTIES='{ "taskType": "ComputeVmShutdownTask", "dailyRecurrence": { "time": "1900" }, "timeZoneId": "W. Australia Standard Time", "targetResourceId": "'${VM_RESOURCE_ID}'" }'
    # Create the auto-shutdown schedule by its raw resource type
    az resource create \
        -l westus2 \
        -n "shutdown-computevm-${1}" \
        -g "${RG_NAME}" \
        -p "${SHUTDOWN_PROPERTIES}" \
        --resource-type microsoft.devtestlab/schedules
}

schedule-shutdown "${VM_NAME}-gpu-vm"

The schedule-shutdown function was the worst part of this, as at the time of writing there was no parameter to enable auto shutdown and no dedicated CLI command for it. We are creating a resource by its raw type. I found the appropriate shape for the properties JSON by looking at an existing schedule at resources.azure.com.
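
To reuse the function with a different time or time zone, the properties JSON can be built by a separate helper. This is a minimal sketch, assuming the same JSON shape as the function above; shutdown_properties and its argument order are hypothetical, not part of the Azure CLI, and the resource ID in the example is a placeholder.

```shell
# Hypothetical helper: print the schedule properties JSON for one VM.
# $1 = VM resource ID
# $2 = shutdown time as HHMM (defaults to 1900)
# $3 = time zone ID in the same format as above (defaults to W. Australia Standard Time)
shutdown_properties() {
    printf '{ "taskType": "ComputeVmShutdownTask", "dailyRecurrence": { "time": "%s" }, "timeZoneId": "%s", "targetResourceId": "%s" }\n' \
        "${2:-1900}" "${3:-W. Australia Standard Time}" "$1"
}

# Example with a placeholder resource ID (not a real one) and a 10:30pm shutdown
shutdown_properties "/subscriptions/<sub-id>/resourceGroups/batch-rg/providers/Microsoft.Compute/virtualMachines/my-name-gpu-vm" 2230
```

The printed string can then be passed to az resource create via -p, in place of the hard-coded SHUTDOWN_PROPERTIES.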

Docker hosting Jupyter

Firstly, the Azure file share: there is a guide, Use Azure Files with Linux, but the commands I used on the VM (with AZURE_STORAGE_ACCOUNT and AZURE_STORAGE_KEY set there as above) were:

sudo mkdir -p /afs
AZURE_FILES_SMB_CONNECTION="//${AZURE_STORAGE_ACCOUNT}.file.core.windows.net/machinelearning /afs cifs vers=3.0,username=${AZURE_STORAGE_ACCOUNT},password=${AZURE_STORAGE_KEY},dir_mode=0777,file_mode=0777,serverino"
sudo bash -c 'echo "'"${AZURE_FILES_SMB_CONNECTION}"'" >> /etc/fstab'
sudo mount -a
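
The fstab line above stores the storage key in /etc/fstab, which is normally readable by every user on the machine. mount.cifs also accepts a credentials= option pointing at a file, so an alternative is to keep the key in an owner-only file. This is a minimal sketch of that variant, not what was used for this post: the write_smb_credentials function, the .cred file name and the directory passed to it are illustrative choices.

```shell
# Hypothetical helper: write a CIFS credentials file and print the matching fstab line.
# $1 = directory for the credentials file, $2 = storage account, $3 = storage key
write_smb_credentials() {
    cred_file="$1/$2.cred"
    mkdir -p "$1"
    # umask 077 creates the file owner-only before the key is written into it
    ( umask 077; printf 'username=%s\npassword=%s\n' "$2" "$3" > "$cred_file" )
    chmod 600 "$cred_file"   # also tighten permissions if the file already existed
    echo "//$2.file.core.windows.net/machinelearning /afs cifs vers=3.0,credentials=${cred_file},dir_mode=0777,file_mode=0777,serverino"
}

# Demo in a throwaway directory with a dummy key
DEMO_DIR=$(mktemp -d)
write_smb_credentials "$DEMO_DIR" myaccount dummy-key
```

On the VM you would run this as root with a root-owned directory and the real account name and key, then append the printed line to /etc/fstab instead of the one containing password=.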

Now we have our file share mounted locally. Finally, we run our notebook server inside a Docker container:

sudo docker pull tensorflow/tensorflow:latest-gpu-py3

nvidia-docker run -dit \
    -p 8888:8888 \
    -v /afs:/data \
    -v $PWD:/notebooks \
    -e "PASSWORD=my-jupyter-password" \
    --restart unless-stopped \
    --name tf \
    tensorflow/tensorflow:latest-gpu-py3

We are:

  • -dit: running detached (as a daemon), interactive and with a TTY
  • -p 8888:8888: exposing port 8888 from the container as 8888 on the host
  • Mounting our Azure file share from /afs to /data inside the container
  • Mounting our current directory to /notebooks in the container. This is the working directory for Jupyter running in the container
  • Providing a PASSWORD environment variable which will be used as the Jupyter password. See https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/docker/jupyter_notebook_config.py
  • Having the container start with the machine after a restart unless we specifically stop it
  • Running the latest TensorFlow Python 3 container with GPU support

If running CPU only, change nvidia-docker to docker and tensorflow/tensorflow:latest-gpu-py3 to tensorflow/tensorflow:latest-py3. Now visiting http://${VM_NAME}-gpu-vm.westus2.cloudapp.azure.com:8888 in the browser brings us to our password-protected notebook :)
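
The CPU/GPU swap described above can also be wrapped in a pair of helpers so the same script works on both machines. This is a minimal sketch: docker_cmd, tf_image and the gpu/cpu mode names are hypothetical, and the example only prints the resulting command rather than running it.

```shell
# Hypothetical helpers: pick the launcher and image tag for a given mode.
docker_cmd() {
    case "$1" in
        gpu) echo "nvidia-docker" ;;
        cpu) echo "docker" ;;
        *)   echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

tf_image() {
    case "$1" in
        gpu) echo "tensorflow/tensorflow:latest-gpu-py3" ;;
        cpu) echo "tensorflow/tensorflow:latest-py3" ;;
        *)   echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

MODE=cpu   # or gpu
# Print the command that would be run for this mode
echo "$(docker_cmd "$MODE") run -dit -p 8888:8888 -v /afs:/data -v \$PWD:/notebooks -e PASSWORD=my-jupyter-password --restart unless-stopped --name tf $(tf_image "$MODE")"
```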