Kubernetes Storage Considerations for AI Workloads

This is a repost of my 3-part blog post from November 2018. We discuss storage options for AI, particularly deep learning, and show how to avoid a few common pitfalls, especially regarding small file support. The structure is as follows:

  1. Fundamentals of IBM Cloud Storage Solutions for Kubernetes including Cloud Object, File and Block Storage as well as Intel’s specialized vck driver
  2. Deploying Large, Data-Intensive AI Applications Using Kubernetes on IBM Cloud
  3. Performance optimization of deep learning training on large datasets consisting of small files

We will first introduce the basics of Kubernetes storage as well as the general procedures to provision it with concrete examples for IBM Cloud. Afterwards, we will take another concrete use case, dataset visualization, and show how to deploy an AI application to a Kubernetes cluster leveraging cloud storage. Finally, we will discuss the aforementioned case of supporting small files in deep learning training and show how to achieve at least a 36x speedup through data transformation and shared memory.

Fundamentals of IBM Cloud Storage Solutions

Falk Pollok, Parijat Dube

Introduction

With over 40,000 GitHub stars and adoption in cloud platforms from IBM Cloud Kubernetes Service and Amazon Elastic Container Service for Kubernetes (EKS) to Azure Kubernetes Service (AKS) and, obviously, Google Kubernetes Engine (GKE), Kubernetes has become one of the most prominent cloud technologies. Besides these public cloud examples, it has similarly advanced in on-premise and hybrid cloud deployments from IBM Cloud Private (ICP) to GKE On-Prem and Red Hat’s OpenShift. Thus, it comes as no surprise that it is also heavily used within IBM Research, where we have successfully adapted it for AI workloads. However, microservice architectures generally require processes, and hence containers, to be stateless, which is for instance demanded in point 6 of the Twelve-Factor App guidelines. Sending data over the network is one of the main bottlenecks to avoid in any data-centric workload, from Big Data jobs such as Spark-based data science jobs to training in deep learning and deep reinforcement learning up to neural network visualization. Storage solutions in Kubernetes have seen many changes over recent versions, leading to the current solution of Persistent Volume Claims and Flex drivers as well as specialized storage drivers like Intel’s VCK. Rather than focusing on a historical perspective, we will follow a more hands-on approach in this article and present different storage options as well as a brief outlook on this evolving field.

We will first look at the three traditional storage options, namely cloud object storage (COS), file storage (NFS) and block storage (ext4 via iSCSI), as the main topic of this article and then briefly touch upon AI-specific considerations and evolving technologies.

Basic Volume Provisioning Mechanisms in Kubernetes

Provisioning Volumes with Persistent Volumes and Persistent Volume Claims

There are two important objects for Kubernetes volumes, Persistent Volumes (PV) and Persistent Volume Claims (PVC), which facilitate a separation of concerns: administrators can provision PVs and developers can request storage via PVCs. Volumes can either be provisioned statically by manually creating these objects, or they can be provisioned dynamically through a storage class, in which case the developer only creates a PVC that refers to a specific storage class and Kubernetes handles the provisioning through storage drivers.

Originally, volume plugins had to be in-tree, i.e. they had to be compiled with Kubernetes. This was alleviated by so-called Flex volumes, which are the current state of the art in production use and will be used in this blog post. Flex volumes allow external volume plugins, but still generally require root access to the nodes, since they require dependencies to be installed on the workers.

The Cloud Object Storage driver we discuss below, for instance, depends on s3fs binaries and will deploy them by launching a daemonset that runs one pod on each worker node that will then open a tunnel into the worker itself (which requires privileged access) to copy its binaries.

A better approach in the near future will be CSI drivers, which run completely containerized and thus do not depend on advanced privileges. Furthermore, CSI is an independent standard that also applies to other container orchestrators (COs) like Docker and Mesos, and it will be used through the same Kubernetes primitives mentioned above (PVs, PVCs and storage classes), so most of this blog post series should be applicable even after the transition.

File Storage

File storage is the easiest to deploy, since it is supported by default — we do not need to install drivers and can directly provision a Persistent Volume Claim with it. When you list your storage classes (and grep for file storage in particular) you will find that there are different QoS classes:

$ kubectl get sc | grep ibmc-file
default ibm.io/ibmc-file 24d
ibmc-file-bronze ibm.io/ibmc-file 24d
ibmc-file-custom ibm.io/ibmc-file 24d
ibmc-file-gold ibm.io/ibmc-file 24d
ibmc-file-retain-bronze ibm.io/ibmc-file 24d
ibmc-file-retain-custom ibm.io/ibmc-file 24d
ibmc-file-retain-gold ibm.io/ibmc-file 24d
ibmc-file-retain-silver ibm.io/ibmc-file 24d
ibmc-file-silver ibm.io/ibmc-file 24d

The bronze, silver and gold classes are so-called SoftLayer endurance storage (roughly meaning that IOPS are determined by size and that snapshots and replication are supported); you can find their exact specs in the storage class reference. Custom storage allows one to specify with fine granularity how many IOPS are required. The underlying hard disk type is then determined by the IOPS/GB ratio (<= 0.3 SATA, > 0.3 SSD). Retain determines whether your files are kept when you delete the corresponding PVC.
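
The IOPS/GB rule for custom storage can be sketched in a few lines of Python. This is only an illustration of the threshold stated above; the function name disk_type_for is hypothetical and not part of any IBM Cloud API, and real provisioning may apply additional rules.

```python
def disk_type_for(iops: int, size_gb: int) -> str:
    """Return the disk type implied by the IOPS/GB ratio described above.

    Assumption: only the 0.3 IOPS/GB threshold from the text is applied
    (<= 0.3 -> SATA, > 0.3 -> SSD).
    """
    if size_gb <= 0:
        raise ValueError("size_gb must be positive")
    ratio = iops / size_gb
    return "SATA" if ratio <= 0.3 else "SSD"


if __name__ == "__main__":
    # 100 IOPS on a 1000 GB volume is 0.1 IOPS/GB
    print(disk_type_for(100, 1000))
    # 1000 IOPS on a 100 GB volume is 10 IOPS/GB
    print(disk_type_for(1000, 100))
```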

Once you have determined your storage class you can create a PVC as follows:

kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfspvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 41Gi
  storageClassName: ibmc-file-gold
EOF

We will first finish showing how to create PVCs for each storage type and then demonstrate how to mount them inside a pod.

Block Storage

Block storage is slightly more involved, but also very easy to deploy, since its helm charts are already provided in the IBM Cloud registry. As a result, we can just add the repository and run the following commands to obtain a new set of storage classes:

helm init
helm repo add ibm https://registry.bluemix.net/helm/ibm
helm repo update
helm install ibm/ibmcloud-block-storage-plugin

If you want to search all available helm charts, run helm search. We can then provision a PVC just like before:

kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: regpvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: ibmc-block-gold
EOF
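
Since the file and block storage claims differ only in name, size and storage class, they can also be generated programmatically. The following is a minimal sketch, not part of the original setup: the helper make_pvc and the script name make_pvc.py are hypothetical. It relies on the fact that kubectl accepts JSON as well as YAML.

```python
import json


def make_pvc(name: str, storage_class: str, size: str,
             access_mode: str = "ReadWriteOnce") -> dict:
    """Build a PersistentVolumeClaim manifest with the fields used above."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": [access_mode],
            "resources": {"requests": {"storage": size}},
            "storageClassName": storage_class,
        },
    }


if __name__ == "__main__":
    # The two claims from this article: file storage and block storage.
    claims = [
        make_pvc("nfspvc", "ibmc-file-gold", "41Gi"),
        make_pvc("regpvc", "ibmc-block-gold", "1Gi"),
    ]
    # Wrap both claims in a v1 List so they can be applied in one go.
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": claims},
                     indent=2))
```

Assuming the script is saved as make_pvc.py, its output could be piped into the cluster with python3 make_pvc.py | kubectl apply -f -.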

Cloud Object Storage

S3, the Simple Storage Service, originated as Amazon’s fifth cloud product over a decade ago. According to SimilarTech it is used by over 200k websites, and it is also the central storage component for Netflix (which developed S3mper to provide a consistent secondary index on top of an eventually consistent storage) as well as Reddit, Pinterest, Tumblr and others. Arguably more important than the success of an individual product is its adoption as an API. With cloud object storage solutions from many different cloud vendors adopting it, it has been one of the first vendor-independent cloud standards. IBM’s Cloud Object Storage is S3 compatible and can thus be used with any S3-compatible tooling. In the following, we will set up a COS instance, present important tools and discuss how to comfortably support it within a Kubernetes cluster through Flex drivers.

Setup of COS

Search for “Cloud Object Storage” in IBM Cloud and click on the “Object Storage” infrastructure service.

Click on “Create”. You can use default parameters.

Afterwards, you need to create new credentials. Please make sure to set {"HMAC":true} as inline configuration parameter. This will make sure that an access key ID and secret access key pair are created that we can use for a Watson Machine Learning manifest as well as managing our COS buckets with the AWS CLI.

Afterwards, you can switch to the Buckets tab and create a new bucket. Note that the bucket name needs to be globally unique.

Once a bucket exists, we can manually add files to it. The image depicts the way to do so via the GUI.

For example, we could add the MNIST dataset.

Cloud Object Storage Tooling

AWS CLI

Once you have a COS instance, you can access it through the S3 part of the AWS CLI. Besides the standard file commands cp (copy), ls (list), mv (move) and rm (remove) known from Unix-like operating systems, it allows us to sync files, make buckets (mb) and remove them (rb). Interested readers can find the less frequently used commands presign and website in the documentation. In practice, this looks as follows:

  • Empty COS bucket
aws s3 rm --recursive --endpoint-url https://s3-api.us-geo.objectstorage.softlayer.net s3://<bucket_name>
  • List COS bucket content
aws s3 ls --endpoint-url https://s3-api.us-geo.objectstorage.softlayer.net s3://<bucket_name>
  • Copy file to COS (also works to download COS files if parameters are flipped)
aws s3 cp --endpoint-url https://s3-api.dal-us-geo.objectstorage.service.networklayer.com <local_file> s3://<target_bucket>
  • Upload all files from current directory to COS
aws s3 cp --endpoint-url https://s3-api.dal-us-geo.objectstorage.service.networklayer.com --recursive . s3://<target_bucket>

To specify the credentials to use you can define the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY either globally or just in front of the command, e.g.

AWS_ACCESS_KEY_ID=<AWS_ACCESS_KEY> AWS_SECRET_ACCESS_KEY=<AWS_SECRET_ACCESS_KEY> aws s3 ls --endpoint-url https://s3-api.us-geo.objectstorage.softlayer.net s3://<BUCKET_NAME>

Alternatively, you can modify ~/.aws/credentials to include different keys following the pattern

[<PROFILE_NAME>]
aws_access_key_id = <AWS_ACCESS_KEY>
aws_secret_access_key = <AWS_SECRET_ACCESS_KEY>

The entry where <PROFILE_NAME> is default is used if you do not specify a profile in the CLI command. Other entries can be used as follows:

aws s3 ls --endpoint-url https://s3-api.us-geo.objectstorage.softlayer.net --profile <PROFILE_NAME> s3://<BUCKET_NAME>
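
The credentials file uses a plain INI layout, so its profiles can also be read from Python with the standard library. The sketch below is only an illustration: the sample keys and the helper read_profile are made up, and a string is parsed instead of the real ~/.aws/credentials so that the example is self-contained.

```python
import configparser
from typing import Tuple

# Made-up sample content following the pattern shown above.
SAMPLE = """\
[default]
aws_access_key_id = DEFAULT_KEY
aws_secret_access_key = DEFAULT_SECRET

[cos]
aws_access_key_id = COS_KEY
aws_secret_access_key = COS_SECRET
"""


def read_profile(text: str, profile: str = "default") -> Tuple[str, str]:
    """Return (access key, secret key) for one profile of a credentials file."""
    # Disable interpolation so that '%' characters in secrets are kept as-is.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    section = parser[profile]
    return section["aws_access_key_id"], section["aws_secret_access_key"]


if __name__ == "__main__":
    print(read_profile(SAMPLE))         # profile used when none is specified
    print(read_profile(SAMPLE, "cos"))  # corresponds to --profile cos
```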

Mounting COS to file system via s3fs and goofys

Instead of always specifying all parameters and the exact bucket name and performing all operations through the AWS CLI, it is more convenient to mount the bucket as a folder onto the existing file system. This can be done via s3fs or goofys. To mount a bucket via goofys just create a file ~/.aws/credentials that contains the credentials like this:

[default]
aws_access_key_id = <ACCESS_KEY>
aws_secret_access_key = <SECRET_ACCESS_KEY>

Download goofys with wget http://bit.ly/goofys-latest and then mount the volume via

./goofys-latest --endpoint=https://s3-api.us-geo.objectstorage.softlayer.net <BUCKET_NAME> ~/testmount

(assuming you have created the mount target via mkdir ~/testmount and made goofys executable via chmod +x goofys-latest)

You can unmount the volume via sudo umount ~/testmount/

Alternatively, you can use s3fs. Create a credentials file ~/.cos_creds with:

<ACCESS_KEY>:<SECRET_ACCESS_KEY>

Make sure neither your group nor others have access rights to this file, e.g. via chmod go-rwx ~/.cos_creds. You can then mount the bucket via

s3fs dlaas-ci-tf-training-data-us-standard ~/testmount -o passwd_file=$HOME/.cos_creds -o url=https://s3-api.us-geo.objectstorage.softlayer.net -o use_path_request_style

Note that s3fs can optionally provide extensive logging information:

s3fs dlaas-ci-tf-training-data-us-standard ~/testmount -o passwd_file=$HOME/.cos_creds -o dbglevel=info -f -o curldbg -o url=https://s3-api.us-geo.objectstorage.softlayer.net -o use_path_request_style &

More information about s3fs can be found in its documentation.
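
If the credentials file is generated by a script, the restrictive permissions can be set at creation time. The following is a minimal sketch for POSIX systems; the helper write_s3fs_credentials is hypothetical and simply writes the <ACCESS_KEY>:<SECRET_ACCESS_KEY> line described above.

```python
import os
import stat
import tempfile


def write_s3fs_credentials(path: str, access_key: str, secret_key: str) -> None:
    """Write '<ACCESS_KEY>:<SECRET_ACCESS_KEY>' to path, accessible by the owner only."""
    # os.open lets us pass the permission bits when the file is created.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(f"{access_key}:{secret_key}\n")
    # If the file already existed with wider permissions, tighten them.
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)


if __name__ == "__main__":
    # Demonstration in a temporary directory instead of ~/.cos_creds.
    with tempfile.TemporaryDirectory() as directory:
        target = os.path.join(directory, ".cos_creds")
        write_s3fs_credentials(target, "<ACCESS_KEY>", "<SECRET_ACCESS_KEY>")
        print(oct(stat.S_IMODE(os.stat(target).st_mode)))
```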

In simple test environments it might be sufficient to mount the folder as a host volume, e.g. into Minikube via minikube mount ~/testmount:/cosdata. Note that this spawns a daemon that keeps running, so either put it in the background or continue in a new terminal. To test, you can then ssh into Minikube (minikube ssh) and list the files with ls /cosdata. In turn, you can mount the directory from within the Minikube VM into a pod via a hostPath binding:

kubectl create -f proxymounttest.yml

with proxymounttest.yml:

apiVersion: v1
kind: Pod
metadata:
  name: ubuntu
spec:
  containers:
    - name: ubuntu
      image: ubuntu:14.04
      imagePullPolicy: IfNotPresent
      stdin: true
      stdinOnce: true
      tty: true
      workingDir: /cosdata
      volumeMounts:
        - mountPath: /cosdata
          name: host-mount
  volumes:
    - name: host-mount
      hostPath:
        path: /cosdata

Of course, you could achieve the same through a PVC. Both approaches lead to the result that you can exec into the pod (kubectl exec -it ubuntu -- /bin/bash) and run ls /cosdata to list the folder just like we did in the Minikube VM.

However, for Kubernetes clusters in production it is more desirable to properly mount volumes via drivers and hence we will discuss in the following how to use a Flex driver.

Cloud Object Storage: Driver Setup

Now that the Helm charts are available in IBM Cloud it suffices to use them to obtain storage classes for COS, i.e.

helm repo add stage https://registry.stage1.ng.bluemix.net/helm/ibm
helm repo update
helm fetch --untar stage/ibmcloud-object-storage-plugin
helm plugin install ibmcloud-object-storage-plugin/helm-ibmc
helm init
kubectl get pod -n kube-system | grep tiller
# Wait & check until state is Running
helm ibmc install stage/ibmcloud-object-storage-plugin -f ibmcloud-object-storage-plugin/ibm/values.yaml

You can then list the new storage classes: kubectl get sc

Cloud Object Storage: Manual Driver Setup

IBM has released an open source COS plugin for Kubernetes. It is currently the most difficult to set up of all storage classes, but it is not unreasonable to expect that Helm charts for COS will soon be added to the IBM repository, after which the setup will be as easy as for block storage. Still, it is a good demonstration, since it shows how to load custom drivers into IBM Cloud. The manual setup needs custom images, so you need credentials for either an IBM Cloud container registry or a private Docker registry. The steps below assume that you have set the environment variables DOCKER_REPO, DOCKER_REPO_USER, DOCKER_REPO_PASS and DOCKER_NAMESPACE to your registry credentials and Docker namespace.

Let us afterwards clone and compile the driver:

git clone https://github.com/IBM/ibmcloud-object-storage-plugin.git
cd ibmcloud-object-storage-plugin/deploy/binary-build-and-deploy-scripts/
./build-all.sh

We then need to login to our registry and push the images we just built.

docker login --username=${DOCKER_REPO_USER} --password=${DOCKER_REPO_PASS} https://${DOCKER_REPO}
docker tag ibmcloud-object-storage-deployer:v001 $DOCKER_REPO/$DOCKER_NAMESPACE/ibmcloud-object-storage-deployer:v001
docker tag ibmcloud-object-storage-plugin:v001 $DOCKER_REPO/$DOCKER_NAMESPACE/ibmcloud-object-storage-plugin:v001
docker push $DOCKER_REPO/$DOCKER_NAMESPACE/ibmcloud-object-storage-deployer:v001
docker push $DOCKER_REPO/$DOCKER_NAMESPACE/ibmcloud-object-storage-plugin:v001

Now we need to deploy the plugin and driver. As a first step, we need to adapt the YAML descriptors to refer to our docker registry and namespace. Since we do this via sed and macOS does not use GNU sed by default, we use a variable CMD_SED to point to the right command. On macOS this assumes you installed GNU sed as gsed, e.g. via brew. We also need to create a docker-registry secret on the Kubernetes cluster so it can authenticate against the container registry and pull the images. Finally, we can deploy the plugin, provisioner and storage class.

operating_system=$(uname)
if [[ "$operating_system" == 'Linux' ]]; then
    CMD_SED=sed
elif [[ "$operating_system" == 'Darwin' ]]; then
    CMD_SED=gsed
fi
# Replace image tag in yaml descriptors to point to registry and namespace
$CMD_SED -i "s/image: \"ibmcloud-object-storage-deployer:v001\"/image: \"$DOCKER_REPO\/$DOCKER_NAMESPACE\/ibmcloud-object-storage-deployer:v001\"/g" deploy-plugin.yaml
$CMD_SED -i "s/image: \"ibmcloud-object-storage-plugin:v001\"/image: \"$DOCKER_REPO\/$DOCKER_NAMESPACE\/ibmcloud-object-storage-plugin:v001\"/g" deploy-provisioner.yaml
# Create secret, then deploy daemonset and plugin
kubectl create secret docker-registry regcred --docker-server=${DOCKER_REPO} --docker-username=${DOCKER_REPO_USER} --docker-password=${DOCKER_REPO_PASS} --docker-email=unknown@docker.io -n kube-system
kubectl create -f deploy-plugin.yaml
kubectl create -f deploy-provisioner.yaml
cd ..
kubectl create -f ibmc-s3fs-standard-StorageClass.yaml
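
If installing GNU sed is not an option, the image substitution can also be done in a few lines of Python. This is a sketch of an alternative to the sed calls above, not what the plugin's own scripts do; the helper qualify_images and the sample values registry.example.com and my-namespace are hypothetical.

```python
from typing import Iterable


def qualify_images(manifest: str, registry: str, namespace: str,
                   images: Iterable[str]) -> str:
    """Prefix bare image references with '<registry>/<namespace>/'."""
    for image in images:
        manifest = manifest.replace(
            f'image: "{image}"',
            f'image: "{registry}/{namespace}/{image}"',
        )
    return manifest


if __name__ == "__main__":
    # Made-up excerpt; in practice, read deploy-plugin.yaml or
    # deploy-provisioner.yaml and write the result back to the same file.
    sample = 'containers:\n  - image: "ibmcloud-object-storage-plugin:v001"\n'
    print(qualify_images(sample, "registry.example.com", "my-namespace",
                         ["ibmcloud-object-storage-deployer:v001",
                          "ibmcloud-object-storage-plugin:v001"]))
```

Because the search string includes the opening quote directly before the image name, running the replacement a second time leaves an already qualified manifest unchanged.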

Provision Cloud Object Storage PVC and Test Pod

With the driver it is now easy to create a PVC, but unlike the previous approaches we need a secret to hold our COS credentials. Hence, we create the secret first:

kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
type: ibm/ibmc-s3fs
metadata:
  name: test-secret
  namespace: default
data:
  access-key: <AWS_ACCESS_KEY>
  secret-key: <AWS_SECRET_ACCESS_KEY>
EOF

Please note that the credentials need to be Base64 encoded, so please encode them via echo -n "<SECRET>" | base64
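
The same encoding can be done in Python, which avoids accidentally encoding a trailing newline. The sketch below also assembles the Secret manifest with the fields shown above; the helper names encode_secret_value and make_cos_secret as well as the sample keys are hypothetical.

```python
import base64
import json


def encode_secret_value(value: str) -> str:
    """Base64-encode a value, equivalent to: echo -n "<value>" | base64"""
    return base64.b64encode(value.encode("utf-8")).decode("ascii")


def make_cos_secret(name: str, access_key: str, secret_key: str,
                    namespace: str = "default") -> dict:
    """Build the Secret manifest used by the COS storage class."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "type": "ibm/ibmc-s3fs",
        "metadata": {"name": name, "namespace": namespace},
        "data": {
            "access-key": encode_secret_value(access_key),
            "secret-key": encode_secret_value(secret_key),
        },
    }


if __name__ == "__main__":
    # Placeholder credentials; kubectl accepts the JSON output via apply -f -.
    secret = make_cos_secret("test-secret", "my-access-key", "my-secret-key")
    print(json.dumps(secret, indent=2))
```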

We can then create the PVC itself:

kubectl apply -f - <<EOF
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: s3fs-test-ds-pvc
  namespace: default
  annotations:
    volume.beta.kubernetes.io/storage-class: "ibmc-s3fs-standard"
    ibm.io/auto-create-bucket: "true"
    ibm.io/auto-delete-bucket: "true"
    ibm.io/bucket: "<unique bucket_name>"
    ibm.io/endpoint: "https://s3-api.us-geo.objectstorage.softlayer.net"
    ibm.io/region: "us-standard"
    ibm.io/secret-name: "test-secret"
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 40Gi
EOF

Now that all PVCs have been created we can create a pod that mounts all of them:

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: s3fs-test-pod
  namespace: default
spec:
  containers:
    - name: s3fs-test-container
      image: anaudiyal/infinite-loop
      volumeMounts:
        - mountPath: "/cos"
          name: s3fs-test-volume
        - mountPath: "/block"
          name: block-test-volume
        - mountPath: "/nfs"
          name: nfs-test-volume
  volumes:
    - name: s3fs-test-volume
      persistentVolumeClaim:
        # Must match the name of the COS PVC created above
        claimName: s3fs-test-ds-pvc
    - name: block-test-volume
      persistentVolumeClaim:
        claimName: regpvc
    - name: nfs-test-volume
      persistentVolumeClaim:
        claimName: nfspvc
EOF

AI-Specific Storage Solutions: Intel vck

Intel vck (formerly KVC, the Kubernetes Volume Controller) is being developed by Intel AI and custom-tailored for AI workloads. It provides access to a diverse set of data sources through a Custom Resource Definition, whose inner workings Intel has outlined in a blog post. Instead of IBM Cloud, we tried this in a local DIND (Docker-in-Docker) Kubernetes cluster. To set it up, execute the following:

kubectl create namespace vckns
kubectl config set-context $(kubectl config current-context) --namespace=vckns
git clone https://github.com/IntelAI/vck.git && cd vck
helm init
# Wait until kubectl get pod -n kube-system | grep tiller shows Running state
# Modify helm-charts/kube-volume-controller/values.yaml to use valid tag from https://hub.docker.com/r/volumecontroller/kube-volume-controller/tags/
helm install helm-charts/kube-volume-controller/ -n vck --wait --set namespace=vckns
kubectl get crd
export AWS_ACCESS_KEY_ID=<aws_access_key>
export AWS_SECRET_ACCESS_KEY=<aws_secret_access_key>
kubectl create secret generic aws-creds --from-literal=awsAccessKeyID=${AWS_ACCESS_KEY_ID} --from-literal=awsSecretAccessKey=${AWS_SECRET_ACCESS_KEY}
kubectl create -f resources/customresources/s3/one-vc.yaml
# Content of resources/customresources/s3/one-vc.yaml:
apiVersion: vck.intelai.org/v1alpha1
kind: VolumeManager
metadata:
  name: vck-example1
  namespace: vckns
spec:
  volumeConfigs:
    - id: "vol1"
      replicas: 1
      sourceType: "S3"
      accessMode: "ReadWriteOnce"
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: kubernetes.io/hostname
                  operator: In
                  values:
                    - <SOME_K8S_WORKER_NODE_NAME>
      capacity: 5Gi
      labels:
        key1: val1
        key2: val2
      options:
        endpointURL: https://s3-api.us-geo.objectstorage.softlayer.net
        awsCredentialsSecretName: aws-creds
        sourceURL: "s3://<BUCKET_NAME>/"
kubectl create -f resources/pods/vck-pod.yaml
# Content of resources/pods/vck-pod.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: vck-claim-pod
spec:
  affinity:
    # nodeAffinity and hostPath below were copied from output of
    # kubectl get volumemanager vck-example1 -o yaml
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: vck.intelai.org/vckns-vck-example1-vol1
                operator: Exists
  volumes:
    - name: dataset-claim
      hostPath:
        path: /var/datasets/vck-resource-<LONG_ID>
  containers:
    - image: busybox
      command: ["/bin/sh"]
      args: ["-c", "sleep 1d"]
      name: vck-sleep
      volumeMounts:
        - mountPath: /var/data
          name: dataset-claim

You can afterwards exec into the pod with kubectl exec -it vck-claim-pod -- sh and list the bucket content with ls /var/data.

Deploying Large, Data-Intensive AI Applications Using Kubernetes on IBM Cloud

Arunima Chaudhary, Falk Pollok, Hendrik Strobelt, Daniel Weidele

Kubernetes for Hosting Applications

The intent of this tutorial is to enable you to create applications that expose APIs and UIs using large data backends in custom formats (such as HDF5 or BAM) and to deploy them through Kubernetes with Cloud Object Storage. The simple application that you should be able to host on your Kubernetes cluster is a CIFAR10 image browser that shows for each instance whether the classification of your model matches the ground truth. Here is a preview of the UI:

The sample application exposes a REST API that takes in image IDs and provides the image pixel data, ground truth label and prediction label. We use the CIFAR10 image dataset and the predictions are provided by a LeNet model trained using Watson Machine Learning.

Here is an architecture diagram of the application internals:

The following text describes how to deploy a sample application to a Kubernetes cluster.

Step 1: Download CIFAR10 to your machine and convert it to the HDF5 format

cd tutorial
python3 download_cifar.py

You can now either proceed to Step 2 or, optionally, follow the instructions below to train the model on Watson Machine Learning, which are based on a Watson Studio tutorial.

Step 1.1: Create a Watson Machine Learning Instance

Step 1.2: Create Data and Results Buckets

Step 1.3: Upload CIFAR10 to the Data Bucket

$ aws --endpoint-url=https://s3-api.us-geo.objectstorage.softlayer.net --profile my_profile s3 cp cifar10/ s3://<bucket-name> --recursive

Step 1.4: Prepare a Watson Machine Learning Manifest

model_definition:
  framework:
    #framework name and version (supported list of frameworks available at 'bx ml list frameworks')
    name: pytorch
    version: 0.3
  #name of the training-run
  name: cifar10 in pytorch
  #Author name and email
  author:
    name: John Doe
    email: johndoe@in.ibm.com
  description: This is running cifar training on multiple models
  execution:
    #Command to execute -- see script parameters in later section !!
    command: python3 main.py --cifar_path ${DATA_DIR}
      --checkpoint_path ${RESULT_DIR} --epochs 10
    compute_configuration:
      #Valid values for name - k80/k80x2/k80x4/p100/p100x2/v100/v100x2
      name: k80
training_data_reference:
  name: training_data_reference_name
  connection:
    endpoint_url: "https://s3-api.us-geo.objectstorage.service.networklayer.com"
    aws_access_key_id: < from cloud portal >
    aws_secret_access_key: < from cloud portal >
  source:
    bucket: < data bucket name >
  type: s3
training_results_reference:
  name: training_results_reference_name
  connection:
    endpoint_url: "https://s3-api.us-geo.objectstorage.service.networklayer.com"
    aws_access_key_id: < from cloud portal >
    aws_secret_access_key: < from cloud portal >
  target:
    bucket: < results bucket name >
  type: s3

Step 1.5: Compress the Code

$ zip model.zip main.py utils.py models/*

Step 1.6: Train the Model in the Cloud

$ ibmcloud ml train model.zip manifest.yml

Step 1.7: Download model.ckpt from the Results Bucket and Place it in the Project’s Root Directory

Step 1.8: Please proceed with Step 3 & 4

Step 2: Setup and activate conda environment

Use Anaconda to create a virtual environment, here named k8st.

conda env create -f environment.yml
source activate k8st

Step 3: Run the application

Now you should be able to run the Flask server.

python server.py

Step 4: Test the application

Test your application at http://localhost:5001/images/?ids=0,1,2,3,4,5 which should return a JSON object.

You can also view the results through a UI at http://localhost:5001 .
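
The REST endpoint can also be queried from Python. The sketch below builds the URL shown above and, assuming the Flask server from Step 3 is running locally, fetches the JSON response. The helper names images_url and fetch_images are hypothetical, and no particular layout of the returned JSON is assumed.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def images_url(ids, base="http://localhost:5001"):
    """Build the request URL for a list of image IDs."""
    # Keep commas unescaped so the URL matches the form used above.
    query = urlencode({"ids": ",".join(str(i) for i in ids)}, safe=",")
    return f"{base}/images/?{query}"


def fetch_images(ids, base="http://localhost:5001"):
    """Fetch and parse the JSON answer (requires the server to be running)."""
    with urlopen(images_url(ids, base), timeout=10) as response:
        return json.load(response)


if __name__ == "__main__":
    print(images_url(range(6)))
    # Uncomment once "python server.py" is running:
    # print(fetch_images(range(6)))
```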

Dockerize your code

To host the app on Kubernetes it has to be containerized, i.e. it has to be bundled into a Docker container. For this part of the tutorial you need to have Docker installed on your machine.

The Dockerfile describes how to build such a container. Once built, the container has to be pushed to a registry for Kubernetes to obtain it from.

Step 1: Create Dockerfile

The Dockerfile can look like this:

FROM continuumio/miniconda3

# Update package lists
RUN apt-get -y update
RUN apt-get -y upgrade
RUN apt-get -y install s3fs

WORKDIR /usr/app

# Build conda environment
RUN mkdir tmp && cd tmp
COPY tutorial/environment.yml .
RUN conda env create -f environment.yml
RUN cd .. && rm -rf tmp

# Create data dir
RUN mkdir data

# Copy secret file
COPY .passwd-s3fs .
RUN chmod 600 .passwd-s3fs

# Copy full tutorial code
COPY tutorial .

# Run instructions
CMD ["source activate k8st && exec python3 server.py"]
ENTRYPOINT ["/bin/bash", "-c"]
EXPOSE 5001

Step 2: Create a .passwd-s3fs file to store credentials for the Cloud Object Storage instance

Step 2.1: Create a Cloud Object Storage Instance

Step 2.2: Get the access credentials for the Cloud Object Storage instance

Step 2.3: Create a .passwd-s3fs file to store the access credentials obtained in Step 2.2

<aws_access_key_id>:<aws_secret_access_key>

Step 3: Build & Test the Container

From the main directory execute

docker build -t k8-tut/tutorial .

After the container is completely built, it can be tested. The following command mounts the tutorial/data directory into the container and runs the container by exposing port 5001 from the container to the local machine:

docker run -it -v "${PWD}/tutorial/data:/usr/app/data" -p "5001:5001" k8-tut/tutorial

Step 4: Create Namespaces and Upload Docker Container to Registry

Step 4.1: Install the IBM Cloud CLI

Step 4.2: Install the Container Registry plug-in

ibmcloud plugin install container-registry -r Bluemix

Step 4.3: Create the Namespace and Upload Docker Container

# Log in to your IBM Cloud account (IBMers add --sso)
ibmcloud login -a https://api.us-east.bluemix.net
# Create namespace, e.g. "k8-tut"
ibmcloud cr namespace-add k8-tut
# Tag the docker image
docker tag k8-tut/tutorial registry.ng.bluemix.net/k8-tut/tutorial
# Push the image
docker push registry.ng.bluemix.net/k8-tut/tutorial

Getting Access to Kubernetes Cluster

Contact your resource administrator to make sure you have a Kubernetes cluster with admin access.

The following steps set up the kubectl CLI to work with your Kubernetes cluster. The sequence of commands is replicated from the Access section of a specific Kubernetes cluster on the IBM Cloud.

Target the IBM Cloud Container Service region in which you want to work:

ibmcloud cs region-set <cluster-region>

Run the command to download the configuration files and set the Kubernetes environment accordingly:

ibmcloud cs cluster-config <cluster-name>

Set the KUBECONFIG environment variable. Copy the output from the previous command and paste it in your terminal. The command output should look similar to the following:

export KUBECONFIG=/Users/$USER/.bluemix/plugins/container-service/clusters/<cluster-name>/<kube-config.yml>

Verify that you can connect to your cluster and have admin access by listing your worker nodes:

kubectl get nodes

COS Driver Setup

Install kubernetes-helm and run the following commands to set up the COS driver:

helm repo add stage https://registry.stage1.ng.bluemix.net/helm/ibm
helm repo update
helm fetch --untar stage/ibmcloud-object-storage-plugin
helm plugin install ibmcloud-object-storage-plugin/helm-ibmc
helm init
kubectl get pod -n kube-system | grep tiller
# Check until state is Running
helm ibmc install stage/ibmcloud-object-storage-plugin -f ibmcloud-object-storage-plugin/ibm/values.yaml

You can then list the new storage classes:

kubectl get sc

Cloud Object storage

Use the following steps to create a Persistent Volume Claim (PVC) for Cloud Object Storage.

Step 1: Obtain the Base64 encoded credentials…

echo -n "<AWS_ACCESS_KEY>" | base64
echo -n "<AWS_SECRET_ACCESS_KEY>" | base64

…and create a secret with them:

kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
type: ibm/ibmc-s3fs
metadata:
  name: test-secret
  namespace: default
data:
  access-key: <AWS_ACCESS_KEY>
  secret-key: <AWS_SECRET_ACCESS_KEY>
EOF

Step 2: Upload cifar10_hdf5 files to a separate bucket:

aws --endpoint-url=https://s3-api.us-geo.objectstorage.softlayer.net --profile my_profile s3 cp tutorial/data/cifar10_hdf5/ s3://<bucket-name>/cifar10_hdf5 --recursive

Step 3: Request the PVC

Replace <bucket-name> with the value you chose in Step 2 and run the entire command. You can change the size of your request from 10Gi to any desired value.

kubectl apply -f - <<EOF
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: cos-pvc
  namespace: default
  annotations:
    volume.beta.kubernetes.io/storage-class: "ibmc-s3fs-standard"
    ibm.io/auto-create-bucket: "false"
    ibm.io/auto-delete-bucket: "false"
    ibm.io/bucket: "<bucket-name>"
    ibm.io/endpoint: "https://s3-api.us-geo.objectstorage.softlayer.net"
    ibm.io/region: "us-standard"
    ibm.io/secret-name: "test-secret"
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
EOF

Create the Deployment

The PVC created in the previous step can now be mounted by the deployment.

kubectl apply -f - <<EOF
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: tut-deploy
  labels:
    app: tut-deploy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tut-deploy
  template:
    metadata:
      labels:
        app: tut-deploy
    spec:
      containers:
        - name: tut-deploy
          image: registry.ng.bluemix.net/k8-tut/tutorial
          ports:
            - containerPort: 5001
          imagePullPolicy: Always
          volumeMounts:
            - mountPath: "/usr/app/data"
              name: s3fs-test-volume
      volumes:
        - name: s3fs-test-volume
          persistentVolumeClaim:
            claimName: cos-pvc
EOF

Scale your application

kubectl apply -f - <<EOF
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: drawnation-scaler
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1beta1
    kind: Deployment
    name: tut-deploy
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 50
EOF
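
To get a feeling for how this autoscaler reacts, the scaling rule from the Kubernetes documentation, desired = ceil(current replicas × current utilization / target utilization), can be sketched as follows. This is a simplified model with a hypothetical function name; the real controller additionally accounts for pod readiness, missing metrics and stabilization. The default values mirror the manifest above, and the tolerance of 0.1 is the documented Kubernetes default.

```python
import math


def desired_replicas(current_replicas: int, current_cpu_percent: float,
                     target_cpu_percent: float = 50,
                     min_replicas: int = 1, max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Simplified version of the HorizontalPodAutoscaler scaling rule."""
    ratio = current_cpu_percent / target_cpu_percent
    # No scaling happens while the ratio is close enough to 1.0.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp the result to the configured replica bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # (current replicas, average CPU utilization in percent)
    for replicas, cpu in [(2, 100), (4, 200), (3, 52), (4, 10)]:
        print(replicas, cpu, "->", desired_replicas(replicas, cpu))
```

With the settings above, two replicas running at 100% average CPU utilization would be scaled to four, and the number of replicas never exceeds ten.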

Expose the service

Paid Cluster: Expose the service using an External IP and Loadbalancer

$ kubectl expose deployment tut-deploy --type LoadBalancer --port 5001 --target-port 5001

Free Cluster: Use the Worker IP and NodePort

$ kubectl expose deployment tut-deploy --type NodePort --port 5001 --target-port 5001

More details can be found at https://github.com/IBM-Cloud/get-started-python/blob/master/README-kubernetes.md

Access the application

Verify that the status of the pod is Running

$ kubectl get pods -l app=tut-deploy

Standard (Paid) Cluster:

Identify your LoadBalancer Ingress IP using

$ kubectl get service tut-deploy

Access your application at http://<EXTERNAL-IP>:5001/

Free Cluster:

Identify your Worker Public IP using

$ ibmcloud cs workers <cluster-name>

Identify the Node Port using

$ kubectl describe service tut-deploy

Access your application at http://<WORKER-PUBLIC-IP>:<NODE-PORT>/

(Optional) Create an Ingress to Access your Application at a Requested Hostname

kubectl apply -f - <<EOF
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: tut-ingress
spec:
  rules:
  - host: <HOSTNAME>
    http:
      paths:
      - path: /
        backend:
          serviceName: tut-deploy
          servicePort: 5001
EOF

Access your app at http://<HOSTNAME>/

Efficient training of Deep Learning jobs on the cloud with datasets consisting of many small files

Parijat Dube, Falk Pollok

Problem Statement

In this study we address the problem of how to efficiently train deep learning models on machine learning cloud platforms, e.g. IBM Watson Machine Learning, when the training dataset consists of a large number of small files (e.g. in JPEG format) and is stored in an object store like IBM Cloud Object Storage (COS). As an example, we train a PyTorch model using the Oxford flowers dataset. Although Oxford flowers is a small dataset, it is representative of the problems that one will encounter with large datasets like ImageNet1K.

This work was initiated when the performance of training jobs with datasets of many small files stored on Cloud Object Storage (COS) was reported to be very poor. Both the initial set-up time (before the job starts running and producing logs) and the runtime of the job (e.g. per-epoch time) were increased. In particular, the problem was observed when running PyTorch models with ImageNet1K stored as single JPEG files on COS. The PyTorch model code is available here: https://github.com/pytorch/examples/tree/master/imagenet.

In the following sections we first explain how data loading and batching work in PyTorch. We then add shared memory and show how the number of workers and the amount of available shared memory influence training time and performance. Finally, we migrate the dataset into a single file in the Hierarchical Data Format (HDF5) and remeasure performance, which shows a >36x improvement. We discuss how this number could be further increased in the conclusion.

Steps in Training

There are three major steps involved in data loading and processing in the training phase: we first preprocess the data, then load the dataset, and finally iterate over the images in it during training. PyTorch provides many tools to facilitate data preprocessing, loading, and iteration, in particular through the classes torch.utils.data.Dataset and torch.utils.data.DataLoader (see https://pytorch.org/docs/stable/data.html). Dataset is an abstract class. A custom dataset class inherits from it and overrides the __len__ and __getitem__ methods:

  • __len__ so that len(dataset) returns the size of the dataset
  • __getitem__ to support indexing such that dataset[i] returns the i-th sample

Torchvision already provides Dataset implementations for most popular datasets, see https://pytorch.org/docs/stable/torchvision/datasets.html

All datasets in torchvision.datasets are subclasses of torch.utils.data.Dataset with preimplemented __getitem__ and __len__ methods.
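To make the two methods concrete, here is a minimal sketch of the protocol in plain Python, so that it runs without PyTorch installed. The class name FileListDataset, its arguments, and the file names are hypothetical; in real code the class would inherit from torch.utils.data.Dataset and __getitem__ would decode the image instead of returning its path.

```python
class FileListDataset:
    """Toy map-style dataset over (path, label) pairs (illustrative only)."""

    def __init__(self, samples, transform=None):
        self.samples = list(samples)   # e.g. [("img_0.jpg", 0), ...]
        self.transform = transform     # optional callable applied per sample

    def __len__(self):
        # Called by len(dataset)
        return len(self.samples)

    def __getitem__(self, index):
        # Called by dataset[index]; a real dataset would open the file here
        path, label = self.samples[index]
        if self.transform is not None:
            path = self.transform(path)
        return path, label


dataset = FileListDataset(
    [("rose_1.jpg", 0), ("rose_2.jpg", 0), ("tulip_1.jpg", 1)],
    transform=str.upper,
)
print(len(dataset))   # 3
print(dataset[2])     # ('TULIP_1.JPG', 1)
```

Because every access goes through __getitem__, each sample costs one file read, which is why per-file latency of the underlying storage matters so much for datasets of many small files.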

In the PyTorch code with ImageNet the torchvision.datasets.ImageFolder class is invoked in line 119:

import torchvision.datasets as datasets
import torchvision.transforms as transforms

train_dataset = datasets.ImageFolder(
    traindir,
    transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        normalize,
    ])
)

ImageFolder is a subclass of Dataset provided by torchvision.datasets. Various transforms can be applied to the images when they are loaded through the Dataset. The class torchvision.transforms.Compose(transforms) takes a list of transform objects as input, which are applied sequentially to each image in traindir when the Dataset is sampled. Above we see four transforms on each image when creating the dataset:

  • RandomResizedCrop(224)
  • RandomHorizontalFlip()
  • ToTensor()
  • normalize

After the train_dataset is instantiated, an instance of torch.utils.data.DataLoader is created in line 133:

train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=args.batch_size, shuffle=(train_sampler is None),
    num_workers=args.workers, pin_memory=True, sampler=train_sampler
)

The for loop in line 185 iterates over the training dataset.

for i, (input, target) in enumerate(train_loader):
    # measure data loading time
    data_time.update(time.time() - end)

    target = target.cuda(non_blocking=True)

    # compute output
    output = model(input)
    loss = criterion(output, target)

    # measure accuracy and record loss
    prec1, prec5 = accuracy(output, target, topk=(1, 5))
    losses.update(loss.item(), input.size(0))
    top1.update(prec1[0], input.size(0))
    top5.update(prec5[0], input.size(0))

    # compute gradient and do SGD step
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # measure elapsed time
    batch_time.update(time.time() - end)
    end = time.time()

The DataLoader class combines a dataset and a sampler. It provides an iterator which supports batching, shuffling, and parallel loading of data using multiprocessing workers. If shuffle is true, the dataset is reshuffled at the beginning of every training epoch. Then a batch of images is loaded into memory using num_workers subprocesses, with num_workers=0 meaning the data will be loaded in the main process. Before a batch of images is loaded into memory, all the transforms in transforms.Compose() are applied to each image. Each of these transforms adds to the data loading time of the dataset. Furthermore, the time consumed by this step is linear in the number of images (batch size) to be read from COS, transformed, and loaded at each iteration. After the data is loaded, a forward pass and a backward pass follow.
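The batching and shuffling behaviour described above can be sketched in a few lines of plain Python. This is a simplified, single-process stand-in and not PyTorch's implementation: the helper names compose and iterate_batches are hypothetical, there are no worker subprocesses, and the samples are plain integers rather than images.

```python
import random


def compose(transforms):
    """Return a function that applies the given transforms in order."""
    def apply(sample):
        for transform in transforms:
            sample = transform(sample)
        return sample
    return apply


def iterate_batches(dataset, batch_size, shuffle=False, transform=None, seed=None):
    """Yield lists of (optionally transformed) samples, batch_size at a time."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)  # new sample order for this epoch
    for start in range(0, len(indices), batch_size):
        batch = [dataset[i] for i in indices[start:start + batch_size]]
        if transform is not None:
            # every sample pays the cost of every transform
            batch = [transform(sample) for sample in batch]
        yield batch


double_then_increment = compose([lambda x: 2 * x, lambda x: x + 1])
batches = list(iterate_batches(list(range(10)), batch_size=4,
                               transform=double_then_increment))
print(batches)  # [[1, 3, 5, 7], [9, 11, 13, 15], [17, 19]]

# 1020 samples with batch size 64 give 16 batches per epoch
print(len(list(iterate_batches(range(1020), batch_size=64))))  # 16
```

The inner list comprehension makes the linear cost visible: each batch triggers batch_size reads and batch_size transform chains before any computation on the GPU can start.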

To benchmark the performance of PyTorch on an image dataset, we first run main.py with the Oxford flowers dataset, which has 102 classes with 10 images per class, both for the training and validation set. The default model is resnet18. We instrumented main.py to get time for different stages:

  1. model_loading: Time to load the neural network model into GPU memory.
  2. data_preparation: Time to create instances of the Dataset and DataLoader classes.
  3. epoch_time: Time to finish one training epoch. At the beginning of each training epoch Python restarts the train_loader iterator by calling train_loader.__iter__(), which returns an object with a .next() method. Then Python calls the .next() method of this object in order to get the first and subsequent values used by the for loop. In each iteration, after data is loaded into memory, a forward and backward pass is executed. Thus, the time for one iteration (batch_time) includes both data loading and compute time. epoch_time is the sum of batch_time over all the batches in one epoch.
  • data_loading: Time to sample images for one iteration using the DataLoader iterator and to load them into memory.
  • batch_time: Time to load data for one iteration and to compute the forward and backward pass.
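The following is a minimal sketch of how such timing hooks can be written. It is modeled loosely on the AverageMeter helper used in the PyTorch ImageNet example, but it is not the instrumented script itself: timed_epoch, slow_batches, and the sleep durations are hypothetical stand-ins for data loading and GPU compute.

```python
import time


class AverageMeter:
    """Keeps a running sum and count so the average is cheap to read."""

    def __init__(self):
        self.sum = 0.0
        self.count = 0

    def update(self, value, n=1):
        self.sum += value * n
        self.count += n

    @property
    def avg(self):
        return self.sum / self.count if self.count else 0.0


def timed_epoch(batches, compute):
    """Iterate once over batches, timing data loading and whole iterations."""
    data_time, batch_time = AverageMeter(), AverageMeter()
    end = time.perf_counter()
    for batch in batches:
        data_time.update(time.perf_counter() - end)    # waiting for the batch
        compute(batch)                                 # stand-in for forward/backward
        batch_time.update(time.perf_counter() - end)   # data loading + compute
        end = time.perf_counter()
    return data_time, batch_time


def slow_batches(n):
    """Simulate a loader whose batches take a while to arrive."""
    for i in range(n):
        time.sleep(0.01)
        yield i


data_time, batch_time = timed_epoch(slow_batches(3), lambda b: time.sleep(0.005))
print(f"data {data_time.avg:.3f}s, batch {batch_time.avg:.3f}s")
```

Since both meters are measured from the same starting point, batch_time is always at least data_time, and the difference between the two approximates the compute time per iteration.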

Performance using dataset with JPEG files

We first quantify the performance of PyTorch with images as JPEG files. Each job is run for 20 epochs in a Kubernetes pod with 1 Nvidia Tesla P100 GPU, 8 CPUs and 24GiB of memory. The dataset is stored in a COS bucket which is locally mounted on the pod. We first tried to run main.py with default parameters using the command:

python main-timing.py --epochs 20 --batch-size 64 /mnt/oxford-flowers

The job crashed after the first iteration with the following error:

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Epoch: [0][0/16] Time 4.430 (4.430) Data 3.707 (3.707) Loss 6.9139 (6.9139) Prec@1 0.000 (0.000) Prec@5 0.000 (0.000)
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
  File "main-timing.py", line 333, in <module>
    main()
  File "main-timing.py", line 166, in main
    train(train_loader, model, criterion, optimizer, epoch)
  File "main-timing.py", line 205, in train
    for i, (input, target) in enumerate(train_loader):
  File "/opt/conda/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 280, in __next__
    idx, batch = self._get_batch()
  File "/opt/conda/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 259, in _get_batch
    return self.data_queue.get()
  File "/opt/conda/lib/python2.7/Queue.py", line 168, in get
    self.not_empty.wait()
  File "/opt/conda/lib/python2.7/threading.py", line 340, in wait
    waiter.acquire()
  File "/opt/conda/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 178, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 170) is killed by signal: Bus error.

Next, we set num_workers to 0 and ran the following command:

python main_jpg-timing.py --epochs 20 --workers 0 --batch-size 64 /mnt/oxford-flowers

This time the job ran till completion. The instrumented code main_jpg-timing.py is simply main.py with hooks to get time for each step and can be downloaded from: https://github.ibm.com/dl-res/autolauncher/blob/master/main_jpg-timing.py

The table below shows total execution time, average epoch_time, average batch_time, data_loading and average GPU utilization for each run. Also shown are the top-1 and top-5 accuracy after each training epoch. The accuracy is calculated on the validation set which also has 1020 images (102 classes with 10 images each). Note that total_execution_time can be approximated as:

total_execution_time = model_loading + data_preparation + average_epoch_time * number_epochs

So, at this point we can only run the 0-worker configuration with this dataset. The average GPU utilization is extremely low (<2%) and the job runtime is 3033.43 sec. The code supports multiprocessing in the data loading part when workers > 0. Checking the shared memory of the pod (run df -h inside the pod) showed that it was only 64M. We next increased the shared memory of the pod by adding

spec:
  volumes:
  - name: shm
    emptyDir:
      medium: Memory
  containers:
  - image: pytorch/pytorch:0.4.1-cuda9-cudnn7-devel
    volumeMounts:
    - mountPath: /dev/shm
      name: shm

in the pod deployment YAML (with shared memory mount) and deployed the pod using kubectl create.
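As an alternative to running df -h by hand, the size of the shared memory mount can also be checked from within the training script before workers are started. The sketch below uses only the Python standard library; the helper names and the idea of checking against a required threshold are assumptions for illustration and not part of the original scripts.

```python
import os
import shutil


def shared_memory_report(path="/dev/shm"):
    """Return total and free space of the filesystem at path, in MiB."""
    usage = shutil.disk_usage(path)
    return {
        "total_mib": usage.total / 2**20,
        "free_mib": usage.free / 2**20,
    }


def has_enough_shared_memory(required_mib, path="/dev/shm"):
    """True if at least required_mib MiB are free at path."""
    return shared_memory_report(path)["free_mib"] >= required_mib


# /dev/shm exists on Linux pods; guard the call so the sketch runs elsewhere too
if os.path.isdir("/dev/shm"):
    print(shared_memory_report())
```

In a pod with the default 64M mount, such a check could be used to fall back to workers=0 instead of crashing with a bus error.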

Now we can run the job with multiple workers. The table below summarizes the results:

batch size = 64, epochs = 20, number of batches = 16

As we increase the number of worker processes, av_epoch_time gets smaller and hence so does the job's total_exec_time. For example, running with 32 workers, we see a 91% drop (2917.70 to 269.6 sec) in job execution time (compared to running with 0 workers) and a 10.5x increase in average GPU utilization (1.46% to 15.30%).
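The two figures quoted above follow directly from the measured values. A short calculation using only the numbers from the text (the function name relative_drop is just for illustration):

```python
def relative_drop(before, after):
    """Fraction by which a value decreased."""
    return (before - after) / before


exec_drop = relative_drop(2917.70, 269.6)   # job execution time, 0 vs 32 workers
gpu_gain = 15.30 / 1.46                     # average GPU utilization, 32 vs 0 workers
speedup = 2917.70 / 269.6                   # the same drop expressed as a speedup

print(f"{exec_drop:.1%}")   # 90.8%
print(f"{gpu_gain:.1f}x")   # 10.5x
print(f"{speedup:.1f}x")    # 10.8x
```

Expressed as a speedup, the 91% drop in runtime corresponds to the job finishing roughly 10.8 times faster, which is in line with the increase in GPU utilization.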

Next we look at average epoch time. Observe that

average_epoch_time = average_train_time + average_test_time
average_train_time = average_train_batch_time * number_train_batches_per_epoch
average_test_time = average_test_batch_time * number_test_batches_per_epoch
train_batch_time = train_data_loading_time + train_compute_time
test_batch_time = test_data_loading_time + test_compute_time

The table below shows the decomposition of average_epoch_time into average_train_time and average_test_time. Both train and test time are equally affected by the increase in workers. While the compute time (train_compute_time and test_compute_time) is not affected by an increase in workers, the data loading time in both the train and test phases drops significantly with more workers.

shared memory = 32G, batch size = 64, epochs = 20, number of batches = 16

Note that these numbers are all from one run for each configuration. For more accurate estimates of time one needs to run several jobs (for each configuration) and average out the time.
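The timing relations used in this section can be written down as a small model. All input values below are hypothetical and chosen only to make the arithmetic easy to follow; they are not measurements from our runs.

```python
def batch_time(data_loading, compute):
    return data_loading + compute


def epoch_time(train_batch, n_train_batches, test_batch, n_test_batches):
    return train_batch * n_train_batches + test_batch * n_test_batches


def total_execution_time(model_loading, data_preparation, avg_epoch, n_epochs):
    return model_loading + data_preparation + avg_epoch * n_epochs


# Hypothetical per-batch timings in seconds
train_batch = batch_time(data_loading=2.0, compute=0.25)    # 2.25
test_batch = batch_time(data_loading=1.0, compute=0.125)    # 1.125

avg_epoch = epoch_time(train_batch, 16, test_batch, 16)
total = total_execution_time(5.0, 1.0, avg_epoch, n_epochs=20)

print(avg_epoch)  # 54.0
print(total)      # 1086.0
```

In this model only the data_loading terms shrink when workers are added, so once data loading is small relative to compute, adding more workers yields little further gain.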

Performance using HDF5 dataset

We next converted our JPEG dataset into an HDF5 file. The code to prepare an HDF5 file from JPEG files is here:

with h5py.File('dataset.hdf5', 'w') as hdf5_file:
    ...
    hdf5_dataset = hdf5_file.create_dataset(
        name=str(current_name),
        data=image,
        shape=(image_height, image_width, image_channels),
        maxshape=(image_height, image_width, image_channels),
        compression="gzip",
        compression_opts=9
    )

For PyTorch to be able to load data from an HDF5 file, a new subclass of the Dataset class is needed which can return the images in the HDF5 file as tensors. The code for this can be found in main_hdf5-timing.py, which is available here: https://github.ibm.com/dl-res/autolauncher/blob/master/main_hdf5-timing.py
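For readers without access to that repository, the following is a rough sketch of what such a class can look like. It is not the code from main_hdf5-timing.py. It assumes that each image is stored as its own HDF5 dataset under a string key, as in the conversion snippet above, and that labels is a hypothetical mapping from key to class index supplied by the caller. The class is written in plain Python with h5py imported lazily; in real code it would inherit from torch.utils.data.Dataset and convert the returned array to a tensor via the transform.

```python
class HDF5ImageDataset:
    """Sketch of a map-style dataset backed by a single HDF5 file."""

    def __init__(self, path, keys, labels, transform=None):
        self.path = path
        self.keys = list(keys)      # names of the per-image HDF5 datasets
        self.labels = labels        # hypothetical mapping: key -> class index
        self.transform = transform
        self._file = None           # opened on first access, once per process

    def _open(self):
        if self._file is None:
            import h5py  # third-party dependency, imported only when needed
            self._file = h5py.File(self.path, "r")
        return self._file

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, index):
        key = self.keys[index]
        image = self._open()[key][()]   # read the whole image as a NumPy array
        if self.transform is not None:
            image = self.transform(image)
        return image, self.labels[key]


dataset = HDF5ImageDataset(
    "dataset.hdf5",
    keys=["0", "1", "2"],
    labels={"0": 0, "1": 0, "2": 1},
)
print(len(dataset))  # 3
```

Opening the file lazily rather than in the constructor is a deliberate choice in this sketch: sharing one open HDF5 handle across forked worker processes is a commonly reported source of problems, so each process opens its own handle on first access.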

We first ran with default shared memory settings for 0 workers:

python main_hdf5-timing.py --epochs 20 --workers 0 --batch-size 64 /mnt/oxford-flowers

The job ran until completion.

Next, when we tried to run with workers > 0, the job again crashed with the same insufficient shared memory (shm) error that we got before with the JPEG dataset. Again, we deployed a pod with increased shared memory and ran jobs with multiple workers. The jobs ran fine for a small number of workers (0, 1). The table below shows the results from these runs.

shared memory = 126G, batch size = 64, epochs = 20, number of batches = 16

Conclusion

The performance of training jobs depends on the storage and the file format. We found that with COS, a dataset stored in a single HDF5 file performs much better (higher GPU utilization and reduced runtime) than a dataset consisting of many small JPEG files. For example, with one worker we achieved a gain of ~28x. The performance should improve further if we can run multiple workers with HDF5 files, which we are currently investigating. The goal is to prevent GPU starvation (achieve greater than 95% utilization) by increasing the number of workers. The findings in this study are based on the Oxford flowers dataset but are also applicable to larger image datasets and most likely other modalities as well.

