Enterprise Model Training Exemplified Through the Fabric for Deep Learning (FfDL)
I worked on IBM’s Fabric for Deep Learning (FfDL), which was, to the best of my knowledge, the first cloud-based deep learning training solution and which received an InfoWorld Best of Open Source Software (BOSSIE) award in 2018. Let us explore what an Enterprise-grade training service for neural networks looks like.
While the deep learning world moves quickly and today you would probably use Kubeflow and the ecosystem around it, such as mlflow and OpenDataHub, it is still insightful to take a look back at FfDL: In an Enterprise context it is not enough to just train models as you would in most academic settings. Neural models can be prone to adversarial attacks, which they need to be hardened against; you need to run bias detection to make sure they are fair; and you might need to employ explainability techniques, since in many domains (e.g., medicine, law, finance) models that cannot properly justify their decisions are unacceptable. Furthermore, it might be necessary to compress models (e.g., for use on phones or embedded systems), quantify their uncertainty, and so on. There are also the questions of how to stay framework agnostic, how to speed up distributed training, how to continuously retrain models (especially since deep learning is prone to catastrophic forgetting), how to do hyperparameter optimization and automated machine learning, and how to share models via model marketplaces. Finally, you need to track changes to the model, including provenance and model drift, and collect crucial characteristics in factsheets, which serve a purpose similar to nutrition labels on food. FfDL was probably the first open framework that addressed all of these issues comprehensively. Let’s visit these aspects one by one.
Deep Learning Frameworks and Platforms
Let us start with the obvious: There are many deep learning frameworks out there. A few of the more prominent ones are:
- Caffe (BVLC) and Caffe 2.0
- Chainer
- CNTK
- CoreML
- DL4J
- DyNet
- fast.ai
- JAX
- Keras
- Apache MXNet
- PaddlePaddle (PArallel Distributed Deep Learning)
- PyTorch
- Tensorflow
This list could be extended significantly. In order to remain vendor neutral, it is thus necessary to build and maintain many different container images and pods to cover each relevant version of each relevant framework. [Public images are usually not sufficient for this: among many other risks, non-hardened images invite attempts to infiltrate your infrastructure or to run covert bitcoin mining.]
Basic Architecture
The general architecture of the system looks like this:

There is a trainer service that tracks individual model training jobs and a lifecycle manager that controls the provisioning of training pods on top of Kubernetes. You need a training data service, since your training, test, and validation sets need to be loaded from S3/COS and dropped the moment they are no longer needed, but you also need NFS mounts for operative storage. Then there is the actual learner pod, which is specific to the framework used, as well as metrics collectors that push into Prometheus and log collectors that push via fluentd into Elasticsearch, with Kibana on top. For distributed training you can choose between parameter servers, Horovod, and PyTorch’s native distribution. We worked with the Horovod team to have Horovod support ready on launch day and also integrated with other solutions like H2O and Seldon.
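To make the distributed-training option more concrete, here is a minimal sketch of a Horovod-based learner in PyTorch, roughly the kind of script a learner pod would execute. The model, learning-rate scaling, and data handling are placeholder assumptions for illustration, not FfDL’s actual learner code:

```python
# Minimal Horovod learner sketch (PyTorch); model and data are placeholders.
import torch
import horovod.torch as hvd

hvd.init()                                    # one process per learner/GPU
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())   # pin each process to its own GPU

model = torch.nn.Linear(784, 10)
# Common heuristic: scale the learning rate by the number of learners
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across learners via ring all-reduce
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())

# Broadcast initial state from rank 0 so every learner starts identically
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
```

The appeal over classic parameter servers is that the ring all-reduce keeps per-node network traffic roughly constant as you add learners.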
DL Compilers
While we did not put too much emphasis on it in the early days, nowadays you should also look at DL compilers to significantly speed up your models and target specific platforms.
- Glow Compiler by Facebook
- Google XLA
- Intel nGraph
- Latte
- Nvidia TensorRT
- OctoML (on top of Apache TVM)
- Open Neural Network Compiler (ONNC)
- PyTorch Glow
- PlaidML
- TACO – The Tensor Algebra Compiler
- Tensorflow Multi-Level Intermediate Representation (MLIR)
- Tensor Comprehensions
- Tiramisu
- TVM, NNVM
- Libraries like NNPACK, cuDNN, hipDNN/MIOpen etc.
A good example of this is how Amazon SageMaker Neo leverages Apache TVM.
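As a hedged illustration of that flow, the sketch below compiles an ONNX model with TVM’s Relay API; the model file name and input shape are assumptions made for the example:

```python
# Sketch: compiling an exported model with Apache TVM (the engine behind SageMaker Neo).
import onnx
import tvm
from tvm import relay

onnx_model = onnx.load("model.onnx")          # any model exported to ONNX
mod, params = relay.frontend.from_onnx(onnx_model,
                                       shape={"input": (1, 3, 224, 224)})

target = "llvm"                               # e.g. "cuda" to target NVIDIA GPUs
with tvm.transform.PassContext(opt_level=3):  # enable aggressive operator fusion
    lib = relay.build(mod, target=target, params=params)

lib.export_library("compiled_model.so")       # self-contained deployable artifact
```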
Platforms
Nowadays, there is a plethora of AI platforms to choose from, e.g.,
- Acumos AI
- AI Layer
- Airbnb BigHead & data management platform Zipline & metric platform Minerva
- Algorithmia MLOps Platform
- Alibaba Machine Learning Platform for AI (PAI 2.0) and Alibaba Cloud Intelligence Brain
- Apple Overton
- Arya AI
- AWS SageMaker (incl. SageMaker Ground Truth, SageMaker Neo, SageMaker RL)
- Azure ML Studio
- Baidu Brain
- Bonsai
- Data Robot
- Deep Cognition
- DoorDash ML Platform
- eBay Krylov
- Facebook FBLearner
- Flipkart Hunch
- FloydHub
- Gojek’s ML Platform
- Google Vertex AI, also see Google Kubeflow & TFX & Google AutoML
- Groupon Flux
- H2O
- IBM Watson Studio and Watson Machine Learning
- Iguazio
- Intuit ML Platform (based on SageMaker, Argo Workflows, GitOps)
- KNIME
- Kubeflow
- LinkedIn Pro-ML (Productive Machine Learning)
- Lyft Flyte
- Meta AI Looper
- Microsoft OpenPAI, also see Microsoft DL Workspace (DLTS) and Azure ML
- mlflow (Github)
- Netflix Metaflow (and model lifecycle management platform Runway)
- NVIDIA TAO
- OpenAI Rapid
- OpenDataHub (has ties to RHOCP)
- Oracle AI
- Pachyderm
- Pinterest ML Platform
- Polyaxon
- Prowler.io
- RapidMiner
- Emerging Ray-Based Platforms (see Operator First Ray demo and CodeFlare)
- Stripe Railyard – Platform for Model Training
- Apache Submarine
- Swiftstack
- Tensorflow Extended (TFX) (also see Spotify’s TFX-Based ML Platform)
- TransmogrifAI
- Twitter DeepBird
- Uber Michelangelo
- Valohai
- Wix Machine Learning Platform
- Yelp ML Platform
I might write another post in the future comparing the basic architectures. While FfDL’s design is still relevant, it did not have its own inference component, feature store, or data catalog, so looking at architectures like Lyft Flyte and Airbnb BigHead is a worthwhile exercise:

Toolboxes
As mentioned above, a significant part of the value proposition of Enterprise model training goes beyond mere training to supporting the entire lifecycle of models and ensuring their quality, from explainability and debiasing to adversarial hardening, compression, uncertainty quantification, and similar techniques. The following toolkits help with this (a bias-detection sketch follows the list):
- Aequitas – Bias and Fairness Audit Toolkit
- IBM Adversarial Robustness Toolbox (ART) – for adversarial hardening
- IBM AI Fairness 360 (AIF360) – for bias detection
- IBM AI Explainability 360 (AIX360) – for explainability
- IBM Uncertainty Quantification 360 (UQ360) – for uncertainty quantification
- Intel Neural Network Distiller – for neural network compression
- Qualcomm AI Model Efficiency Toolkit
- XAI – An eXplainability toolbox for machine learning
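As a small, hedged example of the bias-detection side, the AIF360 snippet below computes two standard group-fairness metrics on a toy dataset; the column names and the group encoding are illustrative assumptions:

```python
# Toy bias check with IBM AI Fairness 360 (AIF360).
import pandas as pd
from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric

df = pd.DataFrame({
    "age_group": [1, 1, 0, 0, 1, 0],   # 1 = privileged group (assumed encoding)
    "label":     [1, 1, 1, 0, 0, 0],   # 1 = favorable outcome
})
dataset = BinaryLabelDataset(df=df, label_names=["label"],
                             protected_attribute_names=["age_group"])

metric = BinaryLabelDatasetMetric(dataset,
                                  privileged_groups=[{"age_group": 1}],
                                  unprivileged_groups=[{"age_group": 0}])
print("Disparate impact:", metric.disparate_impact())
print("Statistical parity difference:", metric.statistical_parity_difference())
```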
Hyperparameter Optimization (HPO)
Similarly, it can be quite tedious and require a lot of practical experience to pick hyperparameters, so it is desirable for a platform to offer more automated and rigorous solutions to the problem. Some HPO frameworks are listed below (a small usage sketch follows the list):
- Bayesian Optimization
- DL4J Arbiter
- GPyOpt (stale!)
- HPOlib2 (stale!)
- hyperopt
- katib
- Optuna
- Polyaxon Optimization Engine
- Ray Tune
- scikit-optimize / skopt – library to minimize expensive black-box functions
- SHERPA
- SigOpt
- Spearmint (stale!)
- Talos
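To give a flavor of how such frameworks are used, here is a minimal Optuna sketch; the objective below is a dummy stand-in for a real training-and-validation run:

```python
# Minimal hyperparameter search with Optuna.
import optuna

def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)  # sample on a log scale
    layers = trial.suggest_int("layers", 1, 4)
    # In practice: train a model with (lr, layers) and return the validation loss.
    return (lr - 0.01) ** 2 + 0.1 * layers                # dummy loss surface

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```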
AutoML & Neural Architecture Search (NAS)
Similarly, the AI system can oftentimes take care of the entire model creation. In order to do so, one needs automated machine learning (AutoML) as well as Neural Architecture Search (NAS) to find neural network topologies. The University of Freiburg maintains a great page about the underlying techniques at automl.org.
Model and Data Catalogs
Another aspect of FfDL’s ecosystem was that there were model marketplaces that allowed you to put a standardized interface on your models and store them in a searchable catalog. Ours were the Model Asset eXchange (MAX) and OpenAIHub.
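As a hedged illustration of what such a standardized interface buys you, the snippet below queries a locally running MAX model container; the /model/predict endpoint and port 5000 follow the pattern documented for MAX models, and the image file is a placeholder:

```python
# Querying a Model Asset eXchange (MAX) container via its REST interface.
import requests

with open("test_image.jpg", "rb") as f:
    resp = requests.post("http://localhost:5000/model/predict",
                         files={"image": f})
print(resp.json())   # standardized JSON: a status field plus model-specific predictions
```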

Model Zoos & Catalogs
Again, nowadays the concept has really caught on and there are lots of model marketplaces to choose from, including commercial offerings:
- Acumos
- Caffe Model Zoo
- IBM BotAssetExchange
- IBM Model Catalog (on Docker Hub)
- IBM Model Asset eXchange (MAX) and Machine Learning eXchange (MLX)
- IBM OpenAIHub
- Caffe2 Models
- CNTK’s Pretrained Model List
- DL4J’s Zoo Models
- Gluon Model Zoo
- Microsoft Azure Gallery
- Microsoft Model Gallery
- MIT ModelDB
- modelzoo.co
- Model Zoo for AI Model Efficiency Toolkit
- mxnet’s Model Zoo
- Neon Model Zoo
- ONNX Models
- PyTorch Models
- TensorFlow Hub, tfhub.dev
- Tensorflow Models
- Torchvision Models
Primarily Commercial Model Marketplaces
- Algorithmia
- BigML Gallery
- Google AI Hub
- ModelDepot
- Wolfram Research’s Neural Net Repository
Data Catalogs
The same holds for data catalogs such as:
- Awesome Public Datasets – see esp. PublicDomains and SearchEngines
- IBM Watson Knowledge Catalog
- Data.gov
- DL4J Open Datasets
- EPA Dataset Gateway
- Google Dataset Search
- US Department of Commerce
- World Bank Data Catalog
Model Formats
Furthermore, some standard formats for exchanging models have emerged, such as the ones below (an export sketch follows the list):
- Neural Network Exchange Format (NNEF)
- ONNX – IBM Research contributed the Tensorflow Backend for ONNX
- Predictive Model Markup Language (PMML)
- Portable Format for Analytics (PFA)
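To make the exchange idea concrete, here is a minimal sketch exporting a PyTorch model to ONNX so that any ONNX-capable runtime or compiler can consume it; the toy model, input shape, and tensor names are assumptions:

```python
# Exporting a PyTorch model to the ONNX exchange format.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(784, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 10),
)
model.eval()

dummy_input = torch.randn(1, 784)   # an example input fixes the graph's shapes
torch.onnx.export(model, dummy_input, "model.onnx",
                  input_names=["input"], output_names=["logits"])
```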
Feature Stores
One category that we did not have on our radar and that has become crucial since is feature stores, like the AWS SageMaker Feature Store, Databricks Feature Store, DoorDash Feature Store, Feast (open-sourced by Gojek and Google), Hopsworks, and others. They allow you not only to store features, but also to automate their computation, share them across the entire lifecycle as well as across different pipelines, and more. This makes them essential to MLOps, which, however, is a world in itself and thus a topic for another post.
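As a hedged sketch of the idea, here is what an online feature lookup with Feast can look like; the repository path, feature view name (driver_stats), and entity key are illustrative assumptions:

```python
# Online feature retrieval with Feast at inference time.
from feast import FeatureStore

store = FeatureStore(repo_path=".")   # directory containing feature_store.yaml
features = store.get_online_features(
    features=["driver_stats:avg_trips", "driver_stats:rating"],
    entity_rows=[{"driver_id": 1001}],
).to_dict()
print(features)
```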