Enterprise Model Training Exemplified Through the Fabric for Deep Learning (FfDL)
I worked on IBM’s Fabric for Deep Learning (FfDL), which was, to the best of my knowledge, the first cloud-based deep learning training solution and which received an InfoWorld Best of Open Source Software (BOSSIE) award in 2018. Let us explore what an Enterprise-grade training service for neural networks looks like.
While the deep learning world moves quickly and today you would probably use Kubeflow and the ecosystem around it, such as mlflow and OpenDataHub, it is still insightful to take a look back at FfDL: In an Enterprise context it is not enough to just train models as you would in most academic settings. Neural models can be prone to adversarial attacks, which they need to be hardened against; you need to run bias detection to make sure they are fair; and you might need to employ explainability techniques, since in many domains (e.g., medicine, law, finance) models that cannot properly justify their decisions are unacceptable. Furthermore, it might be necessary to compress models (e.g., for use on phones or embedded systems), quantify their uncertainty, and so on. There are also the questions of how to stay framework agnostic, how to speed up distributed training, how to continuously retrain models (especially since deep learning is prone to catastrophic forgetting), how to do hyperparameter optimization and automated machine learning, and how to share models via model marketplaces. Finally, you need to track changes to the model, including provenance and model drift, and collect crucial characteristics in factsheets, which serve a purpose similar to nutrition labels on food. FfDL was probably the first open framework that addressed all of these issues comprehensively. Let’s visit these aspects one by one.
Deep Learning Frameworks and Platforms
Let us start with the obvious: There are many deep learning frameworks out there. A few of the more prominent ones are:
- Caffe (BVLC) and Caffe 2.0
- Chainer
- CNTK
- CoreML
- DL4J
- DyNet
- fast.ai
- JAX
- Keras
- Apache MXNet
- PaddlePaddle (PArallel Distributed Deep Learning)
- PyTorch
- Tensorflow
This list could be extended significantly. In order to remain vendor neutral, it is thus necessary to build and maintain many different container images and pods to cover each relevant version of each relevant framework. [Public images are usually not sufficient for this: among many other risks, non-hardened images invite attempts to infiltrate your infrastructure or to run covert bitcoin mining.]
Basic Architecture
The general architecture of the system looks like this:

There is a trainer service that tracks individual model training jobs and a lifecycle manager that controls the provisioning of training pods on top of Kubernetes. You need a training data service, since your training, test, and validation sets need to be loaded from S3/COS and dropped the moment they are no longer needed, but you also need NFS mounts for operative storage. Then there is the actual learner pod, which is specific to the framework used, as well as metrics collectors that push into Prometheus and log collectors that push via fluentd into Elasticsearch, with Kibana on top. For distributed training you can choose between parameter servers, Horovod, and PyTorch’s native distribution. We worked with the Horovod team to have Horovod support ready on launch day and also integrated with other solutions like H2O and Seldon.
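To make the distributed-training option more concrete, here is a minimal sketch of a Horovod-based learner in PyTorch, roughly the kind of script a learner pod would execute. The model, learning-rate scaling, and data handling are placeholder assumptions for illustration, not FfDL’s actual learner code:

```python
# Minimal Horovod learner sketch (PyTorch); model and data are placeholders.
import torch
import horovod.torch as hvd

hvd.init()                                    # one process per learner/GPU
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())   # pin each process to its own GPU

model = torch.nn.Linear(784, 10)
# Common heuristic: scale the learning rate by the number of learners
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across learners via ring all-reduce
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())

# Broadcast initial state from rank 0 so every learner starts identically
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
```

The appeal over classic parameter servers is that the ring all-reduce keeps per-node network traffic roughly constant as you add learners.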
DL Compilers
While we did not put too much emphasis on it in the early days, nowadays you should also look at DL compilers to significantly speed up your models and target specific platforms.
- Glow Compiler by Facebook
- Google XLA
- Intel nGraph
- Latte
- Nvidia TensorRT
- OctoML (on top of Apache TVM)
- Open Neural Network Compiler (ONNC)
- PyTorch Glow
- PlaidML
- TACO – The Tensor Algebra Compiler
- Tensorflow Multi-Level Intermediate Representation (MLIR)
- Tensor Comprehensions
- Tiramisu
- TVM, NNVM
- Libraries like NNPACK, cuDNN, hipDNN/MIOpen etc.
A good example of this is how Amazon SageMaker Neo leverages Apache TVM.
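As a hedged illustration of that flow, the sketch below compiles an ONNX model with TVM’s Relay API; the model file name and input shape are assumptions made for the example:

```python
# Sketch: compiling an exported model with Apache TVM (the engine behind SageMaker Neo).
import onnx
import tvm
from tvm import relay

onnx_model = onnx.load("model.onnx")          # any model exported to ONNX
mod, params = relay.frontend.from_onnx(onnx_model,
                                       shape={"input": (1, 3, 224, 224)})

target = "llvm"                               # e.g. "cuda" to target NVIDIA GPUs
with tvm.transform.PassContext(opt_level=3):  # enable aggressive operator fusion
    lib = relay.build(mod, target=target, params=params)

lib.export_library("compiled_model.so")       # self-contained deployable artifact
```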
Platforms
Nowadays, there is a plethora of AI platforms to choose from, e.g.,
- Acumos AI
- AI Layer
- Airbnb BigHead & data management platform Zipline & metric platform Minerva
- Algorithmia MLOps Platform
- Alibaba Machine Learning Platform for AI (PAI 2.0) and Alibaba Cloud Intelligence Brain
- Apple Overton
- Arya AI
- AWS SageMaker (incl. SageMaker Ground Truth, SageMaker Neo, SageMaker RL)
- Azure ML Studio
- Baidu Brain
- Bonsai
- Data Robot
- Deep Cognition
- DoorDash ML Platform
- eBay Krylov
- Facebook FBLearner
- Flipkart Hunch
- FloydHub
- Gojek’s ML Platform
- Google Vertex AI, also see Google Kubeflow & TFX & Google AutoML
- Groupon Flux
- H2O
- IBM Watson Studio and Watson Machine Learning
- Iguazio
- Intuit ML Platform (based on SageMaker, Argo Workflows, GitOps)
- KNIME
- Kubeflow
- LinkedIn Pro-ML (Productive Machine Learning)
- Lyft Flyte
- Meta AI Looper
- Microsoft OpenPAI, also see Microsoft DL Workspace (DLTS) and Azure ML
- mlflow (Github)
- Netflix Metaflow (and model lifecycle management platform Runway)
- NVIDIA TAO
- OpenAI Rapid
- OpenDataHub (has ties to RHOCP)
- Oracle AI
- Pachyderm
- Pinterest ML Platform
- Polyaxon
- Prowler.io
- RapidMiner
- Emerging Ray-Based Platforms (see Operator First Ray demo and CodeFlare)
- Stripe Railyard – Platform for Model Training
- Apache Submarine
- Swiftstack
- Tensorflow Extended (TFX) (also see Spotify’s TFX-Based ML Platform)
- TransmogrifAI
- Twitter DeepBird
- Uber Michelangelo
- Valohai
- Wix Machine Learning Platform
- Yelp ML Platform
I might write another post in the future comparing the basic architectures. While FfDL’s design is still relevant, it did not have its own inference component, feature store, or data catalog, so looking at architectures like Lyft Flyte and Airbnb BigHead is a worthwhile exercise:

Toolboxes
As mentioned above, a significant part of the value proposition of Enterprise model training goes beyond mere training to supporting the entire lifecycle of models and ensuring their quality, from explainability and debiasing to adversarial hardening, compression, uncertainty quantification, and similar techniques. The following toolkits help with this (a bias-detection sketch follows the list):
- Aequitas – Bias and Fairness Audit Toolkit
- IBM Adversarial Robustness Toolbox (ART) – for adversarial hardening
- IBM AI Fairness 360 (AIF360) – for bias detection
- IBM AI Explainability 360 (AIX360) – for explainability
- IBM Uncertainty Quantification 360 (UQ360) – for uncertainty quantification
- Intel Neural Network Distiller – for neural network compression
- Qualcomm AI Model Efficiency Toolkit
- XAI – An eXplainability toolbox for machine learning
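As a small, hedged example of the bias-detection side, the AIF360 snippet below computes two standard group-fairness metrics on a toy dataset; the column names and the group encoding are illustrative assumptions:

```python
# Toy bias check with IBM AI Fairness 360 (AIF360).
import pandas as pd
from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric

df = pd.DataFrame({
    "age_group": [1, 1, 0, 0, 1, 0],   # 1 = privileged group (assumed encoding)
    "label":     [1, 1, 1, 0, 0, 0],   # 1 = favorable outcome
})
dataset = BinaryLabelDataset(df=df, label_names=["label"],
                             protected_attribute_names=["age_group"])

metric = BinaryLabelDatasetMetric(dataset,
                                  privileged_groups=[{"age_group": 1}],
                                  unprivileged_groups=[{"age_group": 0}])
print("Disparate impact:", metric.disparate_impact())
print("Statistical parity difference:", metric.statistical_parity_difference())
```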
Hyperparameter Optimization (HPO)
Similarly, it can be quite tedious and require a lot of practical experience to pick hyperparameters, so it is desirable for a platform to offer more automated and rigorous solutions to the problem. Some HPO frameworks are listed below (a small usage sketch follows the list):
- Bayesian Optimization
- DL4J Arbiter
- GPyOpt (stale!)
- HPOlib2 (stale!)
- hyperopt
- katib
- Optuna
- Polyaxon Optimization Engine
- Ray Tune
- scikit-optimize / skopt – library to minimize expensive black-box functions
- SHERPA
- SigOpt
- Spearmint (stale!)
- Talos
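To give a flavor of how such frameworks are used, here is a minimal Optuna sketch; the objective below is a dummy stand-in for a real training-and-validation run:

```python
# Minimal hyperparameter search with Optuna.
import optuna

def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)  # sample on a log scale
    layers = trial.suggest_int("layers", 1, 4)
    # In practice: train a model with (lr, layers) and return the validation loss.
    return (lr - 0.01) ** 2 + 0.1 * layers                # dummy loss surface

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```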
AutoML & Neural Architecture Search (NAS)
Similarly, the AI system can oftentimes take care of the entire model creation. In order to do so, one needs automated machine learning (AutoML) as well as Neural Architecture Search (NAS) to find neural network topologies. The University of Freiburg maintains a great page about the underlying techniques at automl.org.
Model and Data Catalogs
Another aspect of FfDL’s ecosystem was that there were model marketplaces that allowed you to put a standardized interface on your models and store them in a searchable catalog. Ours were the Model Asset eXchange (MAX) and OpenAIHub.
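As a hedged illustration of what such a standardized interface buys you, the snippet below queries a locally running MAX model container; the /model/predict endpoint and port 5000 follow the pattern documented for MAX models, and the image file is a placeholder:

```python
# Querying a Model Asset eXchange (MAX) container via its REST interface.
import requests

with open("test_image.jpg", "rb") as f:
    resp = requests.post("http://localhost:5000/model/predict",
                         files={"image": f})
print(resp.json())   # standardized JSON: a status field plus model-specific predictions
```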

Model Zoos & Catalogs
Again, nowadays the concept has really caught on and there are lots of model marketplaces to choose from, including commercial offerings:
- Acumos
- Caffe Model Zoo
- IBM BotAssetExchange
- IBM Model Catalog (on Docker Hub)
- IBM Model Asset eXchange (MAX) and Machine Learning eXchange (MLX)
- IBM OpenAIHub
- Caffe2 Models
- CNTK’s Pretrained Model List
- DL4J’s Zoo Models
- Gluon Model Zoo
- Microsoft Azure Gallery
- Microsoft Model Gallery
- MIT ModelDB
- modelzoo.co
- Model Zoo for AI Model Efficiency Toolkit
- mxnet’s Model Zoo
- Neon Model Zoo
- ONNX Models
- PyTorch Models
- TensorFlow Hub, tfhub.dev
- Tensorflow Models
- Torchvision Models
Primarily Commercial Model Marketplaces
- Algorithmia
- BigML Gallery
- Google AI Hub
- ModelDepot
- Wolfram Research’s Neural Net Repository
Data Catalogs
The same holds for data catalogs such as:
- Awesome Public Datasets – see esp. PublicDomains and SearchEngines
- IBM Watson Knowledge Catalog
- Data.gov
- DL4J Open Datasets
- EPA Dataset Gateway
- Google Dataset Search
- US Department of Commerce
- World Bank Data Catalog
Model Formats
Furthermore, some standard formats for exchanging models have emerged, such as the ones below (an export sketch follows the list):
- Neural Network Exchange Format (NNEF)
- ONNX – IBM Research contributed the Tensorflow Backend for ONNX
- Predictive Model Markup Language (PMML)
- Portable Format for Analytics (PFA)
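To make the exchange idea concrete, here is a minimal sketch exporting a PyTorch model to ONNX so that any ONNX-capable runtime or compiler can consume it; the toy model, input shape, and tensor names are assumptions:

```python
# Exporting a PyTorch model to the ONNX exchange format.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(784, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 10),
)
model.eval()

dummy_input = torch.randn(1, 784)   # an example input fixes the graph's shapes
torch.onnx.export(model, dummy_input, "model.onnx",
                  input_names=["input"], output_names=["logits"])
```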
Feature Stores
One category that we did not have on our radar and that has become crucial since is feature stores, like the AWS SageMaker Feature Store, Databricks Feature Store, DoorDash Feature Store, Feast (open-sourced by Gojek and Google), Hopsworks, and others. They allow you not only to store features, but also to automate their computation, share them across the entire lifecycle as well as across different pipelines, and more. This makes them essential to MLOps, which, however, is a world in itself and thus a topic for another post.
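As a hedged sketch of the idea, here is what an online feature lookup with Feast can look like; the repository path, feature view name (driver_stats), and entity key are illustrative assumptions:

```python
# Online feature retrieval with Feast at inference time.
from feast import FeatureStore

store = FeatureStore(repo_path=".")   # directory containing feature_store.yaml
features = store.get_online_features(
    features=["driver_stats:avg_trips", "driver_stats:rating"],
    entity_rows=[{"driver_id": 1001}],
).to_dict()
print(features)
```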