
Enterprise Model Training Exemplified Through the Fabric for Deep Learning (FfDL)

I worked on IBM’s Fabric for Deep Learning (FfDL), which was, to the best of my knowledge, the first cloud-based deep learning training solution, and which received InfoWorld’s Best of Open Source (BOSSIE) award in 2018. Let us explore what an Enterprise-grade training service for neural networks looks like.

While the deep learning world moves quickly and today you would probably use Kubeflow and the frameworks built upon it, such as MLflow and OpenDataHub, it is still insightful to take a look back at FfDL. In an Enterprise context it is not enough to just train models like you would in most academic settings. Neural models can be prone to adversarial attacks, which they need to be hardened against; you need to run bias detection to make sure they are fair; and you might need to employ explainability techniques, since in many domains (e.g., medicine, law, finance) models that cannot properly justify their decisions are unacceptable. Furthermore, it might be necessary to compress models (e.g., for use on phones or embedded systems), quantify their uncertainty, and so on. There are also the questions of how to stay framework-agnostic, how to speed up distributed training, how to continuously retrain models (especially since deep learning is prone to catastrophic forgetting), how to do hyperparameter optimization and automated machine learning, and how to share models via model marketplaces. Finally, you need to track changes to the model, including provenance and model drift, and collect crucial characteristics in factsheets, which serve a similar purpose to nutrition labels on food. FfDL was probably the first open framework that addressed all of these issues comprehensively. Let’s visit these aspects one by one.
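To make the factsheet idea concrete, here is a minimal sketch of what such a “nutrition label” for a model might capture. The field names and the example values are purely illustrative assumptions, not FfDL’s actual factsheet schema:

```python
from dataclasses import dataclass, field, asdict

@dataclass
class ModelFactsheet:
    """Illustrative 'nutrition label' for a trained model (hypothetical schema)."""
    name: str
    framework: str
    training_data: str                  # provenance of the training set
    intended_use: str
    fairness_checks: list = field(default_factory=list)
    adversarial_hardening: bool = False
    explainability_method: str = "none"

    def to_dict(self) -> dict:
        # Serializable form, e.g. for storing alongside the model artifact
        return asdict(self)

sheet = ModelFactsheet(
    name="loan-approval-v3",
    framework="tensorflow",
    training_data="s3://bucket/loans-2018-q1",
    intended_use="pre-screening only, human review required",
    fairness_checks=["disparate impact ratio"],
)
```

The point is that the factsheet travels with the model, so a consumer can check provenance and quality claims without rerunning the pipeline.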

Deep Learning Frameworks and Platforms

Let us start with the obvious: There are many deep learning frameworks out there. A few of the more prominent ones are:

This list could be extended significantly. To remain vendor-neutral, it is thus necessary to build and maintain many different pod images to cover each of the relevant versions of each of the relevant frameworks. [Usually, public images are not sufficient for this: someone might try to infiltrate your infrastructure via non-hardened images or start Bitcoin mining, among many other risks.]

Basic Architecture

The general architecture of the system looks like this:

There is a trainer service that tracks individual model training jobs, and a lifecycle manager that controls the provisioning of training pods on top of Kubernetes. You need a training data service, since your training, test, and validation sets need to be loaded from S3/COS and dropped the moment they are no longer needed, but you also need NFS mounts for operative storage. Then there is the actual learner pod, which is specific to the framework used, as well as metrics collectors, which push into Prometheus, and log collectors, which push via Fluentd into Elasticsearch with Kibana on top. For distributed training you can choose between parameter servers, Horovod, and PyTorch’s distribution. We worked with the Horovod team to have Horovod 1.0 support on launch day and also integrated with other solutions like H2O and Seldon.
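The trainer service is essentially a state machine over training jobs. The following is a simplified sketch of that idea, not FfDL’s actual code; the state names and transitions are assumptions based on the flow described above (fetch data from object storage, run the learner pod, record the outcome):

```python
from enum import Enum, auto

class JobState(Enum):
    PENDING = auto()
    DOWNLOADING = auto()   # training data being pulled from S3/COS
    RUNNING = auto()       # learner pod executing on Kubernetes
    COMPLETED = auto()
    FAILED = auto()

# Allowed transitions, roughly mirroring the trainer / lifecycle-manager split
TRANSITIONS = {
    JobState.PENDING: {JobState.DOWNLOADING, JobState.FAILED},
    JobState.DOWNLOADING: {JobState.RUNNING, JobState.FAILED},
    JobState.RUNNING: {JobState.COMPLETED, JobState.FAILED},
}

class TrainingJob:
    def __init__(self, job_id: str, framework: str):
        self.job_id = job_id
        self.framework = framework   # selects the framework-specific learner image
        self.state = JobState.PENDING

    def advance(self, new_state: JobState) -> None:
        if new_state not in TRANSITIONS.get(self.state, set()):
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state

job = TrainingJob("job-42", "pytorch")
job.advance(JobState.DOWNLOADING)
job.advance(JobState.RUNNING)
job.advance(JobState.COMPLETED)
```

Guarding transitions like this is what lets the lifecycle manager clean up data and pods deterministically, whatever order failures arrive in.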

DL Compilers

While we did not put too much emphasis on it in the early days, nowadays you should also look at DL compilers to significantly speed up your models and target specific platforms.

A good example of this is how Amazon SageMaker Neo leverages Apache TVM.


Nowadays, there is a plethora of AI platforms to choose from, e.g.,

I might write another post in the future comparing the basic architectures. While FfDL’s design is still relevant, it did not have its own inference component, feature store, or data catalog, so looking at architectures like Lyft’s Flyte and Airbnb’s Bighead is a worthwhile exercise:


As mentioned above, a significant part of the value proposition of Enterprise model training goes beyond mere training to supporting the entire lifecycle of models and ensuring their quality, from explainability over debiasing to adversarial hardening, compression, uncertainty quantification, and similar techniques. The following toolkits help with this:

Hyperparameter Optimization (HPO)

Similarly, picking hyperparameters by hand can be quite tedious and requires a lot of practical experience. For a platform, it is desirable to offer more automated and rigorous solutions to the problem. Some HPO frameworks are:
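Whatever the framework, the core loop is the same: sample candidate configurations from a search space, score each one on a validation metric, and keep the best. A minimal sketch using plain random search follows; the search space, the `objective` function, and its optimum are made up for illustration:

```python
import random

def objective(params: dict) -> float:
    """Stand-in for a validation loss; lower is better.
    A real HPO run would train and evaluate a model here."""
    return (params["lr"] - 0.01) ** 2 + (params["batch_size"] - 64) ** 2 / 1e4

def random_search(space: dict, trials: int = 50, seed: int = 0) -> dict:
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(trials):
        params = {
            "lr": rng.uniform(*space["lr"]),          # continuous range
            "batch_size": rng.choice(space["batch_size"]),  # discrete choices
        }
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params

best = random_search({"lr": (1e-4, 1e-1), "batch_size": [16, 32, 64, 128]})
```

Production HPO frameworks replace the random sampler with smarter strategies (Bayesian optimization, Hyperband, etc.) and parallelize the trials, but the interface is recognizably this loop.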

AutoML & Neural Architecture Search (NAS)

Similarly, oftentimes the AI system can take care of the entire model creation. In order to do so, one needs AutoML as well as Neural Architecture Search (NAS) to find neural network topologies. The University of Freiburg maintains a great page about the underlying techniques. A few additional tools worth looking at are:
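In its simplest form, NAS is HPO over the architecture itself: sample topologies from a discrete search space and rank them by an estimate of their quality. The toy sketch below uses a made-up search space and a proxy score in place of actually training each candidate, which is the expensive part real NAS methods try to avoid:

```python
import random

# Hypothetical architecture search space
SEARCH_SPACE = {
    "depth": [2, 3, 4],
    "width": [32, 64, 128],
    "activation": ["relu", "tanh"],
}

def proxy_score(arch: dict) -> float:
    """Toy stand-in for validation accuracy after a short training run."""
    return arch["depth"] * 0.1 + (1.0 if arch["activation"] == "relu" else 0.9)

def sample(rng: random.Random) -> dict:
    return {key: rng.choice(options) for key, options in SEARCH_SPACE.items()}

def search(trials: int = 20, seed: int = 1) -> dict:
    rng = random.Random(seed)
    candidates = [sample(rng) for _ in range(trials)]
    return max(candidates, key=proxy_score)

best_arch = search()
```

Real NAS systems differ mainly in how they explore the space (reinforcement learning, evolution, differentiable relaxations) and in how they cheapen the evaluation of each candidate.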

Model and Data Catalogs

Another aspect of FfDL’s ecosystem was model marketplaces, which allowed you to put a standardized interface on your models and store them in a searchable catalog. Ours were the Model Asset eXchange (MAX) and OpenAIHub.

Model Zoos & Catalogs

Again, nowadays the concept has really caught on, and there are plenty of model marketplaces to choose from, including commercial offerings:

Primarily Commercial Model Marketplaces

Data Catalogs

The same holds for data catalogs such as:

Model Formats

Furthermore, some standard formats for exchanging models have emerged such as:

Feature Stores

One category that we did not have on our radar and that has become crucial since is feature stores, such as the AWS SageMaker Feature Store, Databricks Feature Store, DoorDash Feature Store, Google Feast (Google Blog), Hopsworks, and others. They allow you not only to store features, but also to automate their computation, share them across the entire lifecycle as well as across different pipelines, and more. This makes them essential to MLOps, which, however, is a world in itself and thus a topic for another post.
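The key idea behind a feature store is that the feature *computation* is registered once and reused by every pipeline, so training and serving cannot drift apart. Here is a minimal in-memory sketch of that contract; the class and method names are my own, not the API of any of the products listed above:

```python
import time
from typing import Callable, Dict, Tuple

class FeatureStore:
    """Toy in-memory feature store: register a transform once,
    materialize values per entity, and read them back anywhere."""

    def __init__(self):
        self._transforms: Dict[str, Callable] = {}
        self._values: Dict[Tuple[str, str], tuple] = {}

    def register(self, name: str, transform: Callable) -> None:
        # The single source of truth for how this feature is computed
        self._transforms[name] = transform

    def materialize(self, name: str, entity_id: str, raw) -> None:
        value = self._transforms[name](raw)
        self._values[(name, entity_id)] = (value, time.time())

    def get(self, name: str, entity_id: str):
        value, _timestamp = self._values[(name, entity_id)]
        return value

store = FeatureStore()
store.register("avg_order_value", lambda orders: sum(orders) / len(orders))
store.materialize("avg_order_value", "user-7", [10.0, 20.0, 30.0])
```

Real feature stores add the parts that make this hard at scale: scheduled and streaming materialization, point-in-time-correct joins for training sets, and a low-latency online serving path.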
