Reinforcement Learning Frameworks – An Overview

Falk Pollok, MIT-IBM Watson AI Lab Cambridge, USA

Abstract. Reinforcement Learning (RL) has seen renewed interest sparked by the successful combination of RL with neural models as well as Monte-Carlo Tree Search (MCTS). At first, this development was largely restricted to playing traditional games and video games, but usage in industry is becoming increasingly widespread, from robotics and autonomous cars to datacenter and warehouse optimization. This survey takes a look at frameworks for RL which, unlike their deep learning counterparts, have not yet seen a consolidation out of which only a few winners emerge. It compares them in terms of focus and functionality and arrives at recommendations for future developments.

I. Introduction

Reinforcement learning (RL) is together with supervised and unsupervised learning one of the three pillars of machine learning. It models how an agent interacts with an environment in order to maximize a cumulative reward. While early RL applications like Gerald Tesauro’s TD-Gammon already used RL algorithms (TD-lambda) in conjunction with a neural network for playing backgammon, RL has seen renewed interest in recent times. This was especially caused by the combination of RL with other techniques: On the one hand with deep learning (DL) for function approximation, e.g. using Deep Q Networks (DQNs) to play Atari games. For games like Go or real-world scenarios like autonomous driving the state space can become prohibitively large, but the approximation of values or policies via DL can alleviate this. On the other hand, there is the combination with Monte-Carlo Tree Search (MCTS) and self-play esp. in the Alpha systems which achieved superhuman performance first in Go and then in chess and shogi as well as StarCraft II before being applied to protein folding. While games are a prevalent application of RL, other applications include autonomous driving, resource allocation, data center cooling, business process optimization, neural architecture search, machine reading, dialog management, fleet logistics and omni-channel marketing among others.

Katsunari Shibata [1] pointed out that many traits naturally emerge in end-to-end RL including attention, exploratory behavior, memory and knowledge transfer. DeepMind’s IMPALA scales from one to thousands of nodes and also allows for advanced techniques like multi-task learning rather than restarting from scratch for every task. Other techniques RL can benefit from include imitation learning to mimic expert behavior and curriculum learning, which eases the learning of complex behavior by “didactically” offering incrementally more difficult tasks. For instance, it is challenging for an agent to learn to pull a chair to a wall in order to jump over it, whereas first learning to directly jump over a low wall and then learning to pull up the chair converges much faster. RL has also been shown to generalize reasonably well within a domain, e.g. by playing over 50 different Atari games with one model. While most of RL focuses on individual agents, the field of multi-agent reinforcement learning (MARL) comprises multiple agents in cooperative or competitive settings. These characteristics motivate the use of frameworks and platforms that support researchers and engineers with reusable componentry, handle system aspects and accommodate the plethora of techniques and scenarios RL can be used with. Interestingly, many researchers report that they primarily use established DL frameworks like PyTorch or Tensorflow rather than specialized RL frameworks. This survey is intended to present a wide overview of RL frameworks along with their value propositions in order to provide these researchers with a more comprehensive set of tools. Finally, this survey focuses on Deep RL (DRL); old-fashioned RL frameworks like BURLAP are outside its scope.

II. RL Frameworks

A. Terminology and Taxonomy of RL Frameworks

An RL framework is a piece of software that provides the foundational componentry to build RL applications or experiments including the core algorithms, exploration strategies, replay buffers, pretrained agents and environment interfaces. An RL environment is a system designed to be interacted with by one or many RL agents, usually via taking actions and observing state changes, reward, additional information and termination. We compare RL frameworks based on programming language, type of supported algorithms, ties to other frameworks, scalability, single or multi agent RL, maturity and popularity. We distinguish three main classes of RL frameworks: Traditional ones focusing on model-free approaches, comprehensive frameworks that cover additional approaches like evolutionary strategies or model-based RL and finally frameworks that exclusively focus on MARL.

B. Traditional Model-Free RL Frameworks

The largest class are DRL frameworks which focus primarily on model-free RL, a well-established class of algorithms. Baselines, for instance, is a collection of RL baseline implementations that was started by OpenAI, but is in maintenance status and has been criticized for having suboptimal documentation and modularization. There is a popular fork out of INRIA, Stable Baselines, that improves over the original version in several regards like better documentation, test coverage, Tensorboard and Jupyter notebook support, a common interface inspired by scikit-learn as well as custom policy and callback support (e.g. for monitoring). It also provides more algorithms and a zoo of over 100 pre-trained agents.

Tensorflow has the highest number of DRL frameworks that are based on it. Dopamine [2] aims at rapid prototyping of research ideas, but is restricted to value-based RL. Its authors currently focus on Rainbow extensions like n-step Bellman updates, prioritized experience replay (PER) and distributional extensions in an Atari environment (Gym wrapper for ALE), but also aim at reproducibility. There are currently four algorithms available: A basic DQN, Rainbow as well as C51 as a special parameterization of it and IQN. It comes with Tensorboard support and is configured via the Gin framework. DeepMind released TensorFlow Reinforcement Learning (TRFL, pronounced “truffle”) which is a collection of mathematical primitives for reinforcement learning. While useful for research, it is much more technical and fine grained than the other frameworks discussed here. For instance, it contains a TD(0) learning loss, an Expected SARSA (SARSE) loss and V-trace actor-critic targets. Several of its functions (e.g. Distributional Double Q-Learning) are tied to Tensorflow. Huskarl [3] is based on Tensorflow 2.0 and Keras. It has aspirations to add Unity environments, curiosity-driven exploration and MARL, but currently primarily works with traditional algorithms against Gym environments. It has basic parallelization support, i.e. it can run multiple environment instances in parallel on several CPU cores, but seems to lack more sophisticated approaches. Keras-RL [4] is a Keras-based DRL framework. While very popular according to Github stars, it does not seem too well modularized and has minor downsides regarding updates, documentation and reliability; at the time of writing its builds are failing. It supports Weights & Biases to plot its metrics and while it seems to have Tensorboard support, its status is unclear. TF-Agents [5], unlike Dopamine and TRFL, has more mature documentation and comes with multi-armed bandit agents and environments. Like Dopamine it uses gin-config and it can leverage TF-Eager for debugging. It also allows for training on multiple instances in parallel via its ParallelPyEnvironment. Tensorflow 2 Reinforcement Learning (tf2rl) is a very young framework that supports model-free RL as well as imitation learning. simple_rl [6] was inspired by BURLAP and focuses on simplicity: Its fundamental concepts are agents (including reference implementations) and environments (based on an MDP class). The latter can be a GymMDP for OpenAI Gym, object-oriented MDPs (OOMDP), a k-armed bandit, a POMDP, a probability distribution over MDPs or a Markov Game, cf. [6]. The components hide the complexity needed to track experiments and visualize results. Every experiment run creates a file with the exact parameters used that can be leveraged for reproducibility. However, the implementation is still a work in progress and currently limited to pure MDPs. Finally, simple_rl comes with a set of utilities like a planning module that includes value iteration, MCTS, bounded RTDP and a sparse sampling algorithm.
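
The n-step Bellman update mentioned above for Dopamine's Rainbow extensions can be illustrated with a few lines of plain Python. This is only a minimal sketch: the helper name `n_step_target` and the discount factor `gamma` are assumptions made for illustration and are not taken from Dopamine or TRFL. With a single reward the function reduces to the TD(0) target.

```python
def n_step_target(rewards, bootstrap_value, gamma=0.99):
    """Return sum_{k<n} gamma^k * r_k + gamma^n * bootstrap_value.

    `rewards` holds the n rewards observed after the current state and
    `bootstrap_value` is the value estimate of the state reached after n steps.
    """
    target = bootstrap_value
    # Fold the rewards in from the last step backwards.
    for reward in reversed(rewards):
        target = reward + gamma * target
    return target
```

A larger n propagates observed rewards faster, at the price of higher variance in the target.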

Tensorforce [7], [8] originated at the University of Cambridge and strives to be language agnostic by keeping the RL logic in TF computation graphs. Moreover, it strictly separates abstract RL algorithms from domain-specific information like concrete input and output structure to keep them universally applicable, and it comes with a plethora of components including different memory types, optimization algorithms, training strategies as well as environment adapters for environments like ALE, Gym and Retro. In [8] the authors criticize that some open source RL frameworks rely on fixed neural network architectures and may internally apply heuristics to reduce complexity without making this properly transparent to the user. Avoiding this is another goal of the framework. Tensorforce is the backend of LIFT, an end-to-end stack for DRL.

Similarly, there are many PyTorch-based frameworks for model-free RL. Catalyst.RL [9] is part of a larger ecosystem (“Catalyst.Ecosystem”) which also includes Alchemy (experiment logging and visualization), MLcomp (DAG-based ML pipelines with web GUI) and Reaction (model serving). It has support for distributed training and stores all parameters in yaml configuration files for reproducibility. Furthermore, it leverages TensorboardX to visualize metrics. On-policy algorithms and off-policy algorithms for discrete control settings are still absent from the library. Lagom [10] covers both model-free RL and evolutionary strategies. It supports basic parallelization and can perform hyperparameter optimization via both grid search and random search. Furthermore, it has some basic visualization support. While most RL frameworks do not incorporate all three paradigms of model-free RL, i.e. Q-learning, policy gradient and Q-value policy gradient approaches, rlpyt [11] is focused on supporting all three in a common framework and targets small to medium scale research projects. Its main components are a collection of modular RL componentry as well as infrastructure for parallel execution. It uses PyTorch’s solutions for multi-GPU (NCCL) and multi-CPU optimization (gloo). It does not support fully asynchronous optimization schemes, but can run the sampler and optimizer asynchronously via a replay buffer. Historically, rlpyt originated at Berkeley and is based on accel_rl which in turn was strongly influenced by rllab (now garage). SLM-Lab provides modularized versions of many RL algorithms, especially model-free ones, but is unique in that it uses class inheritance to represent which algorithms followed each other in research. For instance, SARSA influenced DQN on which Double DQN and Double DQN with PER were based, so this is the inheritance structure in SLM-Lab. SURREAL [12] out of Stanford focuses on robotics as well as distributed RL training and simulation. According to their repository they can “scale to thousands of CPUs and hundreds of GPUs”. They combine on-policy (PPO) and off-policy (DDPG) approaches by following an actor model for producing experience data in parallel with a centralized buffer and model learner (with multi-GPU capabilities). For on-policy training this buffer can simply be a FIFO queue whose contents are trained on directly, whereas for off-policy learning it is a replay memory that allows for batch sampling of collected experience. In order to support reproducibility as well as scalability SURREAL distinguishes four infrastructure layers: Its provisioner is used to provision cloud resources, the Kubernetes-based orchestrator provides the API, the SURREAL protocol handles communication and the algorithms are the actual RL implementations. Their direct target is the SURREAL Robotics Suite with tasks such as block lifting and stacking as well as nut-and-peg assembly.
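
The difference between the two buffer types described for SURREAL can be sketched as follows. The class names `FifoBuffer` and `ReplayMemory` are hypothetical and the code is not taken from SURREAL; it only illustrates that on-policy data is consumed once in arrival order, while off-policy data stays in a bounded memory and is sampled in batches.

```python
import random
from collections import deque


class FifoBuffer:
    """On-policy: experience is consumed once, in arrival order."""

    def __init__(self):
        self._queue = deque()

    def add(self, transition):
        self._queue.append(transition)

    def next_batch(self, batch_size):
        # Pop the oldest transitions; they are not reused afterwards.
        count = min(batch_size, len(self._queue))
        return [self._queue.popleft() for _ in range(count)]

    def __len__(self):
        return len(self._queue)


class ReplayMemory:
    """Off-policy: bounded store that is sampled uniformly at random."""

    def __init__(self, capacity, seed=None):
        self._storage = deque(maxlen=capacity)  # oldest entries are evicted
        self._rng = random.Random(seed)

    def add(self, transition):
        self._storage.append(transition)

    def sample(self, batch_size):
        # Sampling without replacement; transitions stay in the memory.
        return self._rng.sample(list(self._storage), batch_size)

    def __len__(self):
        return len(self._storage)
```

Uniform sampling is chosen here only for brevity; a prioritized scheme such as PER would replace the `sample` method.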

Facebook ReAgent [13] is targeted at production, e.g. to optimize streaming ABR for 360 Video, for M suggestions in Messenger and to maximize notification relevance. Its training is executed in PyTorch (including distribution), whereas serving is done in Caffe2 via ONNX. With large datasets and slow feedback loops ReAgent is forced to choose different approaches than pure research frameworks. For instance, data preprocessing can be run via Spark and there is a feature classification mechanism that distinguishes feature types (binary, probability, continuous, enum, quantile or boxcox) to derive how to normalize them during training. Furthermore, ReAgent employs Counterfactual Policy Evaluation (CPE) to estimate agent performance offline and thus avoid extensive A/B testing (even though it can be combined with it) and degradation of user experience. The conflict between conventional RL algorithms that benefit from shuffling in order to obtain pseudo-i.i.d. data and CPE which requires cumulative, step-wise data is addressed by sampling data during training and sorting it at the end of each epoch to retrieve the original sequence and then conduct CPE. All of the supported algorithms are off-policy and thus do not require exploration at runtime and can hence wait days for a reward signal. ReAgent currently supports various types of DQN as well as DDPG and SAC.
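
The shuffle-then-sort idea described above can be sketched in a few lines. This is an illustration under stated assumptions and not ReAgent's implementation: the function `train_then_restore_order`, the callback `train_step` and the dictionary keys `episode_id` and `step` are all hypothetical names.

```python
import random


def train_then_restore_order(transitions, train_step, seed=0):
    """Train on shuffled transitions, then return them in their original order.

    Each transition is assumed to be a dict with the hypothetical keys
    'episode_id' and 'step' that identify its position in the logged sequence.
    """
    shuffled = list(transitions)
    random.Random(seed).shuffle(shuffled)  # pseudo-i.i.d. order for training
    for transition in shuffled:
        train_step(transition)
    # Sorting at the end of the epoch recovers the step-wise sequence for CPE.
    return sorted(shuffled, key=lambda t: (t["episode_id"], t["step"]))
```

The returned list can then be handed to whichever step-wise evaluation is used, while training itself never sees the data in logged order.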

Some RL frameworks are built upon other DL frameworks: ChainerRL [14] is based on Chainer. It delivers a wide selection of algorithms, exploration techniques, neural network architectures, replay buffers, distributions and action values spanning different training approaches, i.e. serial as well as synchronous and asynchronous parallel training, and comes with a dedicated visualization framework called ChainerRL-Visualizer. For reproducibility it provides single file implementations of papers which have been verified to authentically reproduce the published results. PaddlePaddle Reinforcement Learning (PARL) seems similar to Dopamine in that it focuses on only a few RL algorithms, namely DQN, DDPG and PPO. It has three basic abstractions: a model that is the policy or critic network, the algorithm that updates the model’s parameters and finally the agent component that connects the algorithm with an environment. Intervention Aided Reinforcement Learning (IARL) seems prepared, but was still absent at the time of writing.
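
The three abstractions described for PARL (model, algorithm, agent) can be made concrete with a toy example. The classes below only mirror that division of responsibilities; they are hypothetical, do not use PARL's API and replace the neural network with a simple table of action preferences.

```python
import random


class Model:
    """Policy or critic network; here a table of action preferences."""

    def __init__(self, num_actions):
        self.preferences = [0.0] * num_actions

    def best_action(self):
        return max(range(len(self.preferences)), key=self.preferences.__getitem__)


class Algorithm:
    """Updates the model's parameters from observed rewards."""

    def __init__(self, model, learning_rate=0.1):
        self.model = model
        self.learning_rate = learning_rate

    def learn(self, action, reward):
        # Move the preference of the taken action towards the observed reward.
        error = reward - self.model.preferences[action]
        self.model.preferences[action] += self.learning_rate * error


class Agent:
    """Connects the algorithm with an environment."""

    def __init__(self, algorithm, epsilon=0.1, seed=0):
        self.algorithm = algorithm
        self.epsilon = epsilon
        self._rng = random.Random(seed)

    def act(self):
        # Epsilon-greedy exploration over the model's preferences.
        if self._rng.random() < self.epsilon:
            return self._rng.randrange(len(self.algorithm.model.preferences))
        return self.algorithm.model.best_action()

    def run(self, reward_fn, steps):
        # reward_fn stands in for a one-step environment.
        for _ in range(steps):
            action = self.act()
            self.algorithm.learn(action, reward_fn(action))
```

Because the agent only talks to the algorithm and the algorithm only talks to the model, each layer can be exchanged without touching the others.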

However, it is not clear that depending on any particular DL framework is desirable. Some frameworks strive for agnosticism instead: Garage  [15], the successor of rllab, originated at Berkeley and OpenAI and supports both PyTorch and Tensorflow. Besides the typical RL components like algorithms, replay buffers and samplers it offers Tensorboard integration, reproducibility features and checkpointing. MushroomRL  [16] covers value-based, policy-based and actor-critic methods. While generally agnostic to frameworks, it generally bases its DRL code on PyTorch. A goal of the framework is to support rapid prototyping by providing a comprehensive set of components that are easy to combine and extend (e.g. via callbacks) while hiding low-level details. One way to achieve this is its common interface for RL techniques which comprises both shallow and deep RL, on- and off-policy methods, batch and online training as well as episodic and infinite horizon tasks.

C. Comprehensive Frameworks

Many frameworks go beyond model-free RL. DeeR covers a wide range of RL classes, i.e. value-, policy- and model-based RL. While it is largely agnostic to DL frameworks, its examples use Keras. One unusual concept is that it provides controllers which can adapt parameters during training.

Ray  [17] is a real-time AI platform developed by Berkeley’s RISELab and includes a wider ecosystem including the RL components library RLlib  [18] which resides on top of Ray, Tune for HPO, etc. Ray is based on the realization that emerging AI workloads differ from previous workloads in that rather than single predictions they require sequences of actions in dynamic rather than static environments where rewards might be delayed as opposed to immediate feedback in traditional ML settings like supervised learning. As a result, these workloads are much more heterogeneous – rather than merely maximizing GPU utilization in batch-style training, they might have more CPU-intensive and more GPU-intensive phases and they might be modeled better as dynamic task graphs. They are also more likely to combine entirely different approaches like DL, RL, Automated Planning (AP), reasoning, Monte-Carlo Tree Search (MCTS) and simulations with fine-grained data and arbitrary task dependencies. At the same time, they need to scale to hundreds or even thousands of nodes with sub-millisecond latencies with up to millions of tasks per second while remaining fault tolerant. Ray is based on annotations and thus minimally invasive to a codebase. By adding @ray.remote to functions, they get converted into asynchronously callable remote methods that covertly put their arguments into an object store and replace their original return value by a future. The ray.remote annotation can contain CPU and GPU requirements, e.g. @ray.remote(num_gpus=1), as well as custom resources like datasets, accelerators like FPGAs or neuromorphic devices, but also memory configurations. By adding the same annotation to classes, they get converted into actors, thus delivering actor-based programming (not unlike SURREAL) through the same mechanism. The main difference between actors and remote functions is that actors can carry state which is handy for simulators or neural networks.
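
The way an annotation can turn a function call into a future is sketched below with Python's standard library. This is an analogy rather than Ray code: the decorator `remote` and the function `rollout` are hypothetical, execution happens in local threads, and Ray's object store, resource requirements and cluster scheduling are left out. In Ray itself the corresponding calls are `f.remote(...)` to submit and `ray.get(...)` to retrieve results.

```python
from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=4)


def remote(func):
    """Hypothetical stand-in for a remote annotation: adds a .remote() method."""

    def submit(*args, **kwargs):
        # Returns immediately with a future instead of the result.
        return _executor.submit(func, *args, **kwargs)

    func.remote = submit
    return func


@remote
def rollout(seed, steps):
    # Placeholder workload standing in for a simulation episode.
    return sum((seed * k) % 7 for k in range(steps))


# Launch four rollouts concurrently, then block until all results are ready.
futures = [rollout.remote(seed, 100) for seed in range(4)]
results = [future.result() for future in futures]
```

The calling code stays almost unchanged compared to a sequential version, which is the sense in which the approach is minimally invasive.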

Intel RL Coach  [19] provides imitation learning and MARL besides value- and policy-based methods. It goes beyond OpenAI Gym and also supports environments like DeepMind Control Suite, Starcraft II, CARLA Gym Extensions and Roboschool. To improve reproducibility Coach employs rigorous testing (called Benchmarks) that run each algorithm against a subset of the environments used in the original paper to ensure the results match the published claims. Coach provides a dashboard that can not only compare multiple experiments, but also show e-metrics in real time as well as for multiple actors if the algorithm uses them (e.g. A3C). Coach can horizontally scale out these rollout workers with synchronization either being synchronous for on-policy methods or asynchronous for off-policy training. Similar to Ray’s dynamic task graph Coach represents agents and environments in a graph. For hierarchical RL (HRL) settings this graph can become complex and contain multiple levels and master policy agents can direct sub-policy agents. This mechanism has three stages – a heatup to fill the replay buffers, the actual training phase where the agent runs against the environment to learn a policy and finally an evaluation phase where the agent only exploits the learned policy (averaged over multiple runs) in order to assess its performance. Additional features include its input embedders, middleware layer and output heads structure for neural networks.
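
The three stages described above (heatup, training, evaluation) can be sketched on a toy one-step task. The function `run_three_phases` and all of its parameters are hypothetical and do not reflect Coach's API; the sketch merely shows how the phases differ in what the agent does with the environment.

```python
import random


def run_three_phases(reward_fn, num_actions, heatup_steps=50,
                     train_steps=200, eval_runs=5, seed=0):
    """Heatup, training and evaluation on a toy one-step task."""
    rng = random.Random(seed)
    replay_buffer = []
    values = [0.0] * num_actions

    # 1) Heatup: fill the replay buffer with random experience, no learning.
    for _ in range(heatup_steps):
        action = rng.randrange(num_actions)
        replay_buffer.append((action, reward_fn(action)))

    # 2) Training: act epsilon-greedily, store experience, learn from samples.
    for _ in range(train_steps):
        if rng.random() < 0.1:
            action = rng.randrange(num_actions)
        else:
            action = max(range(num_actions), key=values.__getitem__)
        replay_buffer.append((action, reward_fn(action)))
        sampled_action, sampled_reward = rng.choice(replay_buffer)
        values[sampled_action] += 0.1 * (sampled_reward - values[sampled_action])

    # 3) Evaluation: exploit the learned values only, averaged over several runs.
    greedy = max(range(num_actions), key=values.__getitem__)
    scores = [reward_fn(greedy) for _ in range(eval_runs)]
    return sum(scores) / len(scores), len(replay_buffer)
```

Separating evaluation from training keeps exploration noise out of the reported performance.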

RLgraph  [20] is focused on building backend-agnostic component graphs. It thus separates component composition from the actual backends for deep learning and distribution. For instance, it can run as a Tensorflow compute graph, in PyTorch, on top of Ray or via Uber’s / LFAI’s MPI-based Horovod. It also allows testing of subgraphs and generally accelerates rapid prototyping. The fact that the authors of Tensorforce and RLgraph largely overlap is embodied in the agent API which is similar between both projects.

D. MARL Frameworks

PyMARL is a framework out of Oxford’s Whiteson Research Lab. It is based on PyTorch and was released in conjunction with SMAC, the Starcraft Multi-Agent Challenge which is also its target environment. While MARL capabilities of frameworks like PyMARL, RLlib and Coach are generally targeted at only a few agents, MAgent  [21] focuses on many-agent reinforcement learning, i.e. settings with up to one million agents on one GPU server, to research Artificial Collective Intelligence (ACI). It comes with three settings: Pursuit (yielding predator formations for hunting prey), gathering (resulting in food gathering behavior) and battle (emerging mixture of collaboration and competition).

III. RL Environments

In Artificial Intelligence one usually distinguishes environments based on seven to eight axes:

  • Is it simulated, situated or embodied? Simulated environments run in a separate simulation process, in situated settings the agent operates directly in an environment and embodied means that the agent has a physical manifestation in the real world.
  • Is it static or dynamic? Dynamic environments can change while the agent takes an action, static ones cannot.
  • Are the action and observation space discrete or continuous, i.e. is there a fixed number of actions or perceptions respectively?
  • Is it fully or partially observable, i.e. can the agent observe the entire environment at once?
  • Is it episodic or sequential? In episodic environments the agent gets independent rewards for every action, whereas in sequential settings it only gets a reward after a number of steps.
  • Is it a single or multi-agent environment?
  • Is the environment known or unknown, i.e. does the agent have a model of the environment dynamics? This axis is also reflected in model-free vs model-based RL and might thus depend more on the type of RL than the type of environment.

The arguably most relevant environment is OpenAI Gym  [22] which is a standardized interface for single agent RL. It comes with eight types of environments: Textual ones, algorithmic ones, Atari 2600 games (with pixels or RAM content as input), continuous control tasks in the Box2D simulator, continuous control tasks in the commercial MuJoCo simulator, classic control theory tasks, simulated robotics goal-based tasks and custom environments. Interacting with a Gym environment tends to follow the same pattern: The environment is reset in step 0 which returns the first observation. Afterwards, steps are taken each of which returns a 4-tuple consisting of the next observation, reward, a boolean indicating whether the episode has ended and a field with debug information that should not be used in the agent’s decision process. Gym is widely used and supported by virtually all RL frameworks.
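
The interaction pattern described above can be sketched without installing Gym. The class `CoinFlipEnv` is a hypothetical stand-in that only mimics the reset/step shape with the 4-tuple described in the text; it is not part of Gym, and later Gym releases changed these signatures.

```python
import random


class CoinFlipEnv:
    """Hypothetical stand-in that mimics the reset/step pattern described above."""

    def __init__(self, horizon=10, seed=0):
        self._horizon = horizon
        self._rng = random.Random(seed)
        self._t = 0
        self._coin = 0

    def reset(self):
        self._t = 0
        self._coin = self._rng.randint(0, 1)
        return self._coin  # first observation

    def step(self, action):
        reward = 1.0 if action == self._coin else 0.0
        self._t += 1
        self._coin = self._rng.randint(0, 1)
        done = self._t >= self._horizon
        info = {"t": self._t}  # debug information, not for decision making
        return self._coin, reward, done, info


env = CoinFlipEnv(horizon=10)
observation = env.reset()
total_reward, done = 0.0, False
while not done:
    action = observation  # trivial policy: guess the coin that is shown
    observation, reward, done, info = env.step(action)
    total_reward += reward
```

Any agent that follows this loop can be pointed at a different environment with the same interface, which is what makes the standardization valuable.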

However, Gym environments are generally less well suited for specialized settings like MARL or multi-modality. Most environments are either 2D or 3D games including video games, video game engines and traditional games like card or board games. Regarding 2D this includes the Arcade Learning Environment, Facebook ELF, the MAME Toolkit, Multi-Agent Particle Environment, OpenAI Retro (which supersedes RLE and Universe), DeepMind OpenSpiel and the Hanabi Learning Environment among others. Regarding 3D this includes AI2-THOR, AnimalAI Olympics, CHALET, DeepMind Lab, DeepMind SC2LE, Facebook House3D, Habitat, Holodeck, HoME, Malmö, Matterport3D, MIT ThreeDWorld (TDW), the OpenAI Multiagent Competition, OpenAI RoboSumo, Unity ML-Agents and VizDoom.

Another particularly popular class are gridworlds and mazes including BabyAI, DeepMind pycolab, Facebook MazeBase, mazelab and many others. A new class of environments is specifically targeted at control and safety, e.g. DeepMind AI Safety Gridworlds, DeepMind Control Suite and Safety Gym. However, many other RL environments do exist, for instance for trading, task queueing, emergency response, scheduling and theorem proving.

Finally, it is feasible to leverage existing simulation software within RL environments, in particular for robotics including autonomous cars, drones and other vehicles, but also industrial robots, e.g. AirSim, Gibson env, Gym Gazebo, MINOS, MuJoCo, the Neurorobotics Platform (NRP), OpenAI RoboSchool, PaddlePaddle RLSchool, robosuite and the S-RL Toolbox. Furthermore, this includes discrete event simulation like AnyLogic, Siemens PLM or SIMUL8, chemical and engineering simulations like CHEMCAD, MATLAB & Simulink or Sinumerik, but also medical or pharmacological (e.g. GastroPlus), networking (e.g. CloudSim or NS3), military (e.g. BISim VBS4), transportation (e.g. Anylogistix) and even urban or governmental planning simulations (e.g. UrbanSim).

IV. Visualization

Visualization and dashboards are important to track and debug reinforcement learning algorithms and training progress. Currently, there are three major dashboard options for RL: For TensorBoard there is an interpretability dashboard extension by Andrew Schreiber that was recently added to TensorBoard. One distinguishing factor in this extension is that it can render the environment with added perturbation saliency heatmaps which visualize where the model attention is focused. Some RL frameworks like Stable Baselines also have explicit instructions on how to integrate with TensorBoard. If the underlying DL framework is not Tensorflow, but e.g. Chainer or PyTorch, TensorboardX can be used. An alternative is Facebook’s Visdom for more human-like RL. Intel Coach comes with its own dashboard that can visualize signals from several workers – A3C, for instance, spawns multiple actors. It also allows combining multiple runs of the algorithm into one set to account for the fact that RL algorithms tend to be unstable.
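
The idea behind a perturbation saliency heatmap can be sketched in a few lines: each input element is occluded in turn and the change in the model's output is recorded. This is a generic illustration and not the algorithm of the TensorBoard extension mentioned above; the function `perturbation_saliency`, the callable `value_fn` and the `baseline` value are assumptions, and the observation is a flat list rather than an image.

```python
def perturbation_saliency(value_fn, observation, baseline=0.0):
    """Score each input element by how much replacing it changes value_fn."""
    reference = value_fn(observation)
    saliency = []
    for index in range(len(observation)):
        perturbed = list(observation)
        perturbed[index] = baseline  # occlude a single element
        saliency.append(abs(value_fn(perturbed) - reference))
    return saliency
```

For image observations the same loop would typically run over patches instead of single pixels, and the scores would be rendered as a heatmap over the frame.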

While Guo et al.  [23] are primarily interested in applying DQNs to Atari games, they visualize their first- and second-layer filters and additionally use the optimal stimuli method to show which features the CNN learned, thus conveying which input patches generated the largest response.

Mnih et al.  [24] use two-dimensional t-SNE embeddings to visualize the representation learned by their DQN by recording the last hidden layer representations of the DQN for each game state for 2 hours of gameplay. While it is not surprising that this procedure mapped visually similar states to points that are spatially close to each other, it is interesting that it also did so for states that are similar in estimated value. Furthermore, when mapping both human and AI states into the same space both reveal a similar structure which indicates that the learned representations generalize to data that originated from a foreign policy. Finally, they also visualize the value and action-value functions over time.

Zahavy et al.  [25] contribute both a methodology and a set of tools to understand DQNs which they use to explain three Atari games. For instance, they employ three-dimensional t-SNE representations to visualize state transitions and leverage saliency maps to highlight which image regions have the highest impact on the neural network’s value predictions. Finally, they introduce Semi Aggregated MDPs (SAMDPs) that provide clear spatio-temporal abstractions leading towards subgoal detection.

V. Industry Solutions

The following section gives a broad overview of RL usage in industry. It is clearly desirable that future DRL frameworks are more targeted towards productive use in Enterprise and industrial applications.

AWS launched Sagemaker RL on November 28th 2018. It supports Intel Coach, Ray RLlib and Baselines as toolkits that provide agent implementations and differentiates four kinds of environments: The AWS-specific simulation environments Sumerian and RoboMaker, open source environments like RoboSchool, Gym or EnergyPlus, custom environments and commercial simulators like MATLAB with Simulink. The latter three are offered via custom containers; MATLAB additionally requires users to manage their own license. Note that Baselines seems to be Stable Baselines. Furthermore, Sagemaker RL supports distributed training and HPO.

Bonsai is a Berkeley-based startup that was acquired by Microsoft on June 20th 2018. It currently has 144 employees according to LinkedIn with a Series A funding of $13.6 million according to Crunchbase. Besides Bonsai Microsoft also recently acquired Maluuba to form Microsoft Montreal and thus its Reinforcement Learning Group. They are primarily working in Optimization, Control as well as Monitoring and Maintenance. They integrate with MATLAB and Simulink in order to work with engineering models developed with these tools, such as wind turbine controls. http://prowler.io/ is a Cambridge, UK-based startup with a Series A funding of $14.9 million according to Crunchbase. Their Vuku architecture has two main components: A decision making component is used to choose actions and a learning system learns predictive models.

In research DeepMind and OpenAI are among the most famous institutes. OpenAI also has an internal platform called OpenAI Rapid that seems to be primarily used by their DotA team, but that is also offered to other teams within the company. Borealis AI is an RL-focused research institute out of the Royal Bank of Canada. Imandra provides reasoning as a service including symbolic reasoning and formal verification, but also incorporates DRL with applications in finance, robotics and autonomous systems. nnaisense, Jürgen Schmidhuber’s research-focused company, applies RL to a wide range of applications including autonomous systems and finance.

Robotics is one core field of RL. micropsi industries is based in Germany and provides robotics software and in particular the machine teaching & control system MIRAI which is trained end-to-end and targeted at assembly tasks. Osaro provides combined computer vision and decision making solutions for robotic systems, especially in warehouse automation for distribution centers and manufacturing. Covariant (formerly Embodied Intelligence) is a California-based startup co-founded by Pieter Abbeel for machine teaching via deep RL and imitation learning. AI4Things builds DRL-based solutions for intelligent machines. DoraBot provides robotics for logistics and SoarTech provides autonomous systems and decision support for military applications.

Many companies in autonomous vehicle research (AVR) leverage RL including traditional car manufacturers like BMW and Mercedes Benz, transportation service providers like Lyft’s Palo Alto-based Level 5 Self-Driving Division and Uber’s Advanced Technologies Group as well as companies like Aptiv, Waymo or Bosch. Oxford-based LatentLogic builds realistic behavior models via imitation learning for humanoid agents in simulations for autonomous cars by first extracting and then imitating behavior. Wayve out of Cambridge University develops a self-driving car software platform.

In finance, AI Capital Management and HiHedge offer DRL solutions for trading. In e-Learning Qstream uses it for Microlearning, Desire2Learn provides a learning management system with DRL-based learning material recommendation. In business process optimization CogitAI develops a DRL-based continual learning platform called Continua, DataOne a platform for intelligent business decisions, InstaDeep process optimization for the energy sector, logistics, manufacturing and mobility and MediaGamma develops DRL-based decision support, e.g. for advertising and customer acquisition. PerfectPattern leverages DRL for industrial process control and management. PerimeterX offers predictive security against botnet attacks. Phenomic and ProteinQure apply the technology to drug discovery. Rasa leverages DRL approaches for chatbots, dialog systems and virtual assistants. ThruAI enriches customer service with RL. OPTIMAL is a London- and Rotterdam-based company developing autonomous indoor farming solutions.

This brief market overview is also aimed at showing the discrepancy between requirements in industry and the current focus of most DRL frameworks, e.g. regarding applications, lifecycles and toolchains.

VI. Discussion

The consolidation we observed with deep learning frameworks still has to happen for RL. Most DRL experts seem to use PyTorch and TF directly rather than RL frameworks and frequently cite the amount of flexibility and control they need for their research, which goes beyond just combining existing components. MARL, model-based RL, safety and multimodality are oftentimes ignored, since the majority of frameworks focus on subsets of RL, esp. model-free approaches. Support for reproducibility, an inherent challenge since RL agents interact with dynamic environments, is beginning to emerge: Catalyst, ChainerRL, garage and Dopamine address it explicitly. Distribution is now more widely addressed, but oftentimes frameworks only support single-node parallelism rather than true multi-node horizontal scalability. While there are individual exceptions including Ray and SURREAL, cloud-native architecture and Kubernetes are usually not leveraged, even though Custom Resource Definitions (CRDs) might work very well as infrastructure abstractions for agents. Similarly, while evolutionary strategies are supported by frameworks such as garage, Lagom and RLlib and imitation learning by RLlib, RL Coach, SLM Lab, tf2rl and TF Agents, general mechanisms for curriculum learning, hierarchical RL and hybrid agent design still seem underdeveloped.
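
The reproducibility concern can be made concrete with a minimal sketch. The toy environment NoisyWalk and the helper rollout are hypothetical names introduced only for illustration and do not belong to any of the frameworks discussed here. The sketch merely assumes that every source of randomness is drawn from an explicitly seeded generator, in which case two runs with the same seed yield the same trajectory.

```python
import random


class NoisyWalk:
    """Hypothetical toy environment: a 1-D walk with noisy transitions."""

    def __init__(self, seed):
        # A private generator avoids hidden coupling through global RNG state.
        self.rng = random.Random(seed)
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # The chosen move (-1 or +1) is flipped with probability 0.2.
        move = action if self.rng.random() > 0.2 else -action
        self.position += move
        reward = 1.0 if self.position > 0 else 0.0
        done = abs(self.position) >= 5
        return self.position, reward, done


def rollout(seed, max_steps=50):
    """Run one episode with a random policy; return a list of (position, reward)."""
    env = NoisyWalk(seed)
    policy_rng = random.Random(seed + 1)  # separate generator, also seeded
    env.reset()
    trajectory = []
    for _ in range(max_steps):
        action = policy_rng.choice([-1, 1])
        position, reward, done = env.step(action)
        trajectory.append((position, reward))
        if done:
            break
    return trajectory


if __name__ == "__main__":
    # Identical seeds give identical trajectories.
    print(rollout(seed=0) == rollout(seed=0))
```

In practice, seeding alone is usually not sufficient for deep RL, since parallel execution and non-deterministic GPU operations can introduce further variation; the sketch only covers the environment and policy randomness.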

From a larger perspective, it is striking that DRL systems exhibit behavior reminiscent of human intuition and are capable of finding novel strategies, but seem to struggle with the deep analysis one would expect from symbolic approaches. This leads us to suspect that neural-symbolic approaches could prove vital in developing AI systems that exhibit both characteristics. Finally, RL frameworks and applications are still predominantly research- and game-oriented and usually lack support for industrial and enterprise applications like digital twins or lifecycle management from coarse- to fine-grained simulation up to real-world deployment. There is still no interface for MARL and multimodality that is as universally accepted as Gym. Thus, we expect a significant shift in the medium term towards more consolidated frameworks that are closer to real-world applications, leverage cloud-native architecture more naturally and provide a wider selection of techniques.
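
To illustrate what a Gym-like interface amounts to, the following sketch mimics the reset/step convention popularized by Gym [22] in plain Python, without importing Gym itself, and contrasts it with one possible multi-agent variant in which observations, actions and rewards are dictionaries keyed by agent name. All names (CounterEnv, TwoAgentCounterEnv, run_episode, the "all" key) are hypothetical, and the dictionary-keyed form is only an assumption about how such an interface could look, not an established standard.

```python
class CounterEnv:
    """Hypothetical single-agent environment following the reset/step convention."""

    def __init__(self, target=3):
        self.target = target
        self.count = 0

    def reset(self):
        self.count = 0
        return self.count  # initial observation

    def step(self, action):
        # action is 0 (wait) or 1 (increment)
        self.count += action
        done = self.count >= self.target
        reward = 1.0 if done else 0.0
        return self.count, reward, done, {}  # observation, reward, done, info


class TwoAgentCounterEnv:
    """Hypothetical multi-agent variant: every quantity is a dict keyed by agent."""

    def __init__(self, target=3):
        self.agents = ["a", "b"]
        self.envs = {name: CounterEnv(target) for name in self.agents}

    def reset(self):
        return {name: env.reset() for name, env in self.envs.items()}

    def step(self, actions):
        # Assumes that `actions` contains one entry per agent.
        obs, rewards, dones = {}, {}, {}
        for name in self.agents:
            obs[name], rewards[name], dones[name], _ = self.envs[name].step(actions[name])
        # The episode ends once all agents have reached their target.
        dones["all"] = all(dones[name] for name in self.agents)
        return obs, rewards, dones, {}


def run_episode(env, policy, max_steps=10):
    """Generic single-agent interaction loop; returns the cumulative reward."""
    obs = env.reset()
    total = 0.0
    for _ in range(max_steps):
        obs, reward, done, _ = env.step(policy(obs))
        total += reward
        if done:
            break
    return total
```

The single-agent loop in run_episode works for any environment that honors the same four-tuple contract, which is what makes such an interface attractive as a common denominator; the multi-agent variant shows that even a simple extension already requires additional conventions, e.g. for signaling when the joint episode ends.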


| Name | GitHub Stars | License | Framework / Language | Algorithms | Integrated Environments | Distributed Execution | Type of RL | Types of Componentry | Affiliation |
| OpenAI Baselines | 9200 | MIT | | A2C, ACER, ACKTR, DDPG, DQN, GAIL, HER, PPO, TRPO | | | Model-Free RL | | OpenAI |
| Stable Baselines | 1600 | MIT | | A2C, ACER, ACKTR, DDPG, DQN, GAIL, HER, PPO, SAC, TD3, TRPO | | | Model-Free RL | | INRIA, ParisTech |
| CatalystRL | 1400 | Apache 2.0 | PyTorch | DQN, DDPG, PPO, SAC, TD3 | Gym incl. Atari, DM Control Suite | Yes | Model-Free RL | | Dbrain |
| ChainerRL | 793 | MIT | Chainer | DQN, DDQN, Categorical DQN, Rainbow, IQN, DDPG, A2C, A3C, ACER, NSQ, PCL, PPO, TRPO, TD3, SAC, PAL, Double PAL, DPP, REINFORCE | Gym, ALE, Mujoco, Bullet | Only multiprocessing | Model-Free RL | Dedicated Visualizer | Preferred Networks |
| DeeR | 435 | BSD | Agnostic, Keras | DQN, DDPG, CRAR | Gym, ALE, PLE | | Model-Free and Model-Based RL | | |
| Dopamine | 8600 | Apache 2.0 | Tensorflow, Keras | DQN, C51, Rainbow, IQN | | | Value-Based RL | | Google |
| garage | 587 | MIT | PyTorch, Tensorflow | CEM, CMA-ES, REINFORCE, DDPG, DQN, DDQN, ERWR, NPO, PPO, REPS, TD3, TNPG, TRPO | | | ES, Model-Free RL | | Berkeley, OpenAI |
| Huskarl | 383 | MIT | Tensorflow, Keras | DQN, Multi-Step DQN, Double DQN, Dueling DQN, A2C, DDPG, PER, PPO | Gym, Unity planned | Parallelization only | Model-Free RL | | |
| KerasRL | 4400 | MIT | Keras | DQN, Double DQN, DDPG, CDQN / NAF, CEM, Dueling DQN, SARSA | Gym, extendable | | Value-Based RL | | |
| Lagom | 355 | MIT | PyTorch | CEM, CMA-ES, OpenAI-ES, VPG, PPO, DDPG, TD3, SAC | Gym, DM Control Suite via dm2gym | Parallelization only | ES and Model-Free RL | | |
| MAgent | 1100 | MIT | Agnostic with baseline algorithms in MXNet and Tensorflow | DQN, DRQN, A2C | | Scales to 1M agents on single GPU server, multi-server support unclear | MARL (Many-Agent) | | Geek.ai |
| Mushroom RL | 278 | MIT | PyTorch, Agnostic | Q-Learning, SARSA, FQI, DQN, DDPG, SAC, TD3, TRPO, PPO, LSPI, PGPE, RWR, eNAC, REINFORCE, GROMDP, REPS, COPDAC-Q, R-Learning, A2C, Stochastic Actor-Critic, True Online SARSA-λ, Expected SARSA | Gym, DeepMind Control Suite, MuJoCo, PyBullet, ROS | | Model-Free RL | | MPI, TU Darmstadt, Politecnico di Milano |
| PARL | 570 | Apache 2.0 | PaddlePaddle | DQN, DDPG, PPO, IMPALA, A2C, TD3, SAC | | Yes | Model-Free RL | | Baidu |
| PyMARL | 357 | Apache 2.0 | PyTorch | QMIX, COMA, VDN, IQL, QTRAN | Tied to SMAC as environment | | MARL | | |
| Ray, RLlib | 10200 | Apache 2.0 | Ray, Python, DL framework agnostic | DDPG, TD3, A2C, A3C, PPO / APPO, IMPALA, Ape-X, DQN, Rainbow, Vanilla Policy Gradient, SAC, ARS, ES, QMIX, VDN, IQN, MADDPG, MARWIL | | Yes | ES, MARL, Model-Free RL | | Berkeley |


| Name | GitHub Stars | License | Framework / Language | Algorithms | Integrated Environments | Distributed Execution | Type of RL | Types of Componentry | Affiliation |
| ReAgent | 2400 | BSD | PyTorch, TorchScript | DQN, Double DQN, Dueling DQN, Dueling Double DQN, C51, QR-DQN, TD3, SAC | | | Model-Free RL | | Facebook |
| RL Coach | 1600 | Apache 2.0 | Tensorflow | DQN, BootstrappedDQN, UCB via Q Ensembles, QR-DQN, DDQN, Dueling DDQN with PER, MMC, NEC, N-Step Q-Learning, PAL, NAF, Categorical DQN, Rainbow, PG, A3C, PPO, SAC, ACER, Clipped PPO, DDPG, DDPG with HER, DDPG HAC, TD3, Wolpertinger, DFP, Behavioral Cloning, CIL | CARLA, Gym, Gym Extensions, Roboschool, ViZDoom, PyBullet, StarCraft, DeepMind Control Suite | Yes | IL, MARL, Model-Free RL | Algorithms, Exploration Techniques, Memory Types | Intel |
| RLgraph | 220 | Apache 2.0 | Agnostic including TF, PyTorch and Ray | DQN, Double DQN, Dueling DQN, PER, DQFD, Ape-X, IMPALA, PPO, SAC, A2C, A3C, REINFORCE | | Yes | Model-Free RL | | University of Cambridge, rlcore, Helmut Schmidt University |
| rlpyt | 1200 | MIT | PyTorch | A2C, PPO, DQN, Double DQN, DDPG, TD3, SAC, Dueling DQN, Categorical DQN | | | Model-Free RL | | Berkeley |
| simple rl | 124 | Apache 2.0 | Python | Q-Learning, Rmax, DelayedQ, DoubleQ, Random, DQN, LinUCB, Linear Q-Learning, Planning (Value Iteration, Bounded RTDP, MCTS) | | | Value-Based RL, Planning | | |
| SLM Lab | 656 | MIT | PyTorch | SARSA, DQN, Double DQN, Dueling DQN, PER, REINFORCE, A2C, PPO, SAC, SIL | Gym, Roboschool, VizDoom, Unity | Parallelization only | Model-Free RL, IL | | |
| SURREAL | 391 | BSD | PyTorch | DDPG, PPO | | Yes | Model-Free RL | | Stanford |
| Tensorforce | 2600 | Apache 2.0 | Tensorflow | DQN, Double DQN, Dueling DQN, n-step DQN, NAF, REINFORCE, A3C, PPO, TRPO, DPG | ALE, Gym, MazeExplorer, Retro, OpenSim, PLE, ViZDoom | Parallel Execution of Agent and Environment | Model-Free RL | | |
| TF Agents | 1100 | Apache 2.0 | Tensorflow, Keras | DQN, DQN-RNN, DDQN, DDPG, TD3, REINFORCE, PPO, PPO-RNN, SAC, Behavioral Cloning | Gym, Atari, Mujoco, PyBullet, DM Control Suite, Unity ML Agents | In Development (Multi GPU, TPU) | Model-Free RL, IL | Agents, Environments, Replay Buffer | Google |
| TRFL | 2800 | Apache 2.0 | Tensorflow | | | | Model-Free RL | | DeepMind |

VII. Addendum

A. Frameworks

B. Environments

The following environment list was influenced by RLenv.directory.

Comprehensive Multi-Domains or Hybrid Environments

C. 2D Games and Environments as well as Traditional Games

This section includes gridworlds.

2D Video Games and Simulations

Board Games and Other Traditional Games

Grid Worlds and Mazes

3D Video Games and Non-Realistic Simulations

Robotics and Realistic Simulations
This section includes autonomous vehicle research.

  • Acrobot V-REP – Gym environment for acrobot on V-REP platform with DDPG algorithm, built on Keras-RL
  • AirGym – AirSim integration for Gym and Keras-RL for autonomous quadrocopter
  • AirSim  [66] – Simulator for diverse vehicles like drones & cars, supports both UE and Unity as well as hardware-in-loop
  • Factory RL Gazebo – Youbot in factory environment based on gym-gazebo
  • Gibson env  [67] – Virtual environment simulator, has integration for onboard camera (“Goggles”)
  • Gym-Duckietown – Self-driving car Gym environments for Duckietown platform, platform started at MIT, simulator at Mila, also includes features for transfer to robot
  • Gym Gazebo  [68] – Extension to original Gym for robotics via Gazebo and ROS
  • Gym V-REP – Gym extension based on V-REP
  • MINOS  [69] – Simulation for multisensory indoor navigation models
  • MuJoCo  [70], usually used via mujoco-py
  • Multi-contact-grasping – Grasp-and-lift process with Barrett Hand in V-REP
  • Neurorobotics Platform (NRP)
  • OpenAI RoboSchool – deprecated in favor of PyBullet
  • PaddlePaddle RLSchool – RL environments for PaddlePaddle, currently includes elevator and quadrocopter simulation
  • PyBullet Gymperium – Alternative, Gym-compatible and free implementations of MuJoCo environments via Bullet physics engine with some Tensorforce agents
  • robosuite, designed to work well with SURREAL
  • Robot Gym – Gazebo Gym environment for RL-based and evolutionary robotics, runs robot in maze
  • Robot Learning Gym (RLG) – Robotics tasks for multiple robots, tasks and RL algorithms to get comparative results with standardized metrics
  • Robotiq-UR5 – MuJoCo-based simulator of UR5 robotic arm with Robotiq gripper
  • SdSandbox – Self-driving car simulator based on Unity and Keras as well as Nvidia PilotNet; Donkey Gym (Gym environment for donkeycar) was extracted from sdsandbox
  • Self Driving Sim Gym – Simple 2D Gym-style environments for RL on autonomous car setting with intersection and traffic environment
  • Self driving car sim – Unity-based autonomous car simulator by Udacity
  • S-RL Toolbox  [71] – RL and State Representation Learning (SRL) for robotics with 10 Stable-Baselines algorithms and HPO, also see SRL Zoo

Control and Safety

Finance, Medicine and Abstract Domains

  • AgentSimulator – Predator & prey simulator in Java
  • Banana Gym – Stochastic Gym environment based on banana selling setting
  • Btgym – Gym environment for Backtrader trading library
  • EnergyPy – RL experiments on energy environments based on Tensorflow
  • EnMAS – Environment for Multi-Agent Simulation, framework for specifying POMDP or POSG problems and agents with clean specification syntax and client-server architecture
  • GamePad  [75] – Python library providing environment for Coq Interactive Theorem Proving (ITP)
  • Gym-bitflip – Bitflip environment, suited for Hindsight Experience Replay
  • Gym-BSS – Gym environment for Bike Sharing System
  • Gym-ERSLE – Gym environment for Emergency Response System (ERS) to solve ambulance allocation problem
  • GymFC  [76] – Gym environment for intelligent flight control systems
  • Gym Memory – Simple 2D Gym environments for memory experiments inspired by rodent experiments
  • Gym Music – Abstract music Gym environment and rewards based on Magenta’s RL Tuner
  • Gym RLCryptocurrency – Gym environment for RLCryptocurrency
  • MiniWoB++  [77] – Extended version of OpenAI’s MiniWoB benchmark that can interact with the web via Selenium
  • Misc RL – Gym environment for ForEx trading
  • OpenSim RL – Environments with musculoskeletal model, part of the NIPS 2018 AI for Prosthetics challenge, uses OpenSim for biomechanical simulation
  • Personae – Environment for quantitative trading including stock and future trading
  • PGPortfolio – DRL framework for financial portfolio management, based on  [78]
  • RL aqs – Queuing simulator for adaptive task assignment problems via RL control (SARSA), can run on clusters via MPI
  • Stock Market RL – Keras-based Gym environment for stock market trading, supports PG and DQN, not updated for 2 years
  • TradingGym – Gym-like Trading environment for both RL and rule-based approaches

Language and Communication

D. Simulators

This list was adapted from Mark Hammond, O’Reilly AI San Francisco 2017.

Architecture and Urban Planning

Chemistry

Discrete Events

Game-Based

Mechanical and Electrical Engineering

Medicine and Biotech

Military

Cloud and Networking

Robotics

Transportation

Vehicle (Air, Land, Sea, Space)

E. Algorithms

I follow the Algorithm Taxonomy from  [83] and additionally follow the categorization from Intel Coach as well as OpenAI’s Key Papers in Deep RL collection.

  1. Model-Free Methods
    1 a) Value-Based Optimization

    1 b) Policy-Based Optimization and Policy-Gradient

Model-Based Methods

Hierarchical RL

Imitation Learning and Inverse Reinforcement Learning (IRL)

Transfer and Multitask RL

Meta Reinforcement Learning

Memory

Exploration Techniques

Distributed RL

RL Safety

Evolutionary Strategies

Further RL Algorithms

F. Companies

Autonomous Vehicle Research (AVR)

General & Research Labs

  • Borealis AI – AI research institute out of Royal Bank of Canada
  • DeepMind – Famous for beating top human Go players with Alpha Go, success in playing Atari games and developing Neural Turing Machine, acquired by Alphabet
  • Imandra – Provides reasoning as a service including symbolic reasoning and formal verification, but also incorporates DRL with applications in finance, robotics and autonomous systems
  • nnaisense – Jürgen Schmidhuber’s research-focused company to apply AI to wide range of applications including autonomous systems and finance
  • OpenAI – non-profit institute for researching safe artificial general intelligence with for-profit arm OpenAI LP
  • Rasa – RL-driven NLU for dialog systems and virtual assistants

Process Optimization, Finance, Security and Pharma

  • AI Capital Management – DRL for Trading
  • Bonsai – DRL solutions for industrial applications including energy systems, HVAC, manufacturing and process automation, based in Berkeley, acquired by Microsoft
  • Cogitai – Continua platform for decision support and business process optimization
  • DataOne – platform for intelligent business decisions, uses RL for model retraining
  • Desire2Learn – learning management system with learning material recommendation
  • HiHedge – DRL for Trading
  • InstaDeep – process optimization for energy sector, manufacturing, mobility and logistics
  • MediaGamma – DRL-based decision support, e.g. for advertising and customer acquisition
  • OPTIMAL – autonomous indoor farming solutions
  • PerfectPattern – Industrial Process Control and Management
  • PerimeterX – Bot Defender product provides predictive security against botnet attacks via RL
  • Phenomic – Drug discovery and antibody development with RL suggesting promising syntheses
  • ProteinQure – Drug discovery
  • Prowler – Provides VUKU platform for decision making in diverse domains like autonomous systems, finance and logistics
  • Qstream – Microlearning

Robotics

  • AI4Things – Industrial and agricultural robotics as well as personal delivery and pest control
  • Boston Dynamics
  • Covariant (was Embodied Intelligence) – robot teaching
  • DoraBot – Robotics for logistics
  • micropsi industries – MIRAI robotics platform
  • Osaro – Warehouse automation and robotics, e.g. for distribution centers and manufacturing
  • SoarTech – Autonomous systems and decision support for military

References

[1]    K. Shibata, “Functions that emerge through end-to-end reinforcement learning,” CoRR, vol. abs/1703.02239, 2017. [Online]. Available: http://arxiv.org/abs/1703.02239

[2]    P. S. Castro, S. Moitra, C. Gelada, S. Kumar, and M. G. Bellemare, “Dopamine: A Research Framework for Deep Reinforcement Learning,” 2018. [Online]. Available: http://arxiv.org/abs/1812.06110

[3]    D. Salvadori, “huskarl,” https://github.com/danaugrs/huskarl, 2019.

[4]    M. Plappert, “keras-rl,” https://github.com/keras-rl/keras-rl, 2016.

[5]    S. Guadarrama and A. Korattikara, “Tf-agents: A library for reinforcement learning in tensorflow,” 2018, [Online; accessed 25-June-2019]. [Online]. Available: https://github.com/tensorflow/agents

[6]    D. Abel, “simple_rl: Reproducible reinforcement learning in python,” 2019.

[7]    M. Schaarschmidt, A. Kuhnle, and K. Fricke, “Tensorforce: A tensorflow library for applied reinforcement learning,” Web page, 2017. [Online]. Available: https://github.com/reinforceio/tensorforce

[8]    M. Schaarschmidt, A. Kuhnle, B. Ellis, K. Fricke, F. Gessert, and E. Yoneki, “LIFT: reinforcement learning in computer systems by learning from demonstrations,” CoRR, vol. abs/1808.07903, 2018. [Online]. Available: http://arxiv.org/abs/1808.07903

[9]    S. Kolesnikov and O. Hrinchuk, “Catalyst.rl: A distributed framework for reproducible RL research,” CoRR, vol. abs/1903.00027, 2019. [Online]. Available: http://arxiv.org/abs/1903.00027

[10]    X. Zuo, “lagom: A pytorch infrastructure for rapid prototyping of reinforcement learning algorithms,” https://github.com/zuoxingdong/lagom, 2018.

[11]    A. Stooke and P. Abbeel, “rlpyt: A research code base for deep reinforcement learning in pytorch,” 2019.

[12]    L. Fan, Y. Zhu, J. Zhu, Z. Liu, O. Zeng, A. Gupta, J. Creus-Costa, S. Savarese, and L. Fei-Fei, “Surreal: Open-source reinforcement learning framework and robot manipulation benchmark,” in Conference on Robot Learning, 2018.

[13]    J. Gauci, E. Conti, Y. Liang, K. Virochsiri, Y. He, Z. Kaden, V. Narayanan, and X. Ye, “Horizon: Facebook’s open source applied reinforcement learning platform,” CoRR, vol. abs/1811.00260, 2018. [Online]. Available: http://arxiv.org/abs/1811.00260

[14]    Y. Fujita, T. Kataoka, P. Nagarajan, and T. Ishikawa, “Chainerrl: A deep reinforcement learning library,” 2019.

[15]    Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, “Benchmarking deep reinforcement learning for continuous control,” CoRR, vol. abs/1604.06778, 2016. [Online]. Available: http://arxiv.org/abs/1604.06778

[16]    C. D’Eramo, D. Tateo, A. Bonarini, M. Restelli, and J. Peters, “Mushroomrl: Simplifying reinforcement learning research,” 2020.

[17]    P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, W. Paul, M. I. Jordan, and I. Stoica, “Ray: A distributed framework for emerging AI applications,” CoRR, vol. abs/1712.05889, 2017. [Online]. Available: http://arxiv.org/abs/1712.05889

[18]    E. Liang, R. Liaw, R. Nishihara, P. Moritz, R. Fox, J. Gonzalez, K. Goldberg, and I. Stoica, “Ray rllib: A composable and scalable reinforcement learning library,” CoRR, vol. abs/1712.09381, 2017. [Online]. Available: http://arxiv.org/abs/1712.09381

[19]    I. Caspi, G. Leibovich, G. Novik, and S. Endrawis, “Reinforcement learning coach,” Dec. 2017. [Online]. Available: https://doi.org/10.5281/zenodo.1134899

[20]    M. Schaarschmidt, S. Mika, K. Fricke, and E. Yoneki, “RLgraph: Flexible Computation Graphs for Deep Reinforcement Learning,” ArXiv e-prints, Oct. 2018.

[21]    L. Zheng, J. Yang, H. Cai, W. Zhang, J. Wang, and Y. Yu, “Magent: A many-agent reinforcement learning platform for artificial collective intelligence,” 2017.

[22]    G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” 2016.

[23]    X. Guo, S. Singh, H. Lee, R. L. Lewis, and X. Wang, “Deep learning for real-time atari game play using offline monte-carlo tree search planning,” in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2014, pp. 3338–3346. [Online]. Available: http://papers.nips.cc/paper/5421-deep-learning-for-real-time-atari-game-play-using-offline-monte-carlo-tree-search-planning.pdf

[24]    V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015. [Online]. Available: http://dx.doi.org/10.1038/nature14236

[25]    T. Zahavy, N. Ben-Zrihem, and S. Mannor, “Graying the black box: Understanding dqns,” CoRR, vol. abs/1602.02658, 2016. [Online]. Available: http://arxiv.org/abs/1602.02658

[26]    D. Cortes, “Adapting multi-armed bandits policies to contextual bandits scenarios,” arXiv preprint arXiv:1811.04383, 2018.

[27]    Z. Shangtong, “Modularized implementation of deep rl algorithms in pytorch,” https://github.com/ShangtongZhang/DeepRL, 2018.

[28]    P. Andersen, M. Goodwin, and O. Granmo, “Flashrl: A reinforcement learning platform for flash games,” CoRR, vol. abs/1801.08841, 2018. [Online]. Available: http://arxiv.org/abs/1801.08841

[29]    I. Caspi, G. Leibovich, G. Novik, and S. Endrawis, “Reinforcement learning coach,” Dec. 2017. [Online]. Available: https://doi.org/10.5281/zenodo.1134899

[30]    T. Schaul, J. Bayer, D. Wierstra, Y. Sun, M. Felder, F. Sehnke, T. Rückstieß, and J. Schmidhuber, “Pybrain,” J. Mach. Learn. Res., vol. 11, pp. 743–746, Mar. 2010. [Online]. Available: http://dl.acm.org/citation.cfm?id=1756006.1756030

[31]    B. Tanner and A. White, “Rl-glue: Language-independent software for reinforcement-learning experiments,” J. Mach. Learn. Res., vol. 10, pp. 2133–2136, Dec. 2009. [Online]. Available: http://dl.acm.org/citation.cfm?id=1577069.1755857

[32]    A. Geramifard, C. Dann, R. H. Klein, W. Dabney, and J. P. How, “Rlpy: A value-function-based reinforcement learning framework for education and research,” Journal of Machine Learning Research, vol. 16, no. 46, pp. 1573–1578, 2015. [Online]. Available: http://jmlr.org/papers/v16/geramifard15a.html

[33]    A. Mandlekar, Y. Zhu, A. Garg, J. Booher, M. Spero, A. Tung, J. Gao, J. Emmons, A. Gupta, E. Orbay, S. Savarese, and L. Fei-Fei, “Roboturk: A crowdsourcing platform for robotic skill learning through imitation,” in Conference on Robot Learning, 2018.

[34]    A. Kirsch, “MDP environments for the openai gym,” CoRR, vol. abs/1709.09069, 2017. [Online]. Available: http://arxiv.org/abs/1709.09069

[35]    M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, “The arcade learning environment: An evaluation platform for general agents,” CoRR, vol. abs/1207.4708, 2012. [Online]. Available: http://arxiv.org/abs/1207.4708

[36]    Y. Tian, Q. Gong, W. Shang, Y. Wu, and L. Zitnick, “ELF: an extensive, lightweight and flexible research platform for real-time strategy games,” CoRR, vol. abs/1707.01067, 2017. [Online]. Available: http://arxiv.org/abs/1707.01067

[37]    Y. Li, H. Chang, Y. Lin, P. Wu, and Y. F. Wang, “Deep reinforcement learning for playing 2.5d fighting games,” CoRR, vol. abs/1805.02070, 2018. [Online]. Available: http://arxiv.org/abs/1805.02070

[38]    J. K. Gupta, M. Egorov, and M. Kochenderfer, “Cooperative multi-agent control using deep reinforcement learning,” in International Conference on Autonomous Agents and Multiagent Systems. Springer, 2017, pp. 66–83.

[39]    S. Jiang, “Multi agent reinforcement learning environments compilation,” 2019.

[40]    R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” CoRR, vol. abs/1706.02275, 2017. [Online]. Available: http://arxiv.org/abs/1706.02275

[41]    I. Mordatch and P. Abbeel, “Emergence of grounded compositional language in multi-agent populations,” CoRR, vol. abs/1703.04908, 2017. [Online]. Available: http://arxiv.org/abs/1703.04908

[42]    A. Nichol, V. Pfau, C. Hesse, O. Klimov, and J. Schulman, “Gotta learn fast: A new benchmark for generalization in rl,” arXiv preprint arXiv:1804.03720, 2018.

[43]    N. Bhonker, S. Rozenberg, and I. Hubara, “Playing snes in the retro learning environment,” arXiv preprint arXiv:1611.02205, 2016.

[44]    M. Chevalier-Boisvert, D. Bahdanau, S. Lahlou, L. Willems, C. Saharia, T. H. Nguyen, and Y. Bengio, “Babyai: First steps towards grounded language learning with a human in the loop,” CoRR, vol. abs/1810.08272, 2018. [Online]. Available: http://arxiv.org/abs/1810.08272

[45]    S. Sukhbaatar, A. Szlam, G. Synnaeve, S. Chintala, and R. Fergus, “Mazebase: A sandbox for learning from games,” CoRR, vol. abs/1511.07401, 2015. [Online]. Available: http://arxiv.org/abs/1511.07401

[46]    X. Zuo, “mazelab: A customizable framework to create maze and gridworld environments.” https://github.com/zuoxingdong/mazelab, 2018.

[47]    E. Kolve, R. Mottaghi, D. Gordon, Y. Zhu, A. Gupta, and A. Farhadi, “AI2-THOR: an interactive 3d environment for visual AI,” CoRR, vol. abs/1712.05474, 2017. [Online]. Available: http://arxiv.org/abs/1712.05474

[48]    B. Beyret, J. Hernández-Orallo, L. Cheke, M. Halina, M. Shanahan, and M. Crosby, “The animal-ai environment: Training and testing animal-like artificial cognition,” 2019.

[49]    C. Yan, D. K. Misra, A. Bennett, A. Walsman, Y. Bisk, and Y. Artzi, “CHALET: cornell house agent learning environment,” CoRR, vol. abs/1801.07357, 2018. [Online]. Available: http://arxiv.org/abs/1801.07357

[50]    D. K. Misra, A. Bennett, V. Blukis, E. Niklasson, M. Shatkhin, and Y. Artzi, “Mapping instructions to actions in 3d environments with visual goal prediction,” CoRR, vol. abs/1809.00786, 2018. [Online]. Available: http://arxiv.org/abs/1809.00786

[51]    C. Beattie, J. Z. Leibo, D. Teplyashin, T. Ward, M. Wainwright, H. Küttler, A. Lefrancq, S. Green, V. Valdés, A. Sadik, J. Schrittwieser, K. Anderson, S. York, M. Cant, A. Cain, A. Bolton, S. Gaffney, H. King, D. Hassabis, S. Legg, and S. Petersen, “Deepmind lab,” CoRR, vol. abs/1612.03801, 2016. [Online]. Available: http://arxiv.org/abs/1612.03801

[52]    J. Z. Leibo, C. de Masson d’Autume, D. Zoran, D. Amos, C. Beattie, K. Anderson, A. G. Castañeda, M. Sanchez, S. Green, A. Gruslys, S. Legg, D. Hassabis, and M. Botvinick, “Psychlab: A psychology laboratory for deep reinforcement learning agents,” CoRR, vol. abs/1801.08116, 2018. [Online]. Available: http://arxiv.org/abs/1801.08116

[53]    M. Savva, A. Kadian, O. Maksymets, Y. Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V. Koltun, J. Malik, D. Parikh, and D. Batra, “Habitat: A Platform for Embodied AI Research,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.

[54]    Y. Wu, Y. Wu, G. Gkioxari, and Y. Tian, “Building generalizable agents with a realistic and rich 3d environment,” CoRR, vol. abs/1801.02209, 2018. [Online]. Available: http://arxiv.org/abs/1801.02209

[55]    J. Greaves, M. Robinson, N. Walton, M. Mortensen, R. Pottorff, C. Christopherson, D. Hancock, and D. Wingate, “Holodeck: A high fidelity simulator,” 2018.

[56]    S. Brodeur, E. Perez, A. Anand, F. Golemo, L. Celotti, F. Strub, J. Rouat, H. Larochelle, and A. C. Courville, “Home: a household multimodal environment,” CoRR, vol. abs/1711.11017, 2017. [Online]. Available: http://arxiv.org/abs/1711.11017

[57]    M. Johnson, K. Hofmann, T. Hutton, and D. Bignell, “The malmo platform for artificial intelligence experimentation.” AAAI – Association for the Advancement of Artificial Intelligence, July 2016. [Online]. Available: https://www.microsoft.com/en-us/research/publication/malmo-platform-artificial-intelligence-experimentation/

[58]    P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sünderhauf, I. D. Reid, S. Gould, and A. van den Hengel, “Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments,” CoRR, vol. abs/1711.07280, 2017. [Online]. Available: http://arxiv.org/abs/1711.07280

[59]    A. X. Chang, A. Dai, T. A. Funkhouser, M. Halber, M. Nießner, M. Savva, S. Song, A. Zeng, and Y. Zhang, “Matterport3d: Learning from rgb-d data in indoor environments,” 2017 International Conference on 3D Vision (3DV), pp. 667–676, 2017.

[60]    T. Bansal, J. Pachocki, S. Sidor, I. Sutskever, and I. Mordatch, “Emergent complexity via multi-agent competition,” CoRR, vol. abs/1710.03748, 2017. [Online]. Available: http://arxiv.org/abs/1710.03748

[61]    M. Al-Shedivat, T. Bansal, Y. Burda, I. Sutskever, I. Mordatch, and P. Abbeel, “Continuous adaptation via meta-learning in nonstationary and competitive environments,” CoRR, vol. abs/1710.03641, 2017. [Online]. Available: http://arxiv.org/abs/1710.03641

[62]    A. Kumar, N. Paul, and S. N. Omkar, “Bipedal walking robot using deep deterministic policy gradient,” CoRR, vol. abs/1807.05924, 2018. [Online]. Available: http://arxiv.org/abs/1807.05924

[63]    A. Kanervisto and V. Hautamäki, “Torille: Learning environment for hand-to-hand combat,” CoRR, vol. abs/1807.10110, 2018. [Online]. Available: http://arxiv.org/abs/1807.10110

[64]    A. Juliani, V. Berges, E. Vckay, Y. Gao, H. Henry, M. Mattar, and D. Lange, “Unity: A general platform for intelligent agents,” CoRR, vol. abs/1809.02627, 2018. [Online]. Available: http://arxiv.org/abs/1809.02627

[65]    M. Kempka, M. Wydmuch, G. Runc, J. Toczek, and W. Jaskowski, “Vizdoom: A doom-based AI research platform for visual reinforcement learning,” CoRR, vol. abs/1605.02097, 2016. [Online]. Available: http://arxiv.org/abs/1605.02097

[66]    S. Shah, D. Dey, C. Lovett, and A. Kapoor, “Airsim: High-fidelity visual and physical simulation for autonomous vehicles,” in Field and Service Robotics, 2017. [Online]. Available: https://arxiv.org/abs/1705.05065

[67]    F. Xia, A. R. Zamir, Z.-Y. He, A. Sax, J. Malik, and S. Savarese, “Gibson env: real-world perception for embodied agents,” in Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on. IEEE, 2018.

[68]    I. Zamora, N. G. Lopez, V. M. Vilches, and A. H. Cordero, “Extending the openai gym for robotics: a toolkit for reinforcement learning using ROS and gazebo,” CoRR, vol. abs/1608.05742, 2016. [Online]. Available: http://arxiv.org/abs/1608.05742

[69]    M. Savva, A. X. Chang, A. Dosovitskiy, T. A. Funkhouser, and V. Koltun, “MINOS: multimodal indoor simulator for navigation in complex environments,” CoRR, vol. abs/1712.03931, 2017. [Online]. Available: http://arxiv.org/abs/1712.03931

[70]    E. Todorov, T. Erez, and Y. Tassa, “Mujoco: A physics engine for model-based control,” 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033, 2012.

[71]    A. Raffin, A. Hill, R. Traoré, T. Lesort, N. Díaz-Rodríguez, and D. Filliat, “S-rl toolbox: Environments, datasets and evaluation metrics for state representation learning,” arXiv preprint arXiv:1809.09369, 2018.

[72]    J. Leike, M. Martic, V. Krakovna, P. A. Ortega, T. Everitt, A. Lefrancq, L. Orseau, and S. Legg, “AI safety gridworlds,” CoRR, vol. abs/1711.09883, 2017. [Online]. Available: http://arxiv.org/abs/1711.09883

[73]    Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. de Las Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, T. P. Lillicrap, and M. A. Riedmiller, “Deepmind control suite,” CoRR, vol. abs/1801.00690, 2018. [Online]. Available: http://arxiv.org/abs/1801.00690

[74]    A. Ray, J. Achiam, and D. Amodei, “Benchmarking Safe Exploration in Deep Reinforcement Learning,” 2019.

[75]    D. Huang, P. Dhariwal, D. Song, and I. Sutskever, “Gamepad: A learning environment for theorem proving,” CoRR, vol. abs/1806.00608, 2018. [Online]. Available: http://arxiv.org/abs/1806.00608

[76]    W. Koch, R. Mancuso, R. West, and A. Bestavros, “Reinforcement learning for uav attitude control,” ACM Transactions on Cyber-Physical Systems, vol. 3, no. 2, p. 22, 2019.

[77]    E. Z. Liu, K. Guu, P. Pasupat, T. Shi, and P. Liang, “Reinforcement learning on web interfaces using workflow-guided exploration,” CoRR, vol. abs/1802.08802, 2018. [Online]. Available: http://arxiv.org/abs/1802.08802

[78]    Z. Jiang, D. Xu, and J. Liang, “A deep reinforcement learning framework for the financial portfolio management problem,” 06 2017.

[79]    A. H. Miller, W. Feng, A. Fisch, J. Lu, D. Batra, A. Bordes, D. Parikh, and J. Weston, “Parlai: A dialog research software platform,” CoRR, vol. abs/1705.06476, 2017. [Online]. Available: http://arxiv.org/abs/1705.06476

[80]    M.-A. Côté, A. Kádár, X. Yuan, B. Kybartas, T. Barnes, E. Fine, J. Moore, M. Hausknecht, L. E. Asri, M. Adada, W. Tay, and A. Trischler, “Textworld: A learning environment for text-based games,” CoRR, vol. abs/1806.11532, 2018.

[81]    A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “CARLA: An open urban driving simulator,” in Proceedings of the 1st Annual Conference on Robot Learning, 2017, pp. 1–16.

[82]    B. Wymann, C. Dimitrakakis, A. D. Sumner, E. Espié, and C. Guionneau, “Torcs : The open racing car simulator,” 2015.

[83]    L. Graesser and W. Keng, Foundations of Deep Reinforcement Learning: Theory and Practice in Python, ser. Addison-Wesley Data & Analytics Series. Pearson Education, 2019. [Online]. Available: https://books.google.com/books?id=0HW7DwAAQBAJ

[84]    M. G. Bellemare, W. Dabney, and R. Munos, “A distributional perspective on reinforcement learning,” CoRR, vol. abs/1707.06887, 2017. [Online]. Available: http://arxiv.org/abs/1707.06887

[85]    H. van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double q-learning,” CoRR, vol. abs/1509.06461, 2015. [Online]. Available: http://arxiv.org/abs/1509.06461

[86]    V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A. Riedmiller, “Playing atari with deep reinforcement learning,” CoRR, vol. abs/1312.5602, 2013. [Online]. Available: http://arxiv.org/abs/1312.5602

[87]    Z. Wang, N. de Freitas, and M. Lanctot, “Dueling network architectures for deep reinforcement learning,” CoRR, vol. abs/1511.06581, 2015. [Online]. Available: http://arxiv.org/abs/1511.06581

[88]    G. Ostrovski, M. G. Bellemare, A. van den Oord, and R. Munos, “Count-based exploration with neural density models,” CoRR, vol. abs/1703.01310, 2017. [Online]. Available: http://arxiv.org/abs/1703.01310

[89]    S. Gu, T. P. Lillicrap, I. Sutskever, and S. Levine, “Continuous deep q-learning with model-based acceleration,” CoRR, vol. abs/1603.00748, 2016. [Online]. Available: http://arxiv.org/abs/1603.00748

[90]    A. Pritzel, B. Uria, S. Srinivasan, A. P. Badia, O. Vinyals, D. Hassabis, D. Wierstra, and C. Blundell, “Neural episodic control,” CoRR, vol. abs/1703.01988, 2017. [Online]. Available: http://arxiv.org/abs/1703.01988

[91]    V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” CoRR, vol. abs/1602.01783, 2016. [Online]. Available: http://arxiv.org/abs/1602.01783

[92]    M. G. Bellemare, G. Ostrovski, A. Guez, P. S. Thomas, and R. Munos, “Increasing the action gap: New operators for reinforcement learning,” CoRR, vol. abs/1512.04860, 2015. [Online]. Available: http://arxiv.org/abs/1512.04860

[93]    T. Rashid, M. Samvelyan, C. S. de Witt, G. Farquhar, J. N. Foerster, and S. Whiteson, “QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning,” CoRR, vol. abs/1803.11485, 2018. [Online]. Available: http://arxiv.org/abs/1803.11485

[94]    W. Dabney, M. Rowland, M. G. Bellemare, and R. Munos, “Distributional reinforcement learning with quantile regression,” CoRR, vol. abs/1710.10044, 2017. [Online]. Available: http://arxiv.org/abs/1710.10044

[95]    D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, and S. Levine, “Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation,” CoRR, vol. abs/1806.10293, 2018. [Online]. Available: http://arxiv.org/abs/1806.10293

[96]    M. Hessel, J. Modayil, H. van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver, “Rainbow: Combining improvements in deep reinforcement learning,” 2017.

[97]    V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” CoRR, vol. abs/1602.01783, 2016. [Online]. Available: http://arxiv.org/abs/1602.01783

[98]    Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. de Freitas, “Sample efficient actor-critic with experience replay,” CoRR, vol. abs/1611.01224, 2016. [Online]. Available: http://arxiv.org/abs/1611.01224

[99]    Y. Wu, E. Mansimov, S. Liao, R. B. Grosse, and J. Ba, “Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation,” CoRR, vol. abs/1708.05144, 2017. [Online]. Available: http://arxiv.org/abs/1708.05144

[100]    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” CoRR, vol. abs/1707.06347, 2017. [Online]. Available: http://arxiv.org/abs/1707.06347

[101]    G. Barth-Maron, M. W. Hoffman, D. Budden, W. Dabney, D. Horgan, D. TB, A. Muldal, N. Heess, and T. P. Lillicrap, “Distributed distributional deterministic policy gradients,” CoRR, vol. abs/1804.08617, 2018. [Online]. Available: http://arxiv.org/abs/1804.08617

[102]    D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. A. Riedmiller, “Deterministic policy gradient algorithms,” in ICML, 2014.

[103]    T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” 2015.

[104]    J. Schulman, P. Moritz, S. Levine, M. I. Jordan, and P. Abbeel, “High-dimensional continuous control using generalized advantage estimation,” CoRR, vol. abs/1506.02438, 2015. [Online]. Available: http://arxiv.org/abs/1506.02438

[105]    S. S. Gu, T. Lillicrap, R. E. Turner, Z. Ghahramani, B. Schölkopf, and S. Levine, “Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning,” in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017, pp. 3846–3855. [Online]. Available: http://papers.nips.cc/paper/6974-interpolated-policy-gradient-merging-on-policy-and-off-policy-gradient-estimation-for-deep-reinforcement-learning.pdf

[106]    R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” CoRR, vol. abs/1706.02275, 2017. [Online]. Available: http://arxiv.org/abs/1706.02275

[107]    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” CoRR, vol. abs/1707.06347, 2017. [Online]. Available: http://arxiv.org/abs/1707.06347

[108]    T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” CoRR, vol. abs/1801.01290, 2018. [Online]. Available: http://arxiv.org/abs/1801.01290

[109]    S. Fujimoto, H. van Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” CoRR, vol. abs/1802.09477, 2018. [Online]. Available: http://arxiv.org/abs/1802.09477

[110]    J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel, “Trust region policy optimization,” CoRR, vol. abs/1502.05477, 2015. [Online]. Available: http://arxiv.org/abs/1502.05477

[111]    D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. P. Lillicrap, K. Simonyan, and D. Hassabis, “Mastering chess and shogi by self-play with a general reinforcement learning algorithm,” CoRR, vol. abs/1712.01815, 2017. [Online]. Available: http://arxiv.org/abs/1712.01815

[112]    T. Anthony, Z. Tian, and D. Barber, “Thinking fast and slow with deep learning and tree search,” CoRR, vol. abs/1705.08439, 2017. [Online]. Available: http://arxiv.org/abs/1705.08439

[113]    T. Weber, S. Racanière, D. P. Reichert, L. Buesing, A. Guez, D. J. Rezende, A. P. Badia, O. Vinyals, N. Heess, Y. Li, R. Pascanu, P. Battaglia, D. Silver, and D. Wierstra, “Imagination-augmented agents for deep reinforcement learning,” CoRR, vol. abs/1707.06203, 2017. [Online]. Available: http://arxiv.org/abs/1707.06203

[114]    I. Clavera, J. Rothfuss, J. Schulman, Y. Fujita, T. Asfour, and P. Abbeel, “Model-based reinforcement learning via meta-policy optimization,” CoRR, vol. abs/1809.05214, 2018. [Online]. Available: http://arxiv.org/abs/1809.05214

[115]    A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine, “Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning,” CoRR, vol. abs/1708.02596, 2017. [Online]. Available: http://arxiv.org/abs/1708.02596

[116]    T. Kurutach, I. Clavera, Y. Duan, A. Tamar, and P. Abbeel, “Model-ensemble trust-region policy optimization,” CoRR, vol. abs/1802.10592, 2018. [Online]. Available: http://arxiv.org/abs/1802.10592

[117]    V. Feinberg, A. Wan, I. Stoica, M. I. Jordan, J. E. Gonzalez, and S. Levine, “Model-based value estimation for efficient model-free reinforcement learning,” CoRR, vol. abs/1803.00101, 2018. [Online]. Available: http://arxiv.org/abs/1803.00101

[118]    D. Ha and J. Schmidhuber, “Recurrent world models facilitate policy evolution,” CoRR, vol. abs/1809.01999, 2018. [Online]. Available: http://arxiv.org/abs/1809.01999

[119]    J. Buckman, D. Hafner, G. Tucker, E. Brevdo, and H. Lee, “Sample-efficient reinforcement learning with stochastic ensemble value expansion,” CoRR, vol. abs/1807.01675, 2018. [Online]. Available: http://arxiv.org/abs/1807.01675

[120]    A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu, “Feudal networks for hierarchical reinforcement learning,” CoRR, vol. abs/1703.01161, 2017. [Online]. Available: http://arxiv.org/abs/1703.01161

[121]    A. Levy, R. P. Jr., and K. Saenko, “Hierarchical actor-critic,” CoRR, vol. abs/1712.00948, 2017. [Online]. Available: http://arxiv.org/abs/1712.00948

[122]    O. Nachum, S. Gu, H. Lee, and S. Levine, “Data-efficient hierarchical reinforcement learning,” CoRR, vol. abs/1805.08296, 2018. [Online]. Available: http://arxiv.org/abs/1805.08296

[123]    A. Vezhnevets, V. Mnih, J. Agapiou, S. Osindero, A. Graves, O. Vinyals, and K. Kavukcuoglu, “Strategic attentive writer for learning macro-actions,” CoRR, vol. abs/1606.04695, 2016. [Online]. Available: http://arxiv.org/abs/1606.04695

[124]    F. Codevilla, M. Müller, A. Dosovitskiy, A. López, and V. Koltun, “End-to-end driving via conditional imitation learning,” CoRR, vol. abs/1710.02410, 2017. [Online]. Available: http://arxiv.org/abs/1710.02410

[125]    X. B. Peng, P. Abbeel, S. Levine, and M. van de Panne, “Deepmimic: Example-guided deep reinforcement learning of physics-based character skills,” CoRR, vol. abs/1804.02717, 2018. [Online]. Available: http://arxiv.org/abs/1804.02717

[126]    J. Ho and S. Ermon, “Generative adversarial imitation learning,” CoRR, vol. abs/1606.03476, 2016. [Online]. Available: http://arxiv.org/abs/1606.03476

[127]    C. Finn, S. Levine, and P. Abbeel, “Guided cost learning: Deep inverse optimal control via policy optimization,” CoRR, vol. abs/1603.00448, 2016. [Online]. Available: http://arxiv.org/abs/1603.00448

[128]    Q. Wang, J. Xiong, L. Han, P. Sun, H. Liu, and T. Zhang, “Exponentially weighted imitation learning for batched historical data,” in Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds. Curran Associates, Inc., 2018, pp. 6288–6297. [Online]. Available: http://papers.nips.cc/paper/7866-exponentially-weighted-imitation-learning-for-batched-historical-data.pdf

[129]    T. L. Paine, S. G. Colmenarejo, Z. Wang, S. E. Reed, Y. Aytar, T. Pfaff, M. W. Hoffman, G. Barth-Maron, S. Cabi, D. Budden, and N. de Freitas, “One-shot high-fidelity imitation: Training large-scale deep nets with RL,” CoRR, vol. abs/1810.05017, 2018. [Online]. Available: http://arxiv.org/abs/1810.05017

[130]    J. Oh, Y. Guo, S. Singh, and H. Lee, “Self-imitation learning,” CoRR, vol. abs/1806.05635, 2018. [Online]. Available: http://arxiv.org/abs/1806.05635

[131]    X. B. Peng, A. Kanazawa, S. Toyer, P. Abbeel, and S. Levine, “Variational discriminator bottleneck: Improving imitation learning, inverse rl, and gans by constraining information flow,” CoRR, vol. abs/1810.00821, 2018. [Online]. Available: http://arxiv.org/abs/1810.00821

[132]    S. Cabi, S. G. Colmenarejo, M. W. Hoffman, M. Denil, Z. Wang, and N. de Freitas, “The intentional unintentional agent: Learning to solve many continuous control tasks simultaneously,” CoRR, vol. abs/1707.03300, 2017. [Online]. Available: http://arxiv.org/abs/1707.03300

[133]    M. Wulfmeier, I. Posner, and P. Abbeel, “Mutual alignment transfer learning,” CoRR, vol. abs/1707.07907, 2017. [Online]. Available: http://arxiv.org/abs/1707.07907

[134]    C. Fernando, D. Banarse, C. Blundell, Y. Zwols, D. Ha, A. A. Rusu, A. Pritzel, and D. Wierstra, “Pathnet: Evolution channels gradient descent in super neural networks,” CoRR, vol. abs/1701.08734, 2017. [Online]. Available: http://arxiv.org/abs/1701.08734

[135]    A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell, “Progressive neural networks,” CoRR, vol. abs/1606.04671, 2016. [Online]. Available: http://arxiv.org/abs/1606.04671

[136]    M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu, “Reinforcement learning with unsupervised auxiliary tasks,” CoRR, vol. abs/1611.05397, 2016. [Online]. Available: http://arxiv.org/abs/1611.05397

[137]    T. Schaul, D. Horgan, K. Gregor, and D. Silver, “Universal value function approximators,” in Proceedings of the 32nd International Conference on Machine Learning – Volume 37, ser. ICML’15. JMLR.org, 2015, pp. 1312–1320. [Online]. Available: http://dl.acm.org/citation.cfm?id=3045118.3045258

[138]    C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” CoRR, vol. abs/1703.03400, 2017. [Online]. Available: http://arxiv.org/abs/1703.03400

[139]    Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel, “Rl$^2$: Fast reinforcement learning via slow reinforcement learning,” CoRR, vol. abs/1611.02779, 2016. [Online]. Available: http://arxiv.org/abs/1611.02779

[140]    N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel, “A simple neural attentive meta-learner,” CoRR, vol. abs/1707.03141, 2017. [Online]. Available: http://arxiv.org/abs/1707.03141

[141]    M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba, “Hindsight experience replay,” CoRR, vol. abs/1707.01495, 2017. [Online]. Available: http://arxiv.org/abs/1707.01495

[142]    G. Wayne, C. Hung, D. Amos, M. Mirza, A. Ahuja, A. Grabska-Barwinska, J. W. Rae, P. Mirowski, J. Z. Leibo, A. Santoro, M. Gemici, M. Reynolds, T. Harley, J. Abramson, S. Mohamed, D. J. Rezende, D. Saxton, A. Cain, C. Hillier, D. Silver, K. Kavukcuoglu, M. Botvinick, D. Hassabis, and T. P. Lillicrap, “Unsupervised predictive memory in a goal-directed agent,” CoRR, vol. abs/1803.10760, 2018. [Online]. Available: http://arxiv.org/abs/1803.10760

[143]    C. Blundell, B. Uria, A. Pritzel, Y. Li, A. Ruderman, J. Z. Leibo, J. W. Rae, D. Wierstra, and D. Hassabis, “Model-free episodic control,” CoRR, vol. abs/1606.04460, 2016. [Online]. Available: http://arxiv.org/abs/1606.04460

[144]    E. Parisotto and R. Salakhutdinov, “Neural map: Structured memory for deep reinforcement learning,” CoRR, vol. abs/1702.08360, 2017. [Online]. Available: http://arxiv.org/abs/1702.08360

[145]    T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” CoRR, vol. abs/1511.05952, 2015. [Online]. Available: http://arxiv.org/abs/1511.05952

[146]    A. Santoro, R. Faulkner, D. Raposo, J. W. Rae, M. Chrzanowski, T. Weber, D. Wierstra, O. Vinyals, R. Pascanu, and T. P. Lillicrap, “Relational recurrent neural networks,” CoRR, vol. abs/1806.01822, 2018. [Online]. Available: http://arxiv.org/abs/1806.01822

[147]    I. Osband, C. Blundell, A. Pritzel, and B. V. Roy, “Deep exploration via bootstrapped DQN,” CoRR, vol. abs/1602.04621, 2016. [Online]. Available: http://arxiv.org/abs/1602.04621

[148]    M. G. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos, “Unifying count-based exploration and intrinsic motivation,” CoRR, vol. abs/1606.01868, 2016. [Online]. Available: http://arxiv.org/abs/1606.01868

[149]    B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine, “Diversity is all you need: Learning skills without a reward function,” CoRR, vol. abs/1802.06070, 2018. [Online]. Available: http://arxiv.org/abs/1802.06070

[150]    J. Fu, J. D. Co-Reyes, and S. Levine, “EX2: exploration with exemplar models for deep reinforcement learning,” CoRR, vol. abs/1703.01260, 2017. [Online]. Available: http://arxiv.org/abs/1703.01260

[151]    H. Tang, R. Houthooft, D. Foote, A. Stooke, X. Chen, Y. Duan, J. Schulman, F. D. Turck, and P. Abbeel, “#exploration: A study of count-based exploration for deep reinforcement learning,” CoRR, vol. abs/1611.04717, 2016. [Online]. Available: http://arxiv.org/abs/1611.04717

[152]    D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, “Curiosity-driven exploration by self-supervised prediction,” 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Jul 2017. [Online]. Available: http://dx.doi.org/10.1109/CVPRW.2017.70

[153]    M. Fortunato, M. G. Azar, B. Piot, J. Menick, I. Osband, A. Graves, V. Mnih, R. Munos, D. Hassabis, O. Pietquin, C. Blundell, and S. Legg, “Noisy networks for exploration,” CoRR, vol. abs/1706.10295, 2017. [Online]. Available: http://arxiv.org/abs/1706.10295

[154]    G. Ostrovski, M. G. Bellemare, A. van den Oord, and R. Munos, “Count-based exploration with neural density models,” CoRR, vol. abs/1703.01310, 2017. [Online]. Available: http://arxiv.org/abs/1703.01310

[155]    Y. Burda, H. Edwards, A. J. Storkey, and O. Klimov, “Exploration by random network distillation,” CoRR, vol. abs/1810.12894, 2018. [Online]. Available: http://arxiv.org/abs/1810.12894

[156]    R. Y. Chen, S. Sidor, P. Abbeel, and J. Schulman, “UCB and infogain exploration via $\boldsymbol{Q}$-ensembles,” CoRR, vol. abs/1706.01502, 2017. [Online]. Available: http://arxiv.org/abs/1706.01502

[157]    J. Achiam, H. Edwards, D. Amodei, and P. Abbeel, “Variational option discovery algorithms,” CoRR, vol. abs/1807.10299, 2018. [Online]. Available: http://arxiv.org/abs/1807.10299

[158]    K. Gregor, D. J. Rezende, and D. Wierstra, “Variational intrinsic control,” CoRR, vol. abs/1611.07507, 2016. [Online]. Available: http://arxiv.org/abs/1611.07507

[159]    R. Houthooft, X. Chen, Y. Duan, J. Schulman, F. D. Turck, and P. Abbeel, “Curiosity-driven exploration in deep reinforcement learning via bayesian neural networks,” CoRR, vol. abs/1605.09674, 2016. [Online]. Available: http://arxiv.org/abs/1605.09674

[160]    D. Horgan, J. Quan, D. Budden, G. Barth-Maron, M. Hessel, H. van Hasselt, and D. Silver, “Distributed prioritized experience replay,” CoRR, vol. abs/1803.00933, 2018. [Online]. Available: http://arxiv.org/abs/1803.00933

[161]    A. Nair, P. Srinivasan, S. Blackwell, C. Alcicek, R. Fearon, A. D. Maria, V. Panneershelvam, M. Suleyman, C. Beattie, S. Petersen, S. Legg, V. Mnih, K. Kavukcuoglu, and D. Silver, “Massively parallel methods for deep reinforcement learning,” CoRR, vol. abs/1507.04296, 2015. [Online]. Available: http://arxiv.org/abs/1507.04296

[162]    L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu, “IMPALA: scalable distributed deep-rl with importance weighted actor-learner architectures,” CoRR, vol. abs/1802.01561, 2018. [Online]. Available: http://arxiv.org/abs/1802.01561

[163]    S. Kapturowski, G. Ostrovski, W. Dabney, J. Quan, and R. Munos, “Recurrent experience replay in distributed reinforcement learning,” in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=r1lyTjAqYX

[164]    J. Achiam, D. Held, A. Tamar, and P. Abbeel, “Constrained policy optimization,” CoRR, vol. abs/1705.10528, 2017. [Online]. Available: http://arxiv.org/abs/1705.10528

[165]    G. Dalal, K. Dvijotham, M. Vecerik, T. Hester, C. Paduraru, and Y. Tassa, “Safe exploration in continuous action spaces,” CoRR, vol. abs/1801.08757, 2018. [Online]. Available: http://arxiv.org/abs/1801.08757

[166]    W. Saunders, G. Sastry, A. Stuhlmüller, and O. Evans, “Trial without error: Towards safe reinforcement learning via human intervention,” CoRR, vol. abs/1707.05173, 2017. [Online]. Available: http://arxiv.org/abs/1707.05173

[167]    B. Eysenbach, S. Gu, J. Ibarz, and S. Levine, “Leave no trace: Learning to reset for safe and autonomous reinforcement learning,” CoRR, vol. abs/1711.06782, 2017. [Online]. Available: http://arxiv.org/abs/1711.06782

[168]    P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learning from human preferences,” in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017, pp. 4299–4307. [Online]. Available: http://papers.nips.cc/paper/7017-deep-reinforcement-learning-from-human-preferences.pdf

[169]    N. Hansen and A. Ostermeier, “Adapting arbitrary normal mutation distributions in evolution strategies: The covariance matrix adaptation,” 06 1996, pp. 312 – 317.

[170]    T. Salimans, J. Ho, X. Chen, and I. Sutskever, “Evolution strategies as a scalable alternative to reinforcement learning,” CoRR, vol. abs/1703.03864, 2017. [Online]. Available: http://arxiv.org/abs/1703.03864

[171]    P. H. Jin, S. Levine, and K. Keutzer, “Regret minimization for partially observable deep reinforcement learning,” CoRR, vol. abs/1710.11424, 2017. [Online]. Available: http://arxiv.org/abs/1710.11424

[172]    H. Mania, A. Guy, and B. Recht, “Simple random search provides a competitive approach to reinforcement learning,” CoRR, vol. abs/1803.07055, 2018. [Online]. Available: http://arxiv.org/abs/1803.07055

[173]    R. Gibson, N. Burch, M. Lanctot, and D. Szafron, “Efficient monte carlo counterfactual regret minimization in games with many player actions,” vol. 3, 12 2012.

[174]    I. Adamski, R. Adamski, T. Grel, A. Jedrych, K. Kaczmarek, and H. Michalewski, “Distributed deep reinforcement learning: Learn how to play atari games in 21 minutes,” CoRR, vol. abs/1801.02852, 2018. [Online]. Available: http://arxiv.org/abs/1801.02852

[175]    D. Hafner, J. Davidson, and V. Vanhoucke, “Tensorflow agents: Efficient batched reinforcement learning in tensorflow,” CoRR, vol. abs/1709.02878, 2017. [Online]. Available: http://arxiv.org/abs/1709.02878

[176]    F. Torabi, G. Warnell, and P. Stone, “Behavioral cloning from observation,” CoRR, vol. abs/1805.01954, 2018. [Online]. Available: http://arxiv.org/abs/1805.01954

[177]    A. Tavakoli, F. Pardo, and P. Kormushev, “Action branching architectures for deep reinforcement learning,” CoRR, vol. abs/1711.08946, 2017. [Online]. Available: http://arxiv.org/abs/1711.08946

[178]    L. Buesing, T. Weber, Y. Zwols, N. Heess, S. Racaniere, A. Guez, and J.-B. Lespiau, “Woulda, coulda, shoulda: Counterfactually-guided policy search,” in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=BJG0voC9YQ

[179]    J. N. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson, “Counterfactual multi-agent policy gradients,” CoRR, vol. abs/1705.08926, 2017. [Online]. Available: http://arxiv.org/abs/1705.08926

[180]    V. François-Lavet, Y. Bengio, D. Precup, and J. Pineau, “Combined reinforcement learning via abstract representations,” CoRR, vol. abs/1809.04506, 2018. [Online]. Available: http://arxiv.org/abs/1809.04506

[181]    J. N. Foerster, Y. M. Assael, N. de Freitas, and S. Whiteson, “Learning to communicate to solve riddles with deep distributed recurrent q-networks,” CoRR, vol. abs/1602.02672, 2016. [Online]. Available: http://arxiv.org/abs/1602.02672

[182]    A. Dosovitskiy and V. Koltun, “Learning to act by predicting the future,” CoRR, vol. abs/1611.01779, 2016. [Online]. Available: http://arxiv.org/abs/1611.01779

[183]    M. G. Azar and H. J. Kappen, “Dynamic policy programming,” CoRR, vol. abs/1004.2027, 2010. [Online]. Available: http://arxiv.org/abs/1004.2027

[184]    T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan, J. Quan, A. Sendonaris, G. Dulac-Arnold, I. Osband, J. Agapiou, J. Z. Leibo, and A. Gruslys, “Deep q-learning from demonstrations,” 2017.

[185]    M. J. Hausknecht and P. Stone, “Deep recurrent q-learning for partially observable mdps,” CoRR, vol. abs/1507.06527, 2015. [Online]. Available: http://arxiv.org/abs/1507.06527

[186]    H. R. Maei, C. Szepesvári, S. Bhatnagar, and R. S. Sutton, “Toward off-policy learning control with function approximation,” in Proceedings of the 27th International Conference on Machine Learning, ser. ICML’10. Madison, WI, USA: Omnipress, 2010, pp. 719–726.

[187]    A. Tampuu, T. Matiisen, D. Kodelja, I. Kuzovkin, K. Korjus, J. Aru, J. Aru, and R. Vicente, “Multiagent cooperation and competition with deep reinforcement learning,” CoRR, vol. abs/1511.08779, 2015. [Online]. Available: http://arxiv.org/abs/1511.08779

[188]    W. Dabney, G. Ostrovski, D. Silver, and R. Munos, “Implicit quantile networks for distributional reinforcement learning,” CoRR, vol. abs/1806.06923, 2018. [Online]. Available: http://arxiv.org/abs/1806.06923

[189]    J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel, T. Lillicrap, and D. Silver, “Mastering atari, go, chess and shogi by planning with a learned model,” 2019.

[190]    P. Bacon, J. Harb, and D. Precup, “The option-critic architecture,” CoRR, vol. abs/1609.05140, 2016. [Online]. Available: http://arxiv.org/abs/1609.05140

[191]    O. Nachum, M. Norouzi, K. Xu, and D. Schuurmans, “Bridging the gap between value and policy based reinforcement learning,” CoRR, vol. abs/1702.08892, 2017. [Online]. Available: http://arxiv.org/abs/1702.08892

[192]    K. Chua, R. Calandra, R. McAllister, and S. Levine, “Deep reinforcement learning in a handful of trials using probabilistic dynamics models,” CoRR, vol. abs/1805.12114, 2018. [Online]. Available: http://arxiv.org/abs/1805.12114

[193]    B. O’Donoghue, R. Munos, K. Kavukcuoglu, and V. Mnih, “PGQ: combining policy gradient and q-learning,” CoRR, vol. abs/1611.01626, 2016. [Online]. Available: http://arxiv.org/abs/1611.01626

[194]    R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine Learning, vol. 8, no. 3, pp. 229–256, May 1992. [Online]. Available: https://doi.org/10.1007/BF00992696

[195]    G. Chen, Y. Peng, and M. Zhang, “An adaptive clipping approach for proximal policy optimization,” CoRR, vol. abs/1804.06461, 2018. [Online]. Available: http://arxiv.org/abs/1804.06461

[196]    S. Gu, T. P. Lillicrap, Z. Ghahramani, R. E. Turner, and S. Levine, “Q-prop: Sample-efficient policy gradient with an off-policy critic,” CoRR, vol. abs/1611.02247, 2016. [Online]. Available: http://arxiv.org/abs/1611.02247

[197]    K. Son, D. Kim, W. J. Kang, D. E. Hostallero, and Y. Yi, “Qtran: Learning to factorize with transformation for cooperative multi-agent reinforcement learning,” 2019.

[198]    A. Gruslys, M. G. Azar, M. G. Bellemare, and R. Munos, “The reactor: A sample-efficient actor-critic architecture,” CoRR, vol. abs/1704.04651, 2017. [Online]. Available: http://arxiv.org/abs/1704.04651

[199]    H. Liu, Y. Feng, Y. Mao, D. Zhou, J. Peng, and Q. Liu, “Action-depedent control variates for policy optimization via stein’s identity,” 2017.

[200]    O. Nachum, M. Norouzi, K. Xu, and D. Schuurmans, “Trust-pcl: An off-policy trust region method for continuous control,” CoRR, vol. abs/1707.01891, 2017. [Online]. Available: http://arxiv.org/abs/1707.01891

[201]    P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V. F. Zambaldi, M. Jaderberg, M. Lanctot, N. Sonnerat, J. Z. Leibo, K. Tuyls, and T. Graepel, “Value-decomposition networks for cooperative multi-agent learning,” CoRR, vol. abs/1706.05296, 2017. [Online]. Available: http://arxiv.org/abs/1706.05296

[202]    G. Dulac-Arnold, R. Evans, P. Sunehag, and B. Coppin, “Reinforcement learning in large discrete action spaces,” CoRR, vol. abs/1512.07679, 2015. [Online]. Available: http://arxiv.org/abs/1512.07679

