Reinforcement Learning Frameworks – An Overview

Abstract

Reinforcement Learning (RL) has seen renewed interest, sparked by the successful combination of RL with neural models as well as Monte-Carlo Tree Search (MCTS). At first, this development was largely restricted to traditional games and video games, but RL is increasingly used in industry as well, from robotics and autonomous cars to datacenter and warehouse optimization. This survey looks at frameworks for RL which, unlike their deep learning counterparts, have not yet seen a consolidation out of which only a few winners emerge. It compares them in terms of focus and functionality and arrives at recommendations for future developments.

I. Introduction

Reinforcement learning (RL) is, together with supervised and unsupervised learning, one of the three pillars of machine learning. It models how an agent interacts with an environment in order to maximize a cumulative reward. While early RL applications like Gerald Tesauro’s TD-Gammon already used RL algorithms (TD-lambda) in conjunction with a neural network for playing backgammon, RL has seen renewed interest in recent times. This was driven in particular by the combination of RL with two other techniques. The first is deep learning (DL) for function approximation, e.g. using Deep Q Networks (DQNs) to play Atari games. For games like Go or real-world scenarios like autonomous driving the state space can become prohibitively large, but approximating values or policies via DL can alleviate this. The second is Monte-Carlo Tree Search (MCTS) combined with self-play, especially in the Alpha systems, which achieved superhuman performance first in Go and then in chess and shogi as well as StarCraft II before being applied to protein folding. While games are a prevalent application of RL, other applications include autonomous driving, resource allocation, data center cooling, business process optimization, neural architecture search, machine reading, dialog management, fleet logistics and omni-channel marketing.

Katsunari Shibata [1] pointed out that many traits naturally emerge in end-to-end RL, including attention, exploratory behavior, memory and knowledge transfer. DeepMind’s IMPALA scales from one to thousands of nodes and also allows for advanced techniques like multi-task learning rather than restarting from scratch for every task. Other techniques RL can benefit from include imitation learning to mimic expert behavior and curriculum learning, which eases the learning of complex behavior by "didactically" offering incrementally difficult tasks. For instance, it is challenging for an agent to learn to pull a chair to a wall in order to jump over it, whereas first learning to directly jump over a low wall followed by pulling up the chair converges much faster. RL has also been shown to generalize reasonably well within a domain, e.g. by playing over 50 different Atari games with one model. While most of RL focuses on individual agents, the field of multi-agent reinforcement learning (MARL) comprises multiple agents in cooperative or competitive settings. These characteristics motivate the use of frameworks and platforms that support researchers and engineers with reusable componentry, handle system aspects and cover the plethora of techniques and scenarios RL can be used with. Interestingly, many researchers report primarily using established DL frameworks like PyTorch or Tensorflow rather than specialized RL frameworks. This survey is intended to present a wide overview of RL frameworks along with their value propositions in order to provide researchers and engineers with a more comprehensive set of tools. Finally, this survey focuses on Deep RL (DRL) – old-fashioned RL frameworks like BURLAP are outside its scope.

II. RL Frameworks

A. Terminology and Taxonomy of RL Frameworks

An RL framework is a piece of software that provides the foundational componentry to build RL applications or experiments including the core algorithms, exploration strategies, replay buffers, pretrained agents and environment interfaces. An RL environment is a system designed to be interacted with by one or many RL agents, usually via taking actions and observing state changes, reward, additional information and termination. We compare RL frameworks based on programming language, type of supported algorithms, ties to other frameworks, scalability, single or multi agent RL, maturity and popularity. We distinguish three main classes of RL frameworks: Traditional ones focusing on model-free approaches, comprehensive frameworks that cover additional approaches like evolutionary strategies or model-based RL and finally frameworks that exclusively focus on MARL.

B. Traditional Model-Free RL Frameworks

The largest class consists of DRL frameworks which focus primarily on model-free RL, a well-established class of algorithms. Baselines, for instance, is a collection of RL baseline implementations that was started by OpenAI, but it is in maintenance status and has been criticized for having suboptimal documentation and modularization. There is a popular fork out of INRIA, Stable Baselines, that improves over the original version in several regards like better documentation, test coverage, Tensorboard and Jupyter notebook support, a common interface inspired by scikit-learn as well as custom policy and callback support (e.g. for monitoring). It also provides more algorithms and a zoo of over 100 pre-trained agents.

Tensorflow is the basis for the largest number of DRL frameworks. Dopamine [2] aims at rapid prototyping of research ideas, but is restricted to value-based RL. It currently focuses on Rainbow extensions like n-step Bellman updates, prioritized experience replay (PER) and distributional extensions in an Atari environment (Gym wrapper for ALE), but also aims at reproducibility. There are currently four algorithms available: a basic DQN, Rainbow, C51 as a special parameterization of Rainbow, and IQN. It comes with Tensorboard support and is configured via the Gin framework. DeepMind released TensorFlow Reinforcement Learning (TRFL, pronounced "truffle"), which is a collection of mathematical primitives for reinforcement learning. While useful for research, it is much more technical and fine-grained than the other frameworks discussed here. For instance, it contains a TD(0) learning loss, an Expected SARSA (SARSE) loss and V-trace actor-critic targets. Several of its functions (e.g. Distributional Double Q-Learning) are tied to Tensorflow. Huskarl [3] is based on Tensorflow 2.0 and Keras. It has aspirations to add Unity environments, curiosity-driven exploration and MARL, but currently primarily works with traditional algorithms against Gym environments. It has basic parallelization support, i.e. it can run multiple environment instances in parallel on several CPU cores, but seems to lack more sophisticated approaches. Keras-RL [4] is a Keras-based DRL framework. While very popular according to GitHub stars, it does not seem too well modularized and has minor downsides regarding updates, documentation and reliability – at the time of writing its builds are failing. It supports Weights & Biases to plot its metrics, and while it seems to have Tensorboard support, the status of that support is unclear. TF-Agents [5] has more mature documentation than Dopamine and TRFL and comes with multi-armed bandit agents and environments. Like Dopamine it uses gin-config, and it can leverage TF-Eager for debugging. It also allows for training on multiple instances in parallel via its ParallelPyEnvironment. Tensorflow 2 Reinforcement Learning (tf2rl) is a very young framework that supports model-free RL as well as imitation learning. simple_rl [6] was inspired by BURLAP and focuses on simplicity: Its fundamental concepts are agents (including reference implementations) and environments (based on an MDP class). The latter can be a GymMDP for OpenAI Gym, object-oriented MDPs (OOMDP), a k-armed bandit, a POMDP, a probability distribution over MDPs or a Markov Game, cf. [6]. The components hide the complexity needed to track experiments and visualize results. Every experiment run creates a file with the exact parameters used that can be leveraged for reproducibility. However, the implementation is still a work in progress and currently limited to pure MDPs. Finally, simple_rl comes with a set of utilities like a planning module that includes value iteration, MCTS, bounded RTDP and a sparse sampling algorithm.
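To make the notion of a "mathematical primitive" such as a TD(0) learning loss more concrete, the following is a minimal sketch in plain Python. It is not TRFL's API: the function name `td0_loss` and its argument names are assumptions chosen for illustration, and a real implementation would operate on tensors and stop gradients through the target.

```python
from typing import List, Sequence, Tuple


def td0_loss(
    v_tm1: Sequence[float],
    r_t: Sequence[float],
    discount_t: Sequence[float],
    v_t: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Return per-transition TD(0) errors and squared-error losses.

    target   = r_t + discount_t * v_t   (treated as a constant)
    td_error = target - v_tm1
    loss     = 0.5 * td_error ** 2
    """
    td_errors = [
        r + g * v_next - v
        for v, r, g, v_next in zip(v_tm1, r_t, discount_t, v_t)
    ]
    losses = [0.5 * e * e for e in td_errors]
    return td_errors, losses


# Two hypothetical transitions; a discount of 0 marks a terminal transition.
td_errors, losses = td0_loss(
    v_tm1=[1.0, 0.5],
    r_t=[1.0, 0.0],
    discount_t=[0.5, 0.0],
    v_t=[2.0, 3.0],
)
```

Passing the discount per transition, rather than as a single constant, lets terminal transitions drop the bootstrap term without a separate code path.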

Tensorforce [7, 8] originated at the University of Cambridge and strives to be language agnostic by keeping the RL logic in TF computation graphs. Moreover, it strictly separates abstract RL algorithms from domain-specific information like concrete input and output structure to keep them universally applicable. It comes with a plethora of components including different memory types, optimization algorithms, training strategies as well as adapters for environments like ALE, Gym and Retro. In [8] the authors criticize that some open source RL frameworks rely on fixed neural network architectures and may internally apply heuristics to reduce complexity without making this properly transparent to the user. Avoiding this is another goal of the framework. Tensorforce is the backend of LIFT, an end-to-end stack for DRL.

Similarly, there are many PyTorch-based frameworks for model-free RL. Catalyst.RL [9] is part of a larger ecosystem ("Catalyst.Ecosystem") which also includes Alchemy (experiment logging and visualization), MLcomp (DAG-based ML pipelines with web GUI) and Reaction (model serving). It has support for distributed training and stores all parameters in yaml configuration files for reproducibility. Furthermore, it leverages TensorboardX to visualize metrics. On-policy algorithms and off-policy algorithms for discrete control settings are still absent from the library. Lagom [10] covers both model-free RL and evolutionary strategies. It supports basic parallelization and can perform hyperparameter optimization via both grid search and random search. Furthermore, it has some basic visualization support. While most RL frameworks do not incorporate all three paradigms of model-free RL, i.e. Q-learning, policy gradient and Q-value policy gradient approaches, rlpyt [11] is focused on supporting all three in a common framework and targets small to medium scale research projects. Its main components are a collection of modular RL componentry as well as infrastructure for parallel execution. It uses PyTorch’s solutions for multi-GPU (NCCL) and multi-CPU optimization (gloo). It does not support fully asynchronous optimization schemes, but can run the sampler and optimizer asynchronously via a replay buffer. Historically, rlpyt originated at Berkeley and is based on accel_rl which in turn was strongly influenced by rllab (now garage). SLM-Lab provides modularized versions of many RL algorithms, especially model-free ones, but is unique in that it uses class inheritance to represent which algorithms followed each other in research. For instance, SARSA influenced DQN on which Double DQN and Double DQN with PER were based, so this is also the inheritance structure in SLM-Lab. SURREAL [12] out of Stanford focuses on robotics as well as distributed RL training and simulation. According to its repository it can "scale to thousands of CPUs and hundreds of GPUs". It combines on-policy (PPO) and off-policy (DDPG) approaches by following an actor model for producing experience data in parallel with a centralized buffer and model learner (with multi-GPU capabilities). For on-policy training this buffer can just be a FIFO queue from which the learner trains directly, whereas for off-policy learning it is a replay memory that allows for batch sampling of collected experience. In order to support reproducibility as well as scalability SURREAL distinguishes four infrastructure layers: Its provisioner is used to provision cloud resources, the Kubernetes-based orchestrator provides the API, the SURREAL protocol handles communication and the algorithms are the actual RL implementations. Its direct target is the SURREAL Robotics Suite with tasks such as block lifting and stacking as well as nut-and-peg assembly.
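The difference between the two buffer types described for SURREAL can be sketched in a few lines of Python. This is an illustration of the general idea only and not SURREAL's implementation; the class names `FifoBuffer` and `ReplayBuffer` and their methods are assumptions.

```python
import random
from collections import deque


class FifoBuffer:
    """On-policy: experience is consumed once, in arrival order."""

    def __init__(self):
        self._queue = deque()

    def add(self, transition):
        self._queue.append(transition)

    def next_batch(self, batch_size):
        n = min(batch_size, len(self._queue))
        return [self._queue.popleft() for _ in range(n)]

    def __len__(self):
        return len(self._queue)


class ReplayBuffer:
    """Off-policy: bounded memory, sampled uniformly; items remain reusable."""

    def __init__(self, capacity, seed=None):
        self._memory = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, transition):
        self._memory.append(transition)  # the oldest item is evicted when full

    def next_batch(self, batch_size):
        n = min(batch_size, len(self._memory))
        return self._rng.sample(list(self._memory), n)

    def __len__(self):
        return len(self._memory)


fifo, replay = FifoBuffer(), ReplayBuffer(capacity=3, seed=0)
for step in range(5):  # integers stand in for real transitions
    fifo.add(step)
    replay.add(step)

fifo_batch = fifo.next_batch(2)      # the two oldest items, removed from the queue
replay_batch = replay.next_batch(2)  # two of the three newest items, kept in memory
```

The FIFO variant discards data after one use because on-policy updates assume samples from the current policy, whereas the replay variant trades freshness for sample reuse.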

Facebook ReAgent [13] is targeted at production, e.g. to optimize streaming ABR for 360 Video, for M suggestions in Messenger and to maximize notification relevance. Its training is executed in PyTorch (including distribution), whereas serving is done in Caffe2 via ONNX. With large datasets and slow feedback loops ReAgent is forced to choose different approaches than pure research frameworks. For instance, data preprocessing can be run via Spark and there is a feature classification mechanism that distinguishes feature types (binary, probability, continuous, enum, quantile or boxcox) to derive how to normalize them during training. Furthermore, ReAgent employs Counterfactual Policy Evaluation (CPE) to estimate agent performance offline and thus avoid extensive A/B testing (even though it can be combined with it) and degradation of user experience. Conventional RL algorithms benefit from shuffling in order to obtain pseudo-i.i.d. data, whereas CPE requires cumulative, step-wise data. ReAgent addresses this conflict by sampling data during training and sorting it at the end of each epoch to retrieve the original sequence before conducting CPE. All of the supported algorithms are off-policy and thus do not require exploration at runtime and can hence wait days for a reward signal. ReAgent currently supports various types of DQN as well as DDPG and SAC.
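The interplay of shuffled training data and ordered evaluation data can be illustrated with a small sketch. It is not ReAgent's code or data schema: the field names `episode`, `step` and `reward` are assumptions, and the training step itself is left out.

```python
import random

# Hypothetical log: every transition carries its episode id and step index.
logged = [
    {"episode": e, "step": t, "reward": float(t)}
    for e in range(2)
    for t in range(3)
]

rng = random.Random(0)
training_order = logged[:]   # shallow copy so the log itself stays untouched
rng.shuffle(training_order)  # pseudo-i.i.d. order for gradient updates

# ... training on `training_order` would happen here ...

# Restore the step-wise sequence that counterfactual policy evaluation needs.
cpe_order = sorted(training_order, key=lambda tr: (tr["episode"], tr["step"]))
```

Sorting by an explicit (episode, step) key is what makes the shuffle reversible, so the same records can serve both purposes within one epoch.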

Some RL frameworks are built upon other DL frameworks: ChainerRL [14] is based on Chainer. It delivers a wide selection of algorithms, exploration techniques, neural network architectures, replay buffers, distributions and action values spanning different training approaches, i.e. serial as well as synchronous and asynchronous parallel training, and comes with a dedicated visualization framework called ChainerRL-Visualizer. For reproducibility it provides single-file implementations of papers which have been verified to authentically reproduce the published results. PaddlePaddle Reinforcement Learning (PARL) seems similar to Dopamine in that it focuses on only a few RL algorithms – DQN, DDPG and PPO. It has three basic abstractions: a model that is the policy or critic network, an algorithm that updates the model’s parameters and finally an agent component that connects the algorithm with an environment. Intervention Aided Reinforcement Learning (IARL) seems prepared, but was still absent at the time of writing.
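The three-way split into model, algorithm and agent described for PARL can be illustrated with a toy example. The sketch below is not PARL's API; all class and method names are assumptions, the "network" is just a list of value estimates, and the environment is a hypothetical deterministic two-armed bandit.

```python
import random


class Model:
    """Holds the parameters; here one value estimate per action."""

    def __init__(self, n_actions):
        self.values = [0.0] * n_actions

    def predict(self):
        return list(self.values)


class Algorithm:
    """Defines how the model's parameters are updated."""

    def __init__(self, model, learning_rate):
        self.model = model
        self.learning_rate = learning_rate

    def learn(self, action, reward):
        error = reward - self.model.values[action]
        self.model.values[action] += self.learning_rate * error


class Agent:
    """Connects the algorithm to an environment: acts, observes, learns."""

    def __init__(self, algorithm, epsilon, seed=None):
        self.algorithm = algorithm
        self.epsilon = epsilon
        self._rng = random.Random(seed)

    def act(self):
        values = self.algorithm.model.predict()
        if self._rng.random() < self.epsilon:  # explore
            return self._rng.randrange(len(values))
        return max(range(len(values)), key=values.__getitem__)  # exploit

    def run(self, reward_fn, steps):
        for _ in range(steps):
            action = self.act()
            self.algorithm.learn(action, reward_fn(action))


# Toy environment: arm 1 always pays more than arm 0.
agent = Agent(Algorithm(Model(n_actions=2), learning_rate=0.5), epsilon=0.2, seed=0)
agent.run(reward_fn=lambda a: 1.0 if a == 1 else 0.2, steps=200)
best_action = max(range(2), key=agent.algorithm.model.values.__getitem__)
```

Keeping the update rule outside the model means the same model could be paired with a different algorithm, which is the main benefit such a separation aims for.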

However, it is not clear that depending on any particular DL framework is desirable. Some frameworks strive for agnosticism instead: Garage [15], the successor of rllab, originated at Berkeley and OpenAI and supports both PyTorch and Tensorflow. Besides the typical RL components like algorithms, replay buffers and samplers it offers Tensorboard integration, reproducibility features and checkpointing. MushroomRL [16] covers value-based, policy-based and actor-critic methods. While agnostic to DL frameworks in principle, it bases its DRL code on PyTorch. A goal of the framework is to support rapid prototyping by providing a comprehensive set of components that are easy to combine and extend (e.g. via callbacks) while hiding low-level details. One way to achieve this is its common interface for RL techniques which comprises both shallow and deep RL, on- and off-policy methods, batch and online training as well as episodic and infinite horizon tasks.

C. Comprehensive Frameworks

Many frameworks go beyond model-free RL. DeeR covers a wide range of RL classes, i.e. value-, policy- and model-based RL. While it is largely agnostic to DL frameworks, its examples use Keras. One unusual concept is that it provides controllers which can adapt parameters during training.

Ray [17] is a real-time AI platform developed by Berkeley’s RISELab and comes with a wider ecosystem including the RL components library RLlib [18] which resides on top of Ray, Tune for HPO, etc. Ray is based on the realization that emerging AI workloads differ from previous workloads: Rather than single predictions they require sequences of actions, their environments are dynamic rather than static, and rewards might be delayed as opposed to the immediate feedback in traditional ML settings like supervised learning. As a result, these workloads are much more heterogeneous – rather than merely maximizing GPU utilization in batch-style training, they might have more CPU-intensive and more GPU-intensive phases and they might be modeled better as dynamic task graphs. They are also more likely to combine entirely different approaches like DL, RL, Automated Planning (AP), reasoning, Monte-Carlo Tree Search (MCTS) and simulations with fine-grained data and arbitrary task dependencies. At the same time, they need to scale to hundreds or even thousands of nodes with sub-millisecond latencies and up to millions of tasks per second while remaining fault tolerant. Ray’s Python API is based on decorators and thus minimally invasive to a codebase. By adding @ray.remote to functions, they get converted into asynchronously callable remote functions that transparently put their arguments into an object store and replace their original return value by a future. The ray.remote decorator can contain CPU and GPU requirements, e.g. @ray.remote(num_gpus=1), as well as custom resources like datasets, accelerators like FPGAs or neuromorphic devices, but also memory configurations. By adding the same decorator to classes, they get converted into actors, thus delivering actor-based programming (not unlike SURREAL) through the same mechanism. The main difference between actors and remote functions is that actors can carry state, which is handy for simulators or neural networks.
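The programming model behind remote functions, where a call returns a future immediately and the result is fetched later, can be mimicked with the Python standard library. The sketch below is a toy stand-in and not Ray: the decorator `remote`, the helper `get` and the function `rollout` are assumptions, a local thread pool replaces the cluster scheduler, and there is no object store or resource specification.

```python
from concurrent.futures import Future, ThreadPoolExecutor

# A local thread pool stands in for a distributed scheduler.
_pool = ThreadPoolExecutor(max_workers=4)


def remote(func):
    """Toy decorator: adds a .remote() method that returns a future at once."""

    def submit(*args, **kwargs) -> Future:
        return _pool.submit(func, *args, **kwargs)

    func.remote = submit
    return func


def get(futures):
    """Block until the given future, or list of futures, has resolved."""
    if isinstance(futures, list):
        return [f.result() for f in futures]
    return futures.result()


@remote
def rollout(seed: int) -> int:
    # Placeholder for an expensive simulation; returns a fake episode return.
    return seed * seed


futures = [rollout.remote(seed) for seed in range(4)]  # does not block
returns = get(futures)                                 # blocks until all are done
```

With Ray itself the corresponding calls are `rollout.remote(...)` on a function decorated with `@ray.remote` and `ray.get(...)` to resolve the returned object references.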

Intel RL Coach [19] provides imitation learning and MARL besides value- and policy-based methods. It goes beyond OpenAI Gym and also supports environments like DeepMind Control Suite, Starcraft II, CARLA Gym Extensions and Roboschool. To improve reproducibility Coach employs rigorous testing (called Benchmarks) that run each algorithm against a subset of the environments used in the original paper to ensure the results match the published claims. Coach provides a dashboard that can not only compare multiple experiments, but also show e-metrics in real time as well as for multiple actors if the algorithm uses them (e.g. A3C). Coach can horizontally scale out these rollout workers with synchronization either being synchronous for on-policy methods or asynchronous for off-policy training. Similar to Ray’s dynamic task graph Coach represents agents and environments in a graph. For hierarchical RL (HRL) settings this graph can become complex and contain multiple levels and master policy agents can direct sub-policy agents. This mechanism has three stages – a heatup to fill the replay buffers, the actual training phase where the agent runs against the environment to learn a policy and finally an evaluation phase where the agent only exploits the learned policy (averaged over multiple runs) in order to assess its performance. Additional features include its input embedders, middleware layer and output heads structure for neural networks.

RLgraph [20] is focused on building backend-agnostic component graphs. It thus separates component composition from the actual backends for deep learning and distribution. For instance, it can run as a Tensorflow compute graph, in PyTorch, on top of Ray or via Uber’s / LFAI’s MPI-based Horovod. It also allows testing of subgraphs and generally accelerates rapid prototyping. The fact that the authors of Tensorforce and RLgraph largely overlap is embodied in the agent API which is similar between both projects.

D. MARL Frameworks

PyMARL is a framework out of Oxford’s Whiteson Research Lab. It is based on PyTorch and was released in conjunction with SMAC, the Starcraft Multi-Agent Challenge which is also its target environment. While MARL capabilities of frameworks like PyMARL, RLlib and Coach are generally targeted at only a few agents, MAgent [21] focuses on many-agent reinforcement learning, i.e. settings with up to one million agents on one GPU server, to research Artificial Collective Intelligence (ACI). It comes with three settings: Pursuit (yielding predator formations for hunting prey), gathering (resulting in food gathering behavior) and battle (emerging mixture of collaboration and competition).

III. RL Environments

In Artificial Intelligence one usually distinguishes environments based on seven to eight axes:

Is it simulated, situated or embodied? Simulated environments run in a separate simulation process, in situated settings the agent operates directly in an environment and embodied means that the agent has a physical manifestation in the real world.
Is it static or dynamic? Dynamic environments can change while the agent takes an action, static ones cannot.
Are the action and observation space discrete or continuous, i.e. is there a fixed number of actions or perceptions respectively?
Is it fully or partially observable, i.e. can the agent observe the entire environment at once?
Is it episodic or sequential? In episodic environments each decision is independent of earlier ones and is rewarded on its own, whereas in sequential settings current actions affect future states and a reward may only arrive after a number of steps.
Is it a single or multi-agent environment?
Is the environment known or unknown, i.e. does the agent have a model of the environment dynamics? This axis is also reflected in model-free vs model-based RL and might thus depend more on the type of RL than the type of environment.

Arguably the most relevant environment interface is OpenAI Gym [22], a standardized interface for single-agent RL. It comes with eight types of environments: textual ones, algorithmic ones, Atari 2600 games (with pixels or RAM content as input), continuous control tasks in the Box2D simulator, continuous control tasks in the commercial MuJoCo simulator, classic control theory tasks, simulated goal-based robotics tasks and custom environments. Interacting with a Gym environment tends to follow the same pattern: The environment is first reset, which returns the initial observation. Afterwards, steps are taken, each of which returns a 4-tuple consisting of the next observation, the reward, a boolean indicating whether the episode has ended and a field with debug information that should not be used in the agent’s decision process. Gym is widely used and supported by virtually all RL frameworks.
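The reset/step pattern can be sketched without installing Gym by writing a small environment that follows the same calling convention. `CorridorEnv` below is a hypothetical toy environment and not part of Gym; only the shape of `reset()` and of the 4-tuple returned by `step()` mirrors the interface described above.

```python
import random


class CorridorEnv:
    """Hypothetical 1-D corridor: start at cell 0, reach the last cell to finish."""

    def __init__(self, length=5, max_steps=50):
        self.length = length
        self.max_steps = max_steps
        self.position = 0
        self.steps = 0

    def reset(self):
        self.position = 0
        self.steps = 0
        return self.position  # initial observation

    def step(self, action):
        # Action 1 moves right, any other action moves left; walls clamp the position.
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        self.steps += 1
        at_goal = self.position == self.length - 1
        reward = 1.0 if at_goal else 0.0
        done = at_goal or self.steps >= self.max_steps
        info = {"steps": self.steps}  # debug information only
        return self.position, reward, done, info


env = CorridorEnv()
rng = random.Random(0)

observation = env.reset()
done = False
episode_return = 0.0
while not done:
    action = rng.choice([0, 1])  # a random policy stands in for an agent
    observation, reward, done, info = env.step(action)
    episode_return += reward
```

Later Gym releases and its successor Gymnasium split the single end-of-episode flag into separate `terminated` and `truncated` values, so code written against the 4-tuple form may need adapting.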

However, Gym environments are generally less well suited for specialized settings like MARL or multi-modality. Most environments are either 2D or 3D games including video games, video game engines and traditional games like card or board games. Regarding 2D this includes the Arcade Learning Environment, Facebook ELF, the MAME Toolkit, Multi-Agent Particle Environment, OpenAI Retro (which supersedes RLE and Universe), DeepMind OpenSpiel and the Hanabi Learning Environment among others. Regarding 3D this includes AI2-THOR, AnimalAI Olympics, CHALET, DeepMind Lab, DeepMind SC2LE, Facebook House3D, Habitat, Holodeck, HoME, Malmö, Matterport3D, MIT ThreeDWorld (TDW), the OpenAI Multiagent Competition, OpenAI RoboSumo, Unity ML-Agents and VizDoom.

Another particularly popular class are gridworlds and mazes including BabyAI, DeepMind pycolab, Facebook MazeBase, mazelab and many others. A new class of environments is specifically targeted at control and safety, e.g. DeepMind AI Safety Gridworlds, DeepMind Control Suite and Safety Gym. However, many other RL environments do exist, for instance for trading, task queueing, emergency response, scheduling and theorem proving.

Finally, it is feasible to leverage existing simulation software within RL environments, in particular for robotics including autonomous cars, drones and other vehicles, but also industrial robots, e.g. AirSim, Gibson env, Gym Gazebo, MINOS, MuJoCo, the Neurorobotics Platform (NRP), OpenAI RoboSchool, PaddlePaddle RLSchool, robosuite and the S-RL Toolbox. Furthermore, this includes discrete event simulation like AnyLogic, Siemens PLM or SIMUL8, chemical and engineering simulations like CHEMCAD, MATLAB & Simulink or Sinumerik, but also medical or pharmacological (e.g. GastroPlus), networking (e.g. CloudSim or NS3), military (e.g. BISim VBS4), transportation (e.g. Anylogistix) and even urban or governmental planning simulations (e.g. UrbanSim).

IV. Visualization

Visualization and dashboards are important to track and debug reinforcement learning algorithms and training progress. Currently, there are three major dashboard options for RL: For TensorBoard there is an interpretability dashboard extension by Andrew Schreiber that was recently added to TensorBoard. One distinguishing factor in this extension is that it can render the environment with added perturbation saliency heatmaps which visualize where the model attention is focused. Some RL frameworks like Stable Baselines also have explicit instructions on how to integrate with TensorBoard. If the underlying DL framework is not Tensorflow, but e.g. Chainer or PyTorch, TensorboardX can be used. An alternative is Facebook’s Visdom for more human-like RL. Intel Coach comes with its own dashboard that can visualize signals from several workers – A3C, for instance, spawns multiple actors. It also allows combining multiple runs of the algorithm into one set to account for the fact that RL algorithms tend to be unstable.

While Guo et al. [23] are primarily interested in applying DQNs to Atari games, they visualize their first- and second-layer filters and additionally use the optimal stimuli method to show which features the CNN learned, thus conveying which input patches generated the largest response.

Mnih et al. [24] use two-dimensional t-SNE embeddings to visualize the representation learned by their DQN by recording the last hidden layer representations of the DQN for each game state for 2 hours of gameplay. While it is not surprising that this procedure mapped visually similar states to points that are spatially close to each other, it is interesting that it also did so for states that are similar in estimated value. Furthermore, when mapping both human and AI states into the same space both reveal a similar structure which indicates that the learned representations generalize to data that originated from a foreign policy. Finally, they also visualize the value and action-value functions over time.

Zahavy et al. [25] contribute both a methodology and a set of tools to understand DQNs which they use to explain three Atari games. For instance, they employ three-dimensional t-SNE representations to visualize state transitions and leverage saliency maps to highlight which image regions have the highest impact on the neural network’s value predictions. Finally, they introduce Semi Aggregated MDPs (SAMDPs) that provide clear spatio-temporal abstractions leading towards subgoal detection.

V. Industry Solutions

The following section gives a broad overview of RL usage in industry. It is clearly desirable that future DRL frameworks are more targeted towards productive use in Enterprise and industrial applications.

AWS launched SageMaker RL on November 28th 2018. It supports Intel Coach, Ray RLlib and Baselines as toolkits that provide agent implementations and differentiates four kinds of environments: the AWS-specific simulation environments Sumerian and RoboMaker, open source environments like RoboSchool, Gym or EnergyPlus, custom environments and commercial simulators like MATLAB with Simulink. The latter three are offered via custom containers; MATLAB additionally requires users to manage their own license. Note that Baselines here seems to refer to Stable Baselines. Furthermore, SageMaker RL supports distributed training and HPO.

Bonsai is a Berkeley-based startup that was acquired by Microsoft on June 20th 2018. It currently has 144 employees according to LinkedIn and received Series A funding of $13.6 million according to Crunchbase. Besides Bonsai, Microsoft also recently acquired Maluuba to form Microsoft Montreal and thus its Reinforcement Learning Group. Bonsai primarily works in optimization, control as well as monitoring and maintenance. It integrates with Matlab and Simulink in order to work with engineering models developed in these tools, such as wind turbine controls. PROWLER.io (http://prowler.io/) is a Cambridge, UK-based startup with Series A funding of $14.9 million according to Crunchbase. Its Vuku architecture has two main components: A decision making component is used to choose actions and a learning system learns predictive models.

In research, DeepMind and OpenAI are among the most famous institutes. OpenAI also has an internal platform called OpenAI Rapid that seems to be primarily used by their DotA team, but that is also offered to other teams within the company. Borealis AI is an RL-focused research institute out of the Royal Bank of Canada. Imandra provides reasoning as a service including symbolic reasoning and formal verification, but also incorporates DRL with applications in finance, robotics and autonomous systems. nnaisense, Jürgen Schmidhuber’s research-focused company, applies RL to a wide range of applications including autonomous systems and finance.

Robotics is one core field of RL. micropsi industries is based in Germany and provides robotics software, in particular the machine teaching & control system MIRAI which is trained end-to-end and targeted at assembly tasks. Osaro provides combined computer vision and decision making solutions for robotic systems, especially in warehouse automation for distribution centers and manufacturing. Covariant (formerly Embodied Intelligence) is a California-based startup co-founded by Pieter Abbeel for machine teaching via deep RL and imitation learning. AI4Things builds DRL-based solutions for intelligent machines. DoraBot provides robotics for logistics and SoarTech provides autonomous systems and decision support for military applications.

Many companies in autonomous vehicle research (AVR) leverage RL, including traditional car manufacturers like BMW and Mercedes-Benz, transportation service providers like Lyft’s Palo Alto-based Level 5 Self-Driving Division and Uber’s Advanced Technologies Group as well as companies like Aptiv, Waymo or Bosch. Oxford-based LatentLogic builds realistic behavior models via imitation learning for humanoid agents in simulations for autonomous cars by first extracting and then imitating behavior. Wayve out of Cambridge University develops a self-driving car software platform.

In finance, AI Capital Management and HiHedge offer DRL solutions for trading. In e-learning, Qstream uses DRL for microlearning and Desire2Learn provides a learning management system with DRL-based learning material recommendation. In business process optimization, CogitAI develops a DRL-based continual learning platform called Continua, DataOne a platform for intelligent business decisions, InstaDeep process optimization for the energy sector, logistics, manufacturing and mobility, and MediaGamma DRL-based decision support, e.g. for advertising and customer acquisition. PerfectPattern leverages DRL for industrial process control and management. PerimeterX offers predictive security against botnet attacks. Phenomic and ProteinQure apply the technology to drug discovery. Rasa leverages DRL approaches for chatbots, dialog systems and virtual assistants. ThruAI enriches customer service with RL. OPTIMAL is a London- and Rotterdam-based company developing autonomous indoor farming solutions.

This brief market overview is also aimed at showing the discrepancy between requirements in industry and the current focus of most DRL frameworks, e.g. regarding applications, lifecycles and toolchains.

VI. Discussion

The consolidation we observed with deep learning frameworks still has to happen for RL. Most DRL experts seem to use PyTorch and TF directly rather than RL frameworks, frequently citing the flexibility and control they need for their research, which goes beyond combining existing components. MARL, model-based RL, safety and multimodality are oftentimes ignored, since the majority of frameworks focus on subsets of RL, esp. model-free approaches. Support for reproducibility, an inherent challenge since RL agents interact with dynamic environments, is beginning to emerge: Catalyst, ChainerRL, garage and Dopamine address it explicitly. Distribution is now more widely addressed, but oftentimes frameworks only support single-node parallelism rather than true multi-node horizontal scalability. While there are individual exceptions including Ray and SURREAL, cloud-native architecture and Kubernetes are usually not leveraged, even though Custom Resource Definitions (CRDs) might work very well as infrastructure abstractions for agents. Similarly, while evolutionary strategies are supported by frameworks such as garage, Lagom and RLlib and imitation learning by RLlib, RL Coach, SLM Lab, tf2rl and TF Agents, general mechanisms for curriculum learning, hierarchical RL and hybrid agent design still seem underdeveloped.
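
As a small illustration of the reproducibility point, the following sketch shows how explicit seeding of both the environment and the policy makes a rollout repeatable. The toy environment `NoisyWalk` and the function `rollout` are hypothetical names invented for this example; they are not taken from any of the frameworks discussed, which handle seeding in their own ways.

```python
import random


class NoisyWalk:
    """Hypothetical toy environment: a 1-D walk with noisy transitions."""

    def __init__(self, seed):
        # A private generator keeps the environment independent of global state.
        self.rng = random.Random(seed)
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # The action (-1 or +1) is flipped with probability 0.2.
        if self.rng.random() < 0.2:
            action = -action
        self.position += action
        reward = 1.0 if self.position > 0 else 0.0
        return self.position, reward


def rollout(env_seed, policy_seed, steps=50):
    """Run one episode with a random policy and return the total reward."""
    env = NoisyWalk(env_seed)
    policy_rng = random.Random(policy_seed)
    env.reset()
    total = 0.0
    for _ in range(steps):
        action = policy_rng.choice((-1, 1))
        _, reward = env.step(action)
        total += reward
    return total
```

Calling `rollout(0, 1)` twice yields the same return, whereas leaving either generator unseeded would make results vary between runs. Real DRL setups have further sources of nondeterminism (e.g. GPU kernels or parallel workers) that this sketch does not cover.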

From a larger perspective, it is striking that DRL systems exhibit behavior that is reminiscent of human intuition and are capable of finding novel strategies, but seem to struggle with the deep analysis one would expect from symbolic approaches. This leads us to suspect that neural-symbolic approaches could prove vital in developing AI systems that exhibit both characteristics. Finally, RL frameworks and applications are still predominantly research- and game-oriented and usually lack support for industrial and enterprise applications like digital twins or lifecycle management from coarse- to fine-grained simulation up to real-world deployment. There is still no interface for MARL and multimodality that is as universally accepted as Gym. Thus, we expect a significant shift in the medium term towards more consolidated frameworks that are closer to real-world applications, leverage cloud-native architecture more naturally and provide a wider selection of techniques.
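
To make the remark about Gym as a de facto interface more concrete, the sketch below contrasts the classic Gym call pattern (`reset()` returning an observation, `step(action)` returning observation, reward, done flag and info dictionary) with one possible multi-agent analogue in which everything is keyed by agent id. Both classes are hypothetical toy environments written without any library dependency; the dictionary-based convention is only one of several conceivable designs and is shown here as an assumption, not as an established standard.

```python
from typing import Any, Dict, Tuple


class CoinFlipEnv:
    """Hypothetical single-agent environment using the classic Gym call pattern."""

    def __init__(self, horizon: int = 3):
        self.horizon = horizon
        self.t = 0

    def reset(self) -> int:
        self.t = 0
        return self.t  # the observation is simply the time step

    def step(self, action: int) -> Tuple[int, float, bool, Dict[str, Any]]:
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        done = self.t >= self.horizon
        return self.t, reward, done, {}


class TwoAgentEnv:
    """Hypothetical multi-agent variant: all values are keyed by agent id."""

    agents = ("agent_0", "agent_1")

    def __init__(self, horizon: int = 3):
        self.horizon = horizon
        self.t = 0

    def reset(self) -> Dict[str, int]:
        self.t = 0
        return {agent: self.t for agent in self.agents}

    def step(self, actions: Dict[str, int]):
        self.t += 1
        # Both agents are rewarded for choosing the same action.
        same = actions["agent_0"] == actions["agent_1"]
        rewards = {agent: 1.0 if same else 0.0 for agent in self.agents}
        done = self.t >= self.horizon
        observations = {agent: self.t for agent in self.agents}
        dones = {agent: done for agent in self.agents}
        return observations, rewards, dones, {}
```

The single-agent loop carries over almost unchanged, but questions such as agents joining or leaving mid-episode, turn-based play or per-agent observation modalities are left open by this sketch, which hints at why agreeing on a common interface is harder than in the single-agent case.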


Name	GitHub Stars	License	Framework / Language	Algorithms	Integrated Environments	Distributed Execution	Type of RL	Types of Componentry	Affiliation

OpenAI Baselines	9200	MIT		A2C, ACER, ACKTR, DDPG, DQN, GAIL, HER, PPO, TRPO			Model-Free RL		OpenAI

Stable Baselines	1600	MIT		A2C, ACER, ACKTR, DDPG, DQN, GAIL, HER, PPO, SAC, TD3, TRPO			Model-Free RL		INRIA, ParisTech

CatalystRL	1400	Apache 2.0	PyTorch	DQN, DDPG, PPO, SAC, TD3	Gym incl. Atari, DM Control Suite	Yes	Model-Free RL		Dbrain

ChainerRL	793	MIT	Chainer	DQN, DDQN, Categorical DQN, Rainbow, IQN, DDPG, A2C, A3C, ACER, NSQ, PCL, PPO, TRPO, TD3, SAC, PAL, Double PAL, DPP, REINFORCE	Gym, ALE, Mujoco, Bullet	Only multiprocessing	Model-Free RL	Dedicated Visualizer	Preferred Networks

DeeR	435	BSD	Agnostic, Keras	DQN, DDPG, CRAR	Gym, ALE, PLE		Model-Free and Model-Based RL

Dopamine	8600	Apache 2.0	Tensorflow, Keras	DQN, C51, Rainbow, IQN			Value-Based RL		Google

garage	587	MIT	PyTorch, Tensorflow	CEM, CMA-ES, REINFORCE, DDPG, DQN, DDQN, ERWR, NPO, PPO, REPS, TD3, TNPG, TRPO			ES, Model-Free RL		Berkeley, OpenAI

Huskarl	383	MIT	Tensorflow, Keras	DQN, Multi-Step DQN, Double DQN, Dueling DQN, A2C, DDPG, PER, PPO	Gym, Unity planned	Parallelization only	Model-Free RL

KerasRL	4400	MIT	Keras	DQN, Double DQN, DDPG, CDQN / NAF, CEM, Dueling DQN, SARSA	Gym, extendable		Value-Based RL

Lagom	355	MIT	PyTorch	CEM, CMA-ES, OpenAI-ES, VPG, PPO, DDPG, TD3, SAC	Gym, DM Control Suite via dm2gym	Parallelization only	ES and Model-Free RL

MAgent	1100	MIT	Agnostic with baseline algorithms in MXNet and Tensorflow	DQN, DRQN, A2C		Scales to 1M agents on single GPU server, multi-server support unclear	MARL (Many-Agent)		Geek.ai

Mushroom RL	278	MIT	PyTorch, Agnostic	Q-Learning, SARSA, FQI, DQN, DDPG, SAC, TD3, TRPO, PPO, LSPI, PGPE, RWR, eNAC, REINFORCE, GROMDP, REPS, COPDAC-Q, R-Learning, A2C, Stochastic Actor-Critic, True Online SARSA-λ, Expected SARSA	Gym, DeepMind Control Suite, MuJoCo, PyBullet, ROS		Model-Free RL		MPI, TU Darmstadt, Politecnico di Milano

PARL	570	Apache 2.0	PaddlePaddle	DQN, DDPG, PPO, IMPALA, A2C, TD3, SAC		Yes	Model-Free RL		Baidu

PyMARL	357	Apache 2.0	PyTorch	QMIX, COMA, VDN, IQL, QTRAN	Tied to SMAC as environment		MARL

Ray, RLlib	10200	Apache 2.0	Ray, Python, DL framework agnostic	DDPG, TD3, A2C, A3C, PPO / APPO, IMPALA, Ape-X, DQN, Rainbow, Vanilla Policy Gradient, SAC, ARS, ES, QMIX, VDN, IQN, MADDPG, MARWIL		Yes	ES, MARL, Model-Free RL		Berkeley


ReAgent	2400	BSD	PyTorch, TorchScript	DQN, Double DQN, Dueling DQN, Dueling Double DQN, C51, QR-DQN, TD3, SAC			Model-Free RL		Facebook

RL Coach	1600	Apache 2.0	Tensorflow	DQN, BootstrappedDQN, UCB via Q Ensembles, QR-DQN, DDQN, Dueling DDQN with PER, MMC, NEC, N-Step Q-Learning, PAL, NAF, Categorical DQN, Rainbow, PG, A3C, PPO, SAC, ACER, Clipped PPO, DDPG, DDPG with HER, DDPG HAC, TD3, Wolpertinger, DFP, Behavioral Cloning, CIL	CARLA, Gym, Gym Extensions, Roboschool, ViZDoom, PyBullet, StarCraft, DeepMind Control Suite	Yes	IL, MARL, Model-Free RL	Algorithms, Exploration Techniques, Memory Types	Intel

RLgraph	220	Apache 2.0	Agnostic including TF, PyTorch and Ray	DQN, Double DQN, Dueling DQN, PER, DQFD, Ape-X, IMPALA, PPO, SAC, A2C, A3C, REINFORCE		Yes	Model-Free RL		University of Cambridge, rlcore, Helmut Schmidt University

rlpyt	1200	MIT	PyTorch	A2C, PPO, DQN, Double DQN, DDPG, TD3, SAC, Dueling DQN, Categorical DQN			Model-Free RL		Berkeley

simple rl	124	Apache 2.0	Python	Q-Learning, Rmax, DelayedQ, DoubleQ, Random, DQN, LinUCB, Linear Q-Learning, Planning (Value Iteration, Bounded RTDP, MCTS)			Value-Based RL, Planning

SLM Lab	656	MIT	PyTorch	SARSA, DQN, Double DQN, Dueling DQN, PER, REINFORCE, A2C, PPO, SAC, SIL	Gym, Roboschool, VizDoom, Unity	Parallelization only	Model-Free RL, IL

SURREAL	391	BSD	PyTorch	DDPG, PPO		Yes	Model-Free RL		Stanford

Tensorforce	2600	Apache 2.0	Tensorflow	DQN, Double DQN, Dueling DQN, n-step DQN, NAF, REINFORCE, A3C, PPO, TRPO, DPG	ALE, Gym, MazeExplorer, Retro, OpenSim, PLE, ViZDoom	Parallel Execution of Agent and Environment	Model-Free RL

TF Agents	1100	Apache 2.0	Tensorflow, Keras	DQN, DQN-RNN, DDQN, DDPG, TD3, REINFORCE, PPO, PPO-RNN, SAC, Behavioral Cloning	Gym, Atari, Mujoco, PyBullet, DM Control Suite, Unity ML Agents	In Development (Multi GPU, TPU)	Model-Free RL, IL	Agents, Environments, Replay Buffer	Google

TRFL	2800	Apache 2.0	Tensorflow				Model-Free RL		DeepMind

VII. Addendum

A. Frameworks

AgentNet – Obsolete DRL library based on Theano and Lasagne
ALL – Autonomous Learning Library (docs) out of UMass, supports A2C, C51, DDPG, DQN, PPO, Rainbow, SAC, built on PyTorch
Ax – Configuration optimization framework based on multi-armed bandits and Bayesian Optimization
beliefbox – Obsolete RL framework with algorithms, environments and inverse RL models
BURLAP, source
Catalyst.RL [9]
ChainerRL and ChainerRL-Visualizer [14]
contextualbandits [26]
DeepRL – Collection of RL algorithms in PyTorch [27]
DeeR
Dopamine [2]
Facebook ReAgent [13], was Horizon
FlashRL [28]
garage [15], docs, supersedes rllab
Huskarl [3]
Intel Reinforcement Learning Coach [29]
Keras-RL [4]
Lagom [10]
MAgent [21]
MDP Toolbox for Python based on INRIA’s MDP Toolbox
Maja (MMLF) – Obsolete Python-based RL framework
MushroomRL [16]
Paddle Paddle Reinforcement Learning (PARL)
PIQLE – Obsolete Java-based RL framework
PyBrain – general machine learning library that contains reinforcement-learning algorithms [30]
PyMARL – MARL framework
QLearning4K
Ray [17] from Berkeley’s RISElab and RLlib [18]
RL4J
RL-Glue [31]
RLgraph [20]
RLPy, docs [32]
rlpyt [11]
RoboTurk [33] – framework for crowdsourcing of high-quality demonstrations for imitation learning, done by Stanford Vision Lab
simple_rl [6]
SLM-Lab – Modular Deep Reinforcement Learning framework in PyTorch
Stable Baselines, docs, blog – Fork of OpenAI Baselines with unified structure and code style as well as test and documentation improvements
SURREAL [12], source, video
TeachingBox – Java-based and robot-centric toolbox for robot learning via RL
Tensorflow 2 Reinforcement Learning (tf2rl)
Tensorforce [7, 8]
TF-Agents [5] – Tensorflow library for reinforcement learning
Torch-twrl
TensorFlow Reinforcement Learning (TRFL)
VowpalWabbit

B. Environments

The following environment list was influenced by RLenv.directory.

Comprehensive Multi-Domains or Hybrid Environments

MDP [34] – Provides a simple DSL to define MDPs compatible with Gym, can render MDPs via GraphViz
OpenAI Gym (also see gym-extensions for auxiliary tasks) [22]
PaddlePaddle XWorld – simulators for RL research, contains 2D and 3D environments
PyGame Learning Environment (PLE) – PyGame interface modeled after ALE interface supporting several games
rlenvs – Gym-compatible environments for Torch, inspired by RL-Glue, end of life

C. 2D Games and Environments as well as Traditional Games

This section includes gridworlds.

2D Video Games and Simulations

Abadia Gym – Stochastic Gym environment for AbadIA (abbey of crime video game)
ALife / BugWorld – Artificial life project in simple 2D world combining evolution and RL for emerging complex behavior
Arcade Learning Environment [35]
Clean Up World Gym – Gym-style environment for Hierarchical RL
Deep RTS – Simple 2D-grid-based Real-Time Strategy (RTS) simulator by Centre for Artificial Intelligence Research (CAIR) at University of Agder, Norway
Domination-Game, docs – Competitive MARL simulation
Facebook ELF [36]
Gym Snake – Gym environment for snake including multiple snakes for MARL
Gym Sokoban – Gym environment for Sokoban
Gym Super Mario – Super Mario Bros. (NES) levels as Gym environments
HearthEnv – Gym environment for Fireplace Hearthstone simulator (strategy card game / video game), supports human input
Highway Env – Environments for tactical decision making and behavioral planning in autonomous driving
HighwaySim – Simulator to collect data for car-related RL tasks
LF2Gym [37] – Environment for the 2.5D fighting game Little Fighter 2
MADRL [38] – MARL environments pursuit evasion, waterworld, multi-agent walker and multi-ant based on Gym and forked rllab
MAME Toolkit – Arcade game environments via wrapper for the MAME multi-purpose emulation framework
mario-ai (MarIO) – Deep Q-Learning with replay buffer to play Super Mario World based on pixel input
Metacar, code – 2D environment for autonomous cars in browser
Multi-Agent Learning Environments [39] – Several MARL environments
Multi-Agent Particle Environment [40, 41] – MARL particle environment
OpenAI Gym Berkeley-Pacman – Gym environment of Berkeley Pacman, uses images as states, optionally partially observable
Obstacle Env – Gym environment for obstacle avoidance
OpenAI Retro [42] – helps turning video games into Gym environments, supports 1000 games, supersedes Retro Learning Environment [43] and OpenAI Universe
Playground – MARL version of Bomberman called Pommerman in three variants – Free For All (FFA), Team and Team Radio where each agent can communicate 2 words from a dictionary of size 8 at each step
Pokémon Battle RL Environment – Gym environment for Pokémon battles, battle simulator architecturally pluggable, but currently Pokemon Showdown is only option
Serpent.AI – Stopped development in November 2018, but was framework for developing game agents for any game, preferred native execution over Docker and VNC
ShipEnv – Simple PyGame-based ship environment
Slitherin – Snake environment, currently single snake only, based on Python 2.7, blog posts: 1, 2, 3, same author also has Atari environment
Space Battle – Simple MARL environment for abstract space ship battles
Tetris RL – PyTorch-based Tetris environment with Gym-like interface

Board Games and Other Traditional Games

DeepMind OpenSpiel
Easy21 – RL environment (and agent) for modified Blackjack
Gym Gomoku – Gym environment for Gomoku (Five-in-a-Row)
Hanabi Learning Environment – Hanabi card game environment with interface similar to Gym from DeepMind
Hold’em – Gym environment for No-Limit Texas Hold’em (NLTH), synchronous and can support arbitrary number of players
Troccas – Environment for regional card game Troccas

Grid Worlds and Mazes

BabyAI [44] – Platform from Mila for baby AI in gridworld, supports executing with human-in-the-loop based on Gym, PyTorch and PyQt, contains NLU components (”baby language”)
DeepMind pycolab
Facebook MazeBase [45]
Grid Soccer Simulator – Multi-agent soccer in gridworld
Gym Minigrid
Krazyworld – Gridworld environment for testing exploration in meta-RL setting
mazelab [46] (was gym-maze), creates grid and maze worlds
Multi Agent DRL for Autonomous Vehicle Relocation – Grid world environment for city fleet coordination

3D Video Games and Non-Realistic Simulations

AI2-THOR – The House Of inteRactions [47], realistic environment for visual AI by Allen Institute
AnimalAI Olympics [48]
CHALET [49, 50] – Cornell House Agent Learning EnvironmenT, 3D house simulator for navigation and manipulation tasks, has set of combinable rooms
DeepGTAV
DeepMind Lab – Quake 3 Arena-based virtual environment [51], includes DeepMind Psychlab [52]
DeepMind SC2LE with Python component pysc2, TorchCraft (StarCraft)
Droid – Unity package for prototyping RL agents, similar to Unity ML Agents, part of Neodroid platform
Habitat [53] – platform for embodied agent research
Facebook House3D [54]
Gym-TORCS – Environment for racing simulator TORCS, similar to Gym, but not fully compatible
Gym UnrealCV – Gym-UE integration for visual RL based on UnrealCV, can use their ModelZoo
Holodeck [55], source
HoME [56], code – Household Multimodal Environment with over 450000 3D house layouts, based on Panda3D, uses EVERT engine for 3D acoustic ray-tracing
Malmö [57] – Minecraft environment
Matterport3D [58, 59] – RL Platform for 3D environments from real panoramic RGB-D images
MineRL – Minecraft dataset
Multiworld – Multitask environments
OpenAI Multiagent Competition – Competitive MARL environments for [60] which showed that multi-agent environments with self-play can yield much more complex behavior than environment implies, end of life
OpenAI RoboSumo – competitive multi-agent environments [61]
RL 4 Biped [62] – Bipedal walking robot (BWR) using DDPG in Gazebo environment
rlTORCS – Environment for modified version of TORCS, requires Torch
StarCraft Multi-Agent Challenge (SMAC)
TerrainRLSim – Physics-based simulation environments
ThreeDWorld (TDW) – Unity-based generator for 3D environments out of MIT
ToriLLE (Toribash Learning Environment) [63] – Gym environment for the Toribash fighting game that is reminiscent of MuJoCo, requires Wine on unixoid OSs
UETorch (Unreal Engine)
Unity ML Agents [64], code (Unity Engine), Unity ML Environments – Environments on top of Unity ML Agents; marathon-envs: set of high-dimensional continuous control environments for Unity ML Agents, Obstacle Tower Environment
VizDoom (Doom) [65]

Robotics and Realistic Simulations
This section includes autonomous vehicle research.

Acrobot V-REP – Gym environment for acrobot on V-REP platform with DDPG algorithm, built on Keras-RL
AirGym – AirSim integration for Gym and Keras-RL for autonomous quadrocopter
AirSim [66] – Simulator for diverse vehicles like drones & cars, supports both UE and Unity as well as hardware-in-loop
Factory RL Gazebo – Youbot in factory environment based on gym-gazebo
Gibson env [67], source – Virtual environment simulator, has integration for onboard camera (”Goggles”)
Gym-Duckietown – Self-driving car Gym environments for Duckietown platform, platform started at MIT, simulator at Mila, also includes features for transfer to robot
Gym Gazebo [68] – Extension to original Gym for robotics via Gazebo and ROS
Gym V-REP – Gym extension based on V-REP
MINOS [69] – Simulation for multisensory indoor navigation models
MuJoCo [70], usually used via mujoco-py
Multi-contact-grasping – Grasp-and-lift process with Barrett Hand in V-REP
Neurorobotics Platform (NRP), (source)
OpenAI RoboSchool – deprecated in favor of PyBullet
PaddlePaddle RLSchool – RL environments for PaddlePaddle, currently includes elevator and quadrocopter simulation
PyBullet Gymperium – Alternative, Gym-compatible and free implementations of MuJoCo environments via Bullet physics engine with some Tensorforce agents
robosuite, designed to work well with SURREAL
Robot Gym – Gazebo Gym environment for RL-based and evolutionary robotics, runs robot in maze
Robot Learning Gym (RLG) – Robotics tasks for multiple robots, tasks and RL algorithms to get comparative results with standardized metrics
Robotiq-UR5 – MuJoCo-based simulator of UR5 robotic arm with Robotiq gripper
SdSandbox – Self-driving car simulator based on Unity and Keras as well as Nvidia PilotNet; Donkey Gym (Gym environment for donkeycar) was extracted from sdsandbox
Self Driving Sim Gym – Simple 2D Gym-style environments for RL on autonomous car setting with intersection and traffic environment
Self driving car sim – Unity-based autonomous car simulator by Udacity
S-RL Toolbox [71], docs, video – RL and State Representation Learning (SRL) for robotics with 10 Stable-Baselines algorithms, HPO, also see SRL Zoo, docs

Control and Safety

DeepMind AI Safety Gridworlds [72]
DeepMind Control Suite [73], can convert to Gym environment via dm2gym
Golds-rl-gym – Continuous control multi-agent environments
Gym Vision – Gym-based continuous control vision tasks
Safety Gym [74]

Finance, Medicine and Abstract Domains

AgentSimulator – Predator & prey simulator in Java
Banana Gym – Stochastic Gym environment based on banana selling setting
Btgym – Gym environment for Backtrader trading library
EnergyPy – RL experiments on energy environments based on Tensorflow
EnMAS – Environment for Multi-Agent Simulation, framework for specifying POMDP or POSG problems and agents with clean specification syntax and client-server architecture
GamePad [75] – Python library providing environment for Coq Interactive Theorem Proving (ITP)
Gym-bitflip – Bitflip environment, suited for Hindsight Experience Replay
Gym-BSS – Gym environment for Bike Sharing System
Gym-ERSLE – Gym environment for Emergency Response System (ERS) to solve ambulance allocation problem
GymFC, video [76] – Gym environment for intelligent flight control systems
Gym Memory – Simple 2D Gym environments for memory experiments inspired by rodent experiments
Gym Music – Abstract music Gym environment and rewards based on Magenta’s RL Tuner
Gym RLCryptocurrency – Gym environment for RLCryptocurrency
MiniWoB++ [77] – Extended version of OpenAI’s MiniWoB benchmark that can interact with the web via Selenium
Misc RL – Gym environment for ForEx trading
OpenSim RL, code – Environments with musculoskeletal model, part of the NIPS 2018 AI for Prosthetics challenge, uses OpenSim for biomechanical simulation
Personae – Environment for quantitative trading including stock and future trading
PGPortfolio – DRL framework for financial portfolio management, based on [78]
RL aqs – Queuing simulator for adaptive task assignment problems via RL control (SARSA), can run on clusters via MPI
Stock Market RL – Keras-based Gym environment for stock market trading, supports PG and DQN, not updated for 2 years
TradingGym – Gym-like Trading environment for both RL and rule-based approaches
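
The Gym-bitflip entry above notes that bit-flip tasks are suited for Hindsight Experience Replay (HER). The following sketch, under stated assumptions, indicates why: with a sparse reward (-1 unless the state equals the goal), random episodes almost never succeed, but relabeling an episode with the state it actually reached turns it into a successful one. `BitFlipEnv` and `relabel_with_final_state` are hypothetical names written for this illustration and do not reproduce the API of Gym-bitflip or of any HER implementation.

```python
import random


class BitFlipEnv:
    """Hypothetical minimal bit-flip task with a sparse reward."""

    def __init__(self, n_bits=4, seed=0):
        self.n_bits = n_bits
        self.rng = random.Random(seed)
        self.state = [0] * n_bits
        self.goal = [0] * n_bits

    def reset(self):
        self.state = [self.rng.randint(0, 1) for _ in range(self.n_bits)]
        self.goal = [self.rng.randint(0, 1) for _ in range(self.n_bits)]
        return list(self.state), list(self.goal)

    def step(self, action):
        # The action is the index of the bit to flip.
        self.state[action] ^= 1
        reward = 0.0 if self.state == self.goal else -1.0
        return list(self.state), reward


def relabel_with_final_state(transitions):
    """Hindsight relabeling with the final state of the episode as goal.

    Each transition is a dict with the keys "state", "action",
    "next_state", "goal" and "reward" (an assumed layout).
    """
    new_goal = transitions[-1]["next_state"]
    relabeled = []
    for tr in transitions:
        reward = 0.0 if tr["next_state"] == new_goal else -1.0
        relabeled.append({**tr, "goal": list(new_goal), "reward": reward})
    return relabeled
```

After relabeling, the last transition of every episode carries reward 0, so the replay buffer contains informative learning signal even if the original goal was never reached.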

Language and Communication

Azkaban – MARL environments focusing on communication
Facebook CommAI-env
ParlAI [79] – ”There are no RL code examples right now, but ParlAI is set up to use RL. E.g. we included a reward field in the action/observation message.” (source)
TextWorld [80]

D. Simulators

This list was adapted from Mark Hammond, O’Reilly AI San Francisco 2017.

Architecture and Urban Planning

Chemistry

Discrete Events

Game-Based

Mechanical and Electrical Engineering

Ahkab
CAMotics
gEDA
GNU Octave
MATLAB & Simulink – also see their REST API
OpenModelica
Qucs
Scilab Xcos
Sinumerik
Wired Logic
Wolfram SystemModeler

Medicine and Biotech

Military

Cloud and Networking

Robotics

Ardupilot
Gazebo
MuJoCo
NVIDIA Virtual Simulator for Robotics / NVIDIA Isaac Sim
RobotExpert
RobotStudio
RotorS
STDR Simulator
V-REP – robot experimentation platform

Transportation

Vehicle (Air, Land, Sea, Space)

AirSim
Baldr
Bridge Command
CARLA [81]
DeepDrive
FlightGear
General Dynamics VirtualShip
Kongsberg K-Sim – Maritime Simulation
NAUTIS Maritime Simulator
NVIDIA DRIVE Constellation
OpenBVE Train Simulator
Open Rails
PC Maritime – Maritime simulators, UK-based
Prepar3D (Lockheed Martin)
Rigs of Rods – Softbody physics simulator
Sailaway
Speed Dreams
TORCS [82]
Udacity’s Self-Driving Car Simulator
UNIGINE2
Unity: XVEHICLE Driving Simulation
Unreal Engine: Connected Vehicle Research
YachtSim

E. Algorithms

I follow the Algorithm Taxonomy from [83] and additionally follow the categorization from Intel Coach as well as OpenAI’s Key Papers in Deep RL collection.

Model-Free Methods
1 a) Value-Based Optimization

1 b) Policy-Based Optimization and Policy-Gradient

Advantage Actor Critic (A2C), synchronous version of A3C
A3C – Asynchronous Advantage Actor Critic [97]
ACER – Actor-Critic with Experience Replay [98]
ACKTR – Actor Critic using Kronecker-Factored Trust Region [99]
CPPO – Clipped PPO [100]
D4PG – Distributed Distributional Deep Deterministic Policy Gradient [101]
DPG – Deterministic Policy Gradient [102]
DDPG – Deep Deterministic Policy Gradient [103]
GAE – Generalized Advantage Estimation [104]
IPG – Interpolated Policy Gradient, alternative [105]
MADDPG – Multi-Agent Deep Deterministic Policy Gradient [106]
PPO – Proximal Policy Optimization [107] (based on A2C)
SAC – Soft Actor Critic [108]
TD3 – Twin Delayed Deep Deterministic Policy Gradient [109]
TRPO – Trust Region Policy Optimization [110]
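
Among the entries above, Generalized Advantage Estimation (GAE) is compact enough to sketch directly. Assuming the standard formulation with the TD residual δ_t = r_t + γ V(s_{t+1}) − V(s_t) and the recursion A_t = δ_t + γλ A_{t+1}, a minimal, framework-independent version could look as follows. The function name `gae_advantages` and the argument layout are choices made for this illustration.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Compute GAE advantages for a single trajectory.

    rewards: list of T rewards r_0 .. r_{T-1}
    values:  list of T + 1 value estimates V(s_0) .. V(s_T); the last
             entry is the bootstrap value (0.0 if the episode terminated).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Work backwards so that each step can reuse the advantage of its successor.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

With `lam=0` the estimate reduces to the one-step TD residual, while `lam=1` yields the discounted Monte-Carlo return minus the value baseline, which is the bias-variance trade-off the parameter λ controls.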

Model-Based Methods

AlphaZero [111]
ExIt [112]
I2A [113]
MB-MPO – Model-Based Reinforcement Learning via Meta-Policy Optimization [114]
MBMF [115]
ME-TRPO – Model-Ensemble Trust-Region Policy Optimization [116]
MVE – Model-Based Value Expansion [117]
Recurrent World Models [118]
STEVE – Stochastic Ensemble Value Expansion [119]

Hierarchical RL

FeUdal Networks [120]
HAC – Hierarchical Actor Critic [121], based on DDPG
HIRO [122]
STRAW – Strategic Attentive Writer for Learning Macro-Actions [123]

Imitation Learning and Inverse Reinforcement Learning (IRL)

CIL – Conditional Imitation Learning [124]
DeepMimic [125]
GAIL – Generative Adversarial Imitation Learning [126]
GCL – Guided Cost Learning [127]
MARWIL – Monotonic Advantage Re-Weighted Imitation Learning [128]
MetaMimic [129]
SIL – Self-Imitation Learning [130]
VDB, VAIL, VAIRL [131]

Transfer and Multitask RL

Meta Reinforcement Learning

Memory

HER – Hindsight Experience Replay [141]
MERLIN [142]
MFEC – Model-Free Episodic Control [143]
Neural Map [144]
PER – Prioritized Experience Replay [145]
RMC – Relational Memory Core [146]
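
For Prioritized Experience Replay (PER), the core sampling rule can be sketched in a few lines. The sketch assumes the proportional variant, in which transition i is drawn with probability P(i) = p_i^α / Σ_k p_k^α and the resulting bias is corrected with importance weights (N · P(i))^(−β). Normalizing the weights by the largest weight in the sampled batch is one common choice and is an assumption here; efficient implementations typically use a sum-tree instead of the linear scan shown. The function names are hypothetical.

```python
import random


def per_probabilities(priorities, alpha=0.6):
    """Turn positive priorities into sampling probabilities."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def per_sample(priorities, batch_size, alpha=0.6, beta=0.4, rng=None):
    """Sample indices proportionally to priority and return importance weights."""
    rng = rng or random.Random()
    probs = per_probabilities(priorities, alpha)
    n = len(priorities)
    indices = rng.choices(range(n), weights=probs, k=batch_size)
    # Importance weights compensate for the non-uniform sampling.
    weights = [(n * probs[i]) ** (-beta) for i in indices]
    max_w = max(weights)
    return indices, [w / max_w for w in weights]
```

Setting `alpha=0` recovers uniform sampling, and `beta=1` fully corrects the sampling bias; priorities are assumed to be strictly positive, e.g. absolute TD errors plus a small constant.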

Exploration Techniques

Distributed RL

RL Safety

CPO – Constrained Policy Optimization [164]
DDPG and Safety Layer [165]
HIRL [166]
Leave No Trace [167]
LFP [168]

Evolutionary Strategies

Further RL Algorithms

ARM – Advantage-Based Regret Minimization [171]
ARS – Augmented Random Search [172]
AS – Average Strategy Sampling, a Monte-Carlo Counterfactual Regret Minimization variant [173]
BA3C – Batch Asynchronous Advantage Actor-Critic [174]
BatchPPO [175]
BCO – Behavioral Cloning from Observation [176]
BDQ – Branching Dueling Q-Network [177]
CF-GPS – Counterfactually-Guided Policy Search [178]
COMA – Counterfactual Multi-Agent Policy Gradients [179]
CRAR – Combined Reinforcement via Abstract Representations [180]
DDRQN – Deep Distributed Recurrent Q-Networks [181] – multi-agent communication-based coordination
DFP – Direct Future Prediction [182]
DPP – Dynamic Policy Programming [183], a policy iteration method
DQFD – Deep Q-Learning from Demonstration [184]
DRQN – Deep Recurrent Q-Network [185], replaces first post-convolutional FC layer in DQN with recurrent LSTM, makes agent more robust if observation quality at runtime fluctuates
GreedyGQ [186]
IQL – Independent Q-Learning [187]
IQN – Implicit Quantile Networks [188]
MuZero [189]
OC – Option Critic Architecture [190]
PCL – Path Consistency Learning [191]
PETS – Probabilistic Ensembles with Trajectory Sampling [192]
PGQL – Policy Gradient and Q-Learning [193]
PG / REINFORCE – Policy Gradients [194]
PPO-λ [195]
Q-Prop [196]
QTRAN [197]
Reactor [198]
Stein Control Variates [199]
Trust-PCL [200]
VDN – Value Decomposition Networks [201]
Wolpertinger [202]

F. Companies

Autonomous Vehicle Research (AVR)

Aptiv, acquired NuTonomy
Argo, acquired AID
Aurora, acquired Uber Advanced Technologies Group (ATG)
BMW – Autonomous Driving
Bosch Mobility Solutions
Latent Logic – realistic behavior models for humanoid agents for autonomous cars
Mercedes Benz – Autonomous Driving
Nuro
Waymo – Google owned
Wayve – self-driving car software platform out of Cambridge University
Woven Planet, owned by Toyota, acquired Lyft Level 5 Self-Driving Division

General & Research Labs

Borealis AI – AI research institute out of Royal Bank of Canada
DeepMind – Famous for beating top human Go players with AlphaGo, success in playing Atari games and developing the Neural Turing Machine, acquired by Alphabet
Imandra – Provides reasoning as a service including symbolic reasoning and formal verification, but also incorporates DRL with applications in finance, robotics and autonomous systems
nnaisense – Jürgen Schmidhuber’s research-focused company to apply AI to wide range of applications including autonomous systems and finance
OpenAI – non-profit institute for researching safe artificial general intelligence with for-profit arm OpenAI LP
Rasa – RL-driven NLU for dialog systems and virtual assistants

Process Optimization, Finance, Security and Pharma

AI Capital Management – DRL for Trading
Bonsai – DRL solutions for industrial applications including energy systems, HVAC, manufacturing and process automation, based in Berkeley, acquired by Microsoft
Cogitai – Continua platform for decision support and business process optimization
DataOne – platform for intelligent business decisions, uses RL for model retraining
Desire2Learn – learning management system with learning material recommendation
HiHedge – DRL for Trading
InstaDeep – process optimization for energy sector, manufacturing, mobility and logistics
MediaGamma – DRL-based decision support, e.g. for advertising and customer acquisition
OPTIMAL – autonomous indoor farming solutions
PerfectPattern – Industrial Process Control and Management
PerimeterX – Bot Defender product provides predictive security against botnet attacks via RL
Phenomic – Drug discovery and antibody development with RL suggesting promising syntheses
ProteinQure – Drug discovery
Prowler – Provides VUKU platform for decision making in diverse domains like autonomous systems, finance and logistics
Qstream – Microlearning

Robotics

AI4Things – Industrial and agricultural robotics as well as personal delivery and pest control
Boston Dynamics
Covariant (was Embodied Intelligence) – robot teaching
DoraBot – Robotics for logistics
micropsi industries – MIRAI robotics platform
Osaro – Warehouse automation and robotics, e.g. for distribution centers and manufacturing
SoarTech – Autonomous systems and decision support for military

References

[1] K. Shibata, “Functions that emerge through end-to-end reinforcement learning,” CoRR, vol. abs/1703.02239, 2017. [Online]. Available: http://arxiv.org/abs/1703.02239

[2] P. S. Castro, S. Moitra, C. Gelada, S. Kumar, and M. G. Bellemare, “Dopamine: A Research Framework for Deep Reinforcement Learning,” 2018. [Online]. Available: http://arxiv.org/abs/1812.06110

[3] D. Salvadori, “huskarl,” https://github.com/danaugrs/huskarl, 2019.

[4] M. Plappert, “keras-rl,” https://github.com/keras-rl/keras-rl, 2016.

[5] S. Guadarrama and A. Korattikara, “Tf-agents: A library for reinforcement learning in tensorflow,” 2018, [Online; accessed 25-June-2019]. [Online]. Available: https://github.com/tensorflow/agents

[6] D. Abel, “simple_rl: Reproducible reinforcement learning in python,” 2019.

[7] M. Schaarschmidt, A. Kuhnle, and K. Fricke, “Tensorforce: A tensorflow library for applied reinforcement learning,” Web page, 2017. [Online]. Available: https://github.com/reinforceio/tensorforce

[8] M. Schaarschmidt, A. Kuhnle, B. Ellis, K. Fricke, F. Gessert, and E. Yoneki, “LIFT: reinforcement learning in computer systems by learning from demonstrations,” CoRR, vol. abs/1808.07903, 2018. [Online]. Available: http://arxiv.org/abs/1808.07903

[9] S. Kolesnikov and O. Hrinchuk, “Catalyst.rl: A distributed framework for reproducible RL research,” CoRR, vol. abs/1903.00027, 2019. [Online]. Available: http://arxiv.org/abs/1903.00027

[10] X. Zuo, “lagom: A pytorch infrastructure for rapid prototyping of reinforcement learning algorithms,” https://github.com/zuoxingdong/lagom, 2018.

[11] A. Stooke and P. Abbeel, “rlpyt: A research code base for deep reinforcement learning in pytorch,” 2019.

[12] L. Fan, Y. Zhu, J. Zhu, Z. Liu, O. Zeng, A. Gupta, J. Creus-Costa, S. Savarese, and L. Fei-Fei, “Surreal: Open-source reinforcement learning framework and robot manipulation benchmark,” in Conference on Robot Learning, 2018.

[13] J. Gauci, E. Conti, Y. Liang, K. Virochsiri, Y. He, Z. Kaden, V. Narayanan, and X. Ye, “Horizon: Facebook’s open source applied reinforcement learning platform,” CoRR, vol. abs/1811.00260, 2018. [Online]. Available: http://arxiv.org/abs/1811.00260

[14] Y. Fujita, T. Kataoka, P. Nagarajan, and T. Ishikawa, “Chainerrl: A deep reinforcement learning library,” 2019.

[15] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, “Benchmarking deep reinforcement learning for continuous control,” CoRR, vol. abs/1604.06778, 2016. [Online]. Available: http://arxiv.org/abs/1604.06778

[16] C. D’Eramo, D. Tateo, A. Bonarini, M. Restelli, and J. Peters, “Mushroomrl: Simplifying reinforcement learning research,” 2020.

[17] P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, W. Paul, M. I. Jordan, and I. Stoica, “Ray: A distributed framework for emerging AI applications,” CoRR, vol. abs/1712.05889, 2017. [Online]. Available: http://arxiv.org/abs/1712.05889

[18] E. Liang, R. Liaw, R. Nishihara, P. Moritz, R. Fox, J. Gonzalez, K. Goldberg, and I. Stoica, “Ray rllib: A composable and scalable reinforcement learning library,” CoRR, vol. abs/1712.09381, 2017. [Online]. Available: http://arxiv.org/abs/1712.09381

[19] I. Caspi, G. Leibovich, G. Novik, and S. Endrawis, “Reinforcement learning coach,” Dec. 2017. [Online]. Available: https://doi.org/10.5281/zenodo.1134899

[20] M. Schaarschmidt, S. Mika, K. Fricke, and E. Yoneki, “RLgraph: Flexible Computation Graphs for Deep Reinforcement Learning,” ArXiv e-prints, Oct. 2018.

[21] L. Zheng, J. Yang, H. Cai, W. Zhang, J. Wang, and Y. Yu, “Magent: A many-agent reinforcement learning platform for artificial collective intelligence,” 2017.

[22] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” 2016.

[23] X. Guo, S. Singh, H. Lee, R. L. Lewis, and X. Wang, “Deep learning for real-time atari game play using offline monte-carlo tree search planning,” in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2014, pp. 3338–3346. [Online]. Available: http://papers.nips.cc/paper/5421-deep-learning-for-real-time-atari-game-play-using-offline-monte-carlo-tree-search-planning.pdf

[24] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015. [Online]. Available: http://dx.doi.org/10.1038/nature14236

[25] T. Zahavy, N. Ben-Zrihem, and S. Mannor, “Graying the black box: Understanding dqns,” CoRR, vol. abs/1602.02658, 2016. [Online]. Available: http://arxiv.org/abs/1602.02658

[26] D. Cortes, “Adapting multi-armed bandits policies to contextual bandits scenarios,” arXiv preprint arXiv:1811.04383, 2018.

[27] S. Zhang, “Modularized implementation of deep RL algorithms in PyTorch,” https://github.com/ShangtongZhang/DeepRL, 2018.

[28] P. Andersen, M. Goodwin, and O. Granmo, “Flashrl: A reinforcement learning platform for flash games,” CoRR, vol. abs/1801.08841, 2018. [Online]. Available: http://arxiv.org/abs/1801.08841

[29] I. Caspi, G. Leibovich, G. Novik, and S. Endrawis, “Reinforcement learning coach,” Dec. 2017. [Online]. Available: https://doi.org/10.5281/zenodo.1134899

[30] T. Schaul, J. Bayer, D. Wierstra, Y. Sun, M. Felder, F. Sehnke, T. Rückstieß, and J. Schmidhuber, “Pybrain,” J. Mach. Learn. Res., vol. 11, pp. 743–746, Mar. 2010. [Online]. Available: http://dl.acm.org/citation.cfm?id=1756006.1756030

[31] B. Tanner and A. White, “Rl-glue: Language-independent software for reinforcement-learning experiments,” J. Mach. Learn. Res., vol. 10, pp. 2133–2136, Dec. 2009. [Online]. Available: http://dl.acm.org/citation.cfm?id=1577069.1755857

[32] A. Geramifard, C. Dann, R. H. Klein, W. Dabney, and J. P. How, “Rlpy: A value-function-based reinforcement learning framework for education and research,” Journal of Machine Learning Research, vol. 16, no. 46, pp. 1573–1578, 2015. [Online]. Available: http://jmlr.org/papers/v16/geramifard15a.html

[33] A. Mandlekar, Y. Zhu, A. Garg, J. Booher, M. Spero, A. Tung, J. Gao, J. Emmons, A. Gupta, E. Orbay, S. Savarese, and L. Fei-Fei, “Roboturk: A crowdsourcing platform for robotic skill learning through imitation,” in Conference on Robot Learning, 2018.

[34] A. Kirsch, “MDP environments for the openai gym,” CoRR, vol. abs/1709.09069, 2017. [Online]. Available: http://arxiv.org/abs/1709.09069

[35] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, “The arcade learning environment: An evaluation platform for general agents,” CoRR, vol. abs/1207.4708, 2012. [Online]. Available: http://arxiv.org/abs/1207.4708

[36] Y. Tian, Q. Gong, W. Shang, Y. Wu, and L. Zitnick, “ELF: an extensive, lightweight and flexible research platform for real-time strategy games,” CoRR, vol. abs/1707.01067, 2017. [Online]. Available: http://arxiv.org/abs/1707.01067

[37] Y. Li, H. Chang, Y. Lin, P. Wu, and Y. F. Wang, “Deep reinforcement learning for playing 2.5d fighting games,” CoRR, vol. abs/1805.02070, 2018. [Online]. Available: http://arxiv.org/abs/1805.02070

[38] J. K. Gupta, M. Egorov, and M. Kochenderfer, “Cooperative multi-agent control using deep reinforcement learning,” in International Conference on Autonomous Agents and Multiagent Systems. Springer, 2017, pp. 66–83.

[39] S. Jiang, “Multi agent reinforcement learning environments compilation,” 2019.

[40] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” CoRR, vol. abs/1706.02275, 2017. [Online]. Available: http://arxiv.org/abs/1706.02275

[41] I. Mordatch and P. Abbeel, “Emergence of grounded compositional language in multi-agent populations,” CoRR, vol. abs/1703.04908, 2017. [Online]. Available: http://arxiv.org/abs/1703.04908

[42] A. Nichol, V. Pfau, C. Hesse, O. Klimov, and J. Schulman, “Gotta learn fast: A new benchmark for generalization in rl,” arXiv preprint arXiv:1804.03720, 2018.

[43] N. Bhonker, S. Rozenberg, and I. Hubara, “Playing snes in the retro learning environment,” arXiv preprint arXiv:1611.02205, 2016.

[44] M. Chevalier-Boisvert, D. Bahdanau, S. Lahlou, L. Willems, C. Saharia, T. H. Nguyen, and Y. Bengio, “Babyai: First steps towards grounded language learning with a human in the loop,” CoRR, vol. abs/1810.08272, 2018. [Online]. Available: http://arxiv.org/abs/1810.08272

[45] S. Sukhbaatar, A. Szlam, G. Synnaeve, S. Chintala, and R. Fergus, “Mazebase: A sandbox for learning from games,” CoRR, vol. abs/1511.07401, 2015. [Online]. Available: http://arxiv.org/abs/1511.07401

[46] X. Zuo, “mazelab: A customizable framework to create maze and gridworld environments.” https://github.com/zuoxingdong/mazelab, 2018.

[47] E. Kolve, R. Mottaghi, D. Gordon, Y. Zhu, A. Gupta, and A. Farhadi, “AI2-THOR: an interactive 3d environment for visual AI,” CoRR, vol. abs/1712.05474, 2017. [Online]. Available: http://arxiv.org/abs/1712.05474

[48] B. Beyret, J. Hernández-Orallo, L. Cheke, M. Halina, M. Shanahan, and M. Crosby, “The animal-ai environment: Training and testing animal-like artificial cognition,” 2019.

[49] C. Yan, D. K. Misra, A. Bennett, A. Walsman, Y. Bisk, and Y. Artzi, “CHALET: cornell house agent learning environment,” CoRR, vol. abs/1801.07357, 2018. [Online]. Available: http://arxiv.org/abs/1801.07357

[50] D. K. Misra, A. Bennett, V. Blukis, E. Niklasson, M. Shatkhin, and Y. Artzi, “Mapping instructions to actions in 3d environments with visual goal prediction,” CoRR, vol. abs/1809.00786, 2018. [Online]. Available: http://arxiv.org/abs/1809.00786

[51] C. Beattie, J. Z. Leibo, D. Teplyashin, T. Ward, M. Wainwright, H. Küttler, A. Lefrancq, S. Green, V. Valdés, A. Sadik, J. Schrittwieser, K. Anderson, S. York, M. Cant, A. Cain, A. Bolton, S. Gaffney, H. King, D. Hassabis, S. Legg, and S. Petersen, “Deepmind lab,” CoRR, vol. abs/1612.03801, 2016. [Online]. Available: http://arxiv.org/abs/1612.03801

[52] J. Z. Leibo, C. de Masson d’Autume, D. Zoran, D. Amos, C. Beattie, K. Anderson, A. G. Castañeda, M. Sanchez, S. Green, A. Gruslys, S. Legg, D. Hassabis, and M. Botvinick, “Psychlab: A psychology laboratory for deep reinforcement learning agents,” CoRR, vol. abs/1801.08116, 2018. [Online]. Available: http://arxiv.org/abs/1801.08116

[53] M. Savva, A. Kadian, O. Maksymets, Y. Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V. Koltun, J. Malik, D. Parikh, and D. Batra, “Habitat: A platform for embodied AI research,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.

[54] Y. Wu, Y. Wu, G. Gkioxari, and Y. Tian, “Building generalizable agents with a realistic and rich 3d environment,” CoRR, vol. abs/1801.02209, 2018. [Online]. Available: http://arxiv.org/abs/1801.02209

[55] J. Greaves, M. Robinson, N. Walton, M. Mortensen, R. Pottorff, C. Christopherson, D. Hancock, and D. Wingate, “Holodeck: A high fidelity simulator,” 2018.

[56] S. Brodeur, E. Perez, A. Anand, F. Golemo, L. Celotti, F. Strub, J. Rouat, H. Larochelle, and A. C. Courville, “Home: a household multimodal environment,” CoRR, vol. abs/1711.11017, 2017. [Online]. Available: http://arxiv.org/abs/1711.11017

[57] M. Johnson, K. Hofmann, T. Hutton, and D. Bignell, “The malmo platform for artificial intelligence experimentation.” AAAI – Association for the Advancement of Artificial Intelligence, July 2016. [Online]. Available: https://www.microsoft.com/en-us/research/publication/malmo-platform-artificial-intelligence-experimentation/

[58] P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sünderhauf, I. D. Reid, S. Gould, and A. van den Hengel, “Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments,” CoRR, vol. abs/1711.07280, 2017. [Online]. Available: http://arxiv.org/abs/1711.07280

[59] A. X. Chang, A. Dai, T. A. Funkhouser, M. Halber, M. Nießner, M. Savva, S. Song, A. Zeng, and Y. Zhang, “Matterport3d: Learning from rgb-d data in indoor environments,” 2017 International Conference on 3D Vision (3DV), pp. 667–676, 2017.

[60] T. Bansal, J. Pachocki, S. Sidor, I. Sutskever, and I. Mordatch, “Emergent complexity via multi-agent competition,” CoRR, vol. abs/1710.03748, 2017. [Online]. Available: http://arxiv.org/abs/1710.03748

[61] M. Al-Shedivat, T. Bansal, Y. Burda, I. Sutskever, I. Mordatch, and P. Abbeel, “Continuous adaptation via meta-learning in nonstationary and competitive environments,” CoRR, vol. abs/1710.03641, 2017. [Online]. Available: http://arxiv.org/abs/1710.03641

[62] A. Kumar, N. Paul, and S. N. Omkar, “Bipedal walking robot using deep deterministic policy gradient,” CoRR, vol. abs/1807.05924, 2018. [Online]. Available: http://arxiv.org/abs/1807.05924

[63] A. Kanervisto and V. Hautamäki, “Torille: Learning environment for hand-to-hand combat,” CoRR, vol. abs/1807.10110, 2018. [Online]. Available: http://arxiv.org/abs/1807.10110

[64] A. Juliani, V. Berges, E. Vckay, Y. Gao, H. Henry, M. Mattar, and D. Lange, “Unity: A general platform for intelligent agents,” CoRR, vol. abs/1809.02627, 2018. [Online]. Available: http://arxiv.org/abs/1809.02627

[65] M. Kempka, M. Wydmuch, G. Runc, J. Toczek, and W. Jaskowski, “Vizdoom: A doom-based AI research platform for visual reinforcement learning,” CoRR, vol. abs/1605.02097, 2016. [Online]. Available: http://arxiv.org/abs/1605.02097

[66] S. Shah, D. Dey, C. Lovett, and A. Kapoor, “Airsim: High-fidelity visual and physical simulation for autonomous vehicles,” in Field and Service Robotics, 2017. [Online]. Available: https://arxiv.org/abs/1705.05065

[67] F. Xia, A. R. Zamir, Z.-Y. He, A. Sax, J. Malik, and S. Savarese, “Gibson env: real-world perception for embodied agents,” in 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2018.

[68] I. Zamora, N. G. Lopez, V. M. Vilches, and A. H. Cordero, “Extending the openai gym for robotics: a toolkit for reinforcement learning using ROS and gazebo,” CoRR, vol. abs/1608.05742, 2016. [Online]. Available: http://arxiv.org/abs/1608.05742

[69] M. Savva, A. X. Chang, A. Dosovitskiy, T. A. Funkhouser, and V. Koltun, “MINOS: multimodal indoor simulator for navigation in complex environments,” CoRR, vol. abs/1712.03931, 2017. [Online]. Available: http://arxiv.org/abs/1712.03931

[70] E. Todorov, T. Erez, and Y. Tassa, “Mujoco: A physics engine for model-based control,” 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033, 2012.

[71] A. Raffin, A. Hill, R. Traoré, T. Lesort, N. Díaz-Rodríguez, and D. Filliat, “S-rl toolbox: Environments, datasets and evaluation metrics for state representation learning,” arXiv preprint arXiv:1809.09369, 2018.

[72] J. Leike, M. Martic, V. Krakovna, P. A. Ortega, T. Everitt, A. Lefrancq, L. Orseau, and S. Legg, “AI safety gridworlds,” CoRR, vol. abs/1711.09883, 2017. [Online]. Available: http://arxiv.org/abs/1711.09883

[73] Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. de Las Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, T. P. Lillicrap, and M. A. Riedmiller, “Deepmind control suite,” CoRR, vol. abs/1801.00690, 2018. [Online]. Available: http://arxiv.org/abs/1801.00690

[74] A. Ray, J. Achiam, and D. Amodei, “Benchmarking Safe Exploration in Deep Reinforcement Learning,” 2019.

[75] D. Huang, P. Dhariwal, D. Song, and I. Sutskever, “Gamepad: A learning environment for theorem proving,” CoRR, vol. abs/1806.00608, 2018. [Online]. Available: http://arxiv.org/abs/1806.00608

[76] W. Koch, R. Mancuso, R. West, and A. Bestavros, “Reinforcement learning for uav attitude control,” ACM Transactions on Cyber-Physical Systems, vol. 3, no. 2, p. 22, 2019.

[77] E. Z. Liu, K. Guu, P. Pasupat, T. Shi, and P. Liang, “Reinforcement learning on web interfaces using workflow-guided exploration,” CoRR, vol. abs/1802.08802, 2018. [Online]. Available: http://arxiv.org/abs/1802.08802

[78] Z. Jiang, D. Xu, and J. Liang, “A deep reinforcement learning framework for the financial portfolio management problem,” Jun. 2017.

[79] A. H. Miller, W. Feng, A. Fisch, J. Lu, D. Batra, A. Bordes, D. Parikh, and J. Weston, “Parlai: A dialog research software platform,” CoRR, vol. abs/1705.06476, 2017. [Online]. Available: http://arxiv.org/abs/1705.06476

[80] M.-A. Côté, A. Kádár, X. Yuan, B. Kybartas, T. Barnes, E. Fine, J. Moore, M. Hausknecht, L. E. Asri, M. Adada, W. Tay, and A. Trischler, “Textworld: A learning environment for text-based games,” CoRR, vol. abs/1806.11532, 2018.

[81] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “CARLA: An open urban driving simulator,” in Proceedings of the 1st Annual Conference on Robot Learning, 2017, pp. 1–16.

[82] B. Wymann, C. Dimitrakakis, A. D. Sumner, E. Espié, and C. Guionneau, “TORCS: The open racing car simulator,” 2015.

[83] L. Graesser and W. Keng, Foundations of Deep Reinforcement Learning: Theory and Practice in Python, ser. Addison-Wesley Data & Analytics Series. Pearson Education, 2019. [Online]. Available: https://books.google.com/books?id=0HW7DwAAQBAJ

[84] M. G. Bellemare, W. Dabney, and R. Munos, “A distributional perspective on reinforcement learning,” CoRR, vol. abs/1707.06887, 2017. [Online]. Available: http://arxiv.org/abs/1707.06887

[85] H. van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double q-learning,” CoRR, vol. abs/1509.06461, 2015. [Online]. Available: http://arxiv.org/abs/1509.06461

[86] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A. Riedmiller, “Playing atari with deep reinforcement learning,” CoRR, vol. abs/1312.5602, 2013. [Online]. Available: http://arxiv.org/abs/1312.5602

[87] Z. Wang, N. de Freitas, and M. Lanctot, “Dueling network architectures for deep reinforcement learning,” CoRR, vol. abs/1511.06581, 2015. [Online]. Available: http://arxiv.org/abs/1511.06581

[88] G. Ostrovski, M. G. Bellemare, A. van den Oord, and R. Munos, “Count-based exploration with neural density models,” CoRR, vol. abs/1703.01310, 2017. [Online]. Available: http://arxiv.org/abs/1703.01310

[89] S. Gu, T. P. Lillicrap, I. Sutskever, and S. Levine, “Continuous deep q-learning with model-based acceleration,” CoRR, vol. abs/1603.00748, 2016. [Online]. Available: http://arxiv.org/abs/1603.00748

[90] A. Pritzel, B. Uria, S. Srinivasan, A. P. Badia, O. Vinyals, D. Hassabis, D. Wierstra, and C. Blundell, “Neural episodic control,” CoRR, vol. abs/1703.01988, 2017. [Online]. Available: http://arxiv.org/abs/1703.01988

[91] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” CoRR, vol. abs/1602.01783, 2016. [Online]. Available: http://arxiv.org/abs/1602.01783

[92] M. G. Bellemare, G. Ostrovski, A. Guez, P. S. Thomas, and R. Munos, “Increasing the action gap: New operators for reinforcement learning,” CoRR, vol. abs/1512.04860, 2015. [Online]. Available: http://arxiv.org/abs/1512.04860

[93] T. Rashid, M. Samvelyan, C. S. de Witt, G. Farquhar, J. N. Foerster, and S. Whiteson, “QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning,” CoRR, vol. abs/1803.11485, 2018. [Online]. Available: http://arxiv.org/abs/1803.11485

[94] W. Dabney, M. Rowland, M. G. Bellemare, and R. Munos, “Distributional reinforcement learning with quantile regression,” CoRR, vol. abs/1710.10044, 2017. [Online]. Available: http://arxiv.org/abs/1710.10044

[95] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, and S. Levine, “Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation,” CoRR, vol. abs/1806.10293, 2018. [Online]. Available: http://arxiv.org/abs/1806.10293

[96] M. Hessel, J. Modayil, H. van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver, “Rainbow: Combining improvements in deep reinforcement learning,” 2017.

[97] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” CoRR, vol. abs/1602.01783, 2016. [Online]. Available: http://arxiv.org/abs/1602.01783

[98] Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. de Freitas, “Sample efficient actor-critic with experience replay,” CoRR, vol. abs/1611.01224, 2016. [Online]. Available: http://arxiv.org/abs/1611.01224

[99] Y. Wu, E. Mansimov, S. Liao, R. B. Grosse, and J. Ba, “Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation,” CoRR, vol. abs/1708.05144, 2017. [Online]. Available: http://arxiv.org/abs/1708.05144

[100] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” CoRR, vol. abs/1707.06347, 2017. [Online]. Available: http://arxiv.org/abs/1707.06347

[101] G. Barth-Maron, M. W. Hoffman, D. Budden, W. Dabney, D. Horgan, D. TB, A. Muldal, N. Heess, and T. P. Lillicrap, “Distributed distributional deterministic policy gradients,” CoRR, vol. abs/1804.08617, 2018. [Online]. Available: http://arxiv.org/abs/1804.08617

[102] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. A. Riedmiller, “Deterministic policy gradient algorithms,” in ICML, 2014.

[103] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” 2015.

[104] J. Schulman, P. Moritz, S. Levine, M. I. Jordan, and P. Abbeel, “High-dimensional continuous control using generalized advantage estimation,” CoRR, vol. abs/1506.02438, 2015. [Online]. Available: http://arxiv.org/abs/1506.02438

[105] S. S. Gu, T. Lillicrap, R. E. Turner, Z. Ghahramani, B. Schölkopf, and S. Levine, “Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning,” in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017, pp. 3846–3855. [Online]. Available: http://papers.nips.cc/paper/6974-interpolated-policy-gradient-merging-on-policy-and-off-policy-gradient-estimation-for-deep-reinforcement-learning.pdf

[106] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” CoRR, vol. abs/1706.02275, 2017. [Online]. Available: http://arxiv.org/abs/1706.02275

[107] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” CoRR, vol. abs/1707.06347, 2017. [Online]. Available: http://arxiv.org/abs/1707.06347

[108] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” CoRR, vol. abs/1801.01290, 2018. [Online]. Available: http://arxiv.org/abs/1801.01290

[109] S. Fujimoto, H. van Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” CoRR, vol. abs/1802.09477, 2018. [Online]. Available: http://arxiv.org/abs/1802.09477

[110] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel, “Trust region policy optimization,” CoRR, vol. abs/1502.05477, 2015. [Online]. Available: http://arxiv.org/abs/1502.05477

[111] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. P. Lillicrap, K. Simonyan, and D. Hassabis, “Mastering chess and shogi by self-play with a general reinforcement learning algorithm,” CoRR, vol. abs/1712.01815, 2017. [Online]. Available: http://arxiv.org/abs/1712.01815

[112] T. Anthony, Z. Tian, and D. Barber, “Thinking fast and slow with deep learning and tree search,” CoRR, vol. abs/1705.08439, 2017. [Online]. Available: http://arxiv.org/abs/1705.08439

[113] T. Weber, S. Racanière, D. P. Reichert, L. Buesing, A. Guez, D. J. Rezende, A. P. Badia, O. Vinyals, N. Heess, Y. Li, R. Pascanu, P. Battaglia, D. Silver, and D. Wierstra, “Imagination-augmented agents for deep reinforcement learning,” CoRR, vol. abs/1707.06203, 2017. [Online]. Available: http://arxiv.org/abs/1707.06203

[114] I. Clavera, J. Rothfuss, J. Schulman, Y. Fujita, T. Asfour, and P. Abbeel, “Model-based reinforcement learning via meta-policy optimization,” CoRR, vol. abs/1809.05214, 2018. [Online]. Available: http://arxiv.org/abs/1809.05214

[115] A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine, “Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning,” CoRR, vol. abs/1708.02596, 2017. [Online]. Available: http://arxiv.org/abs/1708.02596

[116] T. Kurutach, I. Clavera, Y. Duan, A. Tamar, and P. Abbeel, “Model-ensemble trust-region policy optimization,” CoRR, vol. abs/1802.10592, 2018. [Online]. Available: http://arxiv.org/abs/1802.10592

[117] V. Feinberg, A. Wan, I. Stoica, M. I. Jordan, J. E. Gonzalez, and S. Levine, “Model-based value estimation for efficient model-free reinforcement learning,” CoRR, vol. abs/1803.00101, 2018. [Online]. Available: http://arxiv.org/abs/1803.00101

[118] D. Ha and J. Schmidhuber, “Recurrent world models facilitate policy evolution,” CoRR, vol. abs/1809.01999, 2018. [Online]. Available: http://arxiv.org/abs/1809.01999

[119] J. Buckman, D. Hafner, G. Tucker, E. Brevdo, and H. Lee, “Sample-efficient reinforcement learning with stochastic ensemble value expansion,” CoRR, vol. abs/1807.01675, 2018. [Online]. Available: http://arxiv.org/abs/1807.01675

[120] A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu, “Feudal networks for hierarchical reinforcement learning,” CoRR, vol. abs/1703.01161, 2017. [Online]. Available: http://arxiv.org/abs/1703.01161

[121] A. Levy, R. Platt Jr., and K. Saenko, “Hierarchical actor-critic,” CoRR, vol. abs/1712.00948, 2017. [Online]. Available: http://arxiv.org/abs/1712.00948

[122] O. Nachum, S. Gu, H. Lee, and S. Levine, “Data-efficient hierarchical reinforcement learning,” CoRR, vol. abs/1805.08296, 2018. [Online]. Available: http://arxiv.org/abs/1805.08296

[123] A. Vezhnevets, V. Mnih, J. Agapiou, S. Osindero, A. Graves, O. Vinyals, and K. Kavukcuoglu, “Strategic attentive writer for learning macro-actions,” CoRR, vol. abs/1606.04695, 2016. [Online]. Available: http://arxiv.org/abs/1606.04695

[124] F. Codevilla, M. Müller, A. Dosovitskiy, A. López, and V. Koltun, “End-to-end driving via conditional imitation learning,” CoRR, vol. abs/1710.02410, 2017. [Online]. Available: http://arxiv.org/abs/1710.02410

[125] X. B. Peng, P. Abbeel, S. Levine, and M. van de Panne, “Deepmimic: Example-guided deep reinforcement learning of physics-based character skills,” CoRR, vol. abs/1804.02717, 2018. [Online]. Available: http://arxiv.org/abs/1804.02717

[126] J. Ho and S. Ermon, “Generative adversarial imitation learning,” CoRR, vol. abs/1606.03476, 2016. [Online]. Available: http://arxiv.org/abs/1606.03476

[127] C. Finn, S. Levine, and P. Abbeel, “Guided cost learning: Deep inverse optimal control via policy optimization,” CoRR, vol. abs/1603.00448, 2016. [Online]. Available: http://arxiv.org/abs/1603.00448

[128] Q. Wang, J. Xiong, L. Han, P. Sun, H. Liu, and T. Zhang, “Exponentially weighted imitation learning for batched historical data,” in Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds. Curran Associates, Inc., 2018, pp. 6288–6297. [Online]. Available: http://papers.nips.cc/paper/7866-exponentially-weighted-imitation-learning-for-batched-historical-data.pdf

[129] T. L. Paine, S. G. Colmenarejo, Z. Wang, S. E. Reed, Y. Aytar, T. Pfaff, M. W. Hoffman, G. Barth-Maron, S. Cabi, D. Budden, and N. de Freitas, “One-shot high-fidelity imitation: Training large-scale deep nets with RL,” CoRR, vol. abs/1810.05017, 2018. [Online]. Available: http://arxiv.org/abs/1810.05017

[130] J. Oh, Y. Guo, S. Singh, and H. Lee, “Self-imitation learning,” CoRR, vol. abs/1806.05635, 2018. [Online]. Available: http://arxiv.org/abs/1806.05635

[131] X. B. Peng, A. Kanazawa, S. Toyer, P. Abbeel, and S. Levine, “Variational discriminator bottleneck: Improving imitation learning, inverse rl, and gans by constraining information flow,” CoRR, vol. abs/1810.00821, 2018. [Online]. Available: http://arxiv.org/abs/1810.00821

[132] S. Cabi, S. G. Colmenarejo, M. W. Hoffman, M. Denil, Z. Wang, and N. de Freitas, “The intentional unintentional agent: Learning to solve many continuous control tasks simultaneously,” CoRR, vol. abs/1707.03300, 2017. [Online]. Available: http://arxiv.org/abs/1707.03300

[133] M. Wulfmeier, I. Posner, and P. Abbeel, “Mutual alignment transfer learning,” CoRR, vol. abs/1707.07907, 2017. [Online]. Available: http://arxiv.org/abs/1707.07907

[134] C. Fernando, D. Banarse, C. Blundell, Y. Zwols, D. Ha, A. A. Rusu, A. Pritzel, and D. Wierstra, “Pathnet: Evolution channels gradient descent in super neural networks,” CoRR, vol. abs/1701.08734, 2017. [Online]. Available: http://arxiv.org/abs/1701.08734

[135] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell, “Progressive neural networks,” CoRR, vol. abs/1606.04671, 2016. [Online]. Available: http://arxiv.org/abs/1606.04671

[136] M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu, “Reinforcement learning with unsupervised auxiliary tasks,” CoRR, vol. abs/1611.05397, 2016. [Online]. Available: http://arxiv.org/abs/1611.05397

[137] T. Schaul, D. Horgan, K. Gregor, and D. Silver, “Universal value function approximators,” in Proceedings of the 32nd International Conference on Machine Learning – Volume 37, ser. ICML’15. JMLR.org, 2015, pp. 1312–1320. [Online]. Available: http://dl.acm.org/citation.cfm?id=3045118.3045258

[138] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” CoRR, vol. abs/1703.03400, 2017. [Online]. Available: http://arxiv.org/abs/1703.03400

[139] Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel, “RL$^2$: Fast reinforcement learning via slow reinforcement learning,” CoRR, vol. abs/1611.02779, 2016. [Online]. Available: http://arxiv.org/abs/1611.02779

[140] N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel, “A simple neural attentive meta-learner,” CoRR, vol. abs/1707.03141, 2017. [Online]. Available: http://arxiv.org/abs/1707.03141

[141] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba, “Hindsight experience replay,” CoRR, vol. abs/1707.01495, 2017. [Online]. Available: http://arxiv.org/abs/1707.01495

[142] G. Wayne, C. Hung, D. Amos, M. Mirza, A. Ahuja, A. Grabska-Barwinska, J. W. Rae, P. Mirowski, J. Z. Leibo, A. Santoro, M. Gemici, M. Reynolds, T. Harley, J. Abramson, S. Mohamed, D. J. Rezende, D. Saxton, A. Cain, C. Hillier, D. Silver, K. Kavukcuoglu, M. Botvinick, D. Hassabis, and T. P. Lillicrap, “Unsupervised predictive memory in a goal-directed agent,” CoRR, vol. abs/1803.10760, 2018. [Online]. Available: http://arxiv.org/abs/1803.10760

[143] C. Blundell, B. Uria, A. Pritzel, Y. Li, A. Ruderman, J. Z. Leibo, J. W. Rae, D. Wierstra, and D. Hassabis, “Model-free episodic control,” CoRR, vol. abs/1606.04460, 2016. [Online]. Available: http://arxiv.org/abs/1606.04460

[144] E. Parisotto and R. Salakhutdinov, “Neural map: Structured memory for deep reinforcement learning,” CoRR, vol. abs/1702.08360, 2017. [Online]. Available: http://arxiv.org/abs/1702.08360

[145] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” CoRR, vol. abs/1511.05952, 2015. [Online]. Available: http://arxiv.org/abs/1511.05952

[146] A. Santoro, R. Faulkner, D. Raposo, J. W. Rae, M. Chrzanowski, T. Weber, D. Wierstra, O. Vinyals, R. Pascanu, and T. P. Lillicrap, “Relational recurrent neural networks,” CoRR, vol. abs/1806.01822, 2018. [Online]. Available: http://arxiv.org/abs/1806.01822

[147] I. Osband, C. Blundell, A. Pritzel, and B. Van Roy, “Deep exploration via bootstrapped DQN,” CoRR, vol. abs/1602.04621, 2016. [Online]. Available: http://arxiv.org/abs/1602.04621

[148] M. G. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos, “Unifying count-based exploration and intrinsic motivation,” CoRR, vol. abs/1606.01868, 2016. [Online]. Available: http://arxiv.org/abs/1606.01868

[149] B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine, “Diversity is all you need: Learning skills without a reward function,” CoRR, vol. abs/1802.06070, 2018. [Online]. Available: http://arxiv.org/abs/1802.06070

[150] J. Fu, J. D. Co-Reyes, and S. Levine, “EX2: exploration with exemplar models for deep reinforcement learning,” CoRR, vol. abs/1703.01260, 2017. [Online]. Available: http://arxiv.org/abs/1703.01260

[151] H. Tang, R. Houthooft, D. Foote, A. Stooke, X. Chen, Y. Duan, J. Schulman, F. De Turck, and P. Abbeel, “#exploration: A study of count-based exploration for deep reinforcement learning,” CoRR, vol. abs/1611.04717, 2016. [Online]. Available: http://arxiv.org/abs/1611.04717

[152] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, “Curiosity-driven exploration by self-supervised prediction,” 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Jul. 2017. [Online]. Available: http://dx.doi.org/10.1109/CVPRW.2017.70

[153] M. Fortunato, M. G. Azar, B. Piot, J. Menick, I. Osband, A. Graves, V. Mnih, R. Munos, D. Hassabis, O. Pietquin, C. Blundell, and S. Legg, “Noisy networks for exploration,” CoRR, vol. abs/1706.10295, 2017. [Online]. Available: http://arxiv.org/abs/1706.10295

[154] G. Ostrovski, M. G. Bellemare, A. van den Oord, and R. Munos, “Count-based exploration with neural density models,” CoRR, vol. abs/1703.01310, 2017. [Online]. Available: http://arxiv.org/abs/1703.01310

[155] Y. Burda, H. Edwards, A. J. Storkey, and O. Klimov, “Exploration by random network distillation,” CoRR, vol. abs/1810.12894, 2018. [Online]. Available: http://arxiv.org/abs/1810.12894

[156] R. Y. Chen, S. Sidor, P. Abbeel, and J. Schulman, “UCB and infogain exploration via Q-ensembles,” CoRR, vol. abs/1706.01502, 2017. [Online]. Available: http://arxiv.org/abs/1706.01502

[157] J. Achiam, H. Edwards, D. Amodei, and P. Abbeel, “Variational option discovery algorithms,” CoRR, vol. abs/1807.10299, 2018. [Online]. Available: http://arxiv.org/abs/1807.10299

[158] K. Gregor, D. J. Rezende, and D. Wierstra, “Variational intrinsic control,” CoRR, vol. abs/1611.07507, 2016. [Online]. Available: http://arxiv.org/abs/1611.07507

[159] R. Houthooft, X. Chen, Y. Duan, J. Schulman, F. De Turck, and P. Abbeel, “Curiosity-driven exploration in deep reinforcement learning via bayesian neural networks,” CoRR, vol. abs/1605.09674, 2016. [Online]. Available: http://arxiv.org/abs/1605.09674

[160] D. Horgan, J. Quan, D. Budden, G. Barth-Maron, M. Hessel, H. van Hasselt, and D. Silver, “Distributed prioritized experience replay,” CoRR, vol. abs/1803.00933, 2018. [Online]. Available: http://arxiv.org/abs/1803.00933

[161] A. Nair, P. Srinivasan, S. Blackwell, C. Alcicek, R. Fearon, A. De Maria, V. Panneershelvam, M. Suleyman, C. Beattie, S. Petersen, S. Legg, V. Mnih, K. Kavukcuoglu, and D. Silver, “Massively parallel methods for deep reinforcement learning,” CoRR, vol. abs/1507.04296, 2015. [Online]. Available: http://arxiv.org/abs/1507.04296

[162] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu, “IMPALA: scalable distributed deep-rl with importance weighted actor-learner architectures,” CoRR, vol. abs/1802.01561, 2018. [Online]. Available: http://arxiv.org/abs/1802.01561

[163] S. Kapturowski, G. Ostrovski, W. Dabney, J. Quan, and R. Munos, “Recurrent experience replay in distributed reinforcement learning,” in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=r1lyTjAqYX

[164] J. Achiam, D. Held, A. Tamar, and P. Abbeel, “Constrained policy optimization,” CoRR, vol. abs/1705.10528, 2017. [Online]. Available: http://arxiv.org/abs/1705.10528

[165] G. Dalal, K. Dvijotham, M. Vecerik, T. Hester, C. Paduraru, and Y. Tassa, “Safe exploration in continuous action spaces,” CoRR, vol. abs/1801.08757, 2018. [Online]. Available: http://arxiv.org/abs/1801.08757

[166] W. Saunders, G. Sastry, A. Stuhlmüller, and O. Evans, “Trial without error: Towards safe reinforcement learning via human intervention,” CoRR, vol. abs/1707.05173, 2017. [Online]. Available: http://arxiv.org/abs/1707.05173

[167] B. Eysenbach, S. Gu, J. Ibarz, and S. Levine, “Leave no trace: Learning to reset for safe and autonomous reinforcement learning,” CoRR, vol. abs/1711.06782, 2017. [Online]. Available: http://arxiv.org/abs/1711.06782

[168] P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learning from human preferences,” in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017, pp. 4299–4307. [Online]. Available: http://papers.nips.cc/paper/7017-deep-reinforcement-learning-from-human-preferences.pdf

[169] N. Hansen and A. Ostermeier, “Adapting arbitrary normal mutation distributions in evolution strategies: The covariance matrix adaptation,” Jun. 1996, pp. 312–317.

[170] T. Salimans, J. Ho, X. Chen, and I. Sutskever, “Evolution strategies as a scalable alternative to reinforcement learning,” CoRR, vol. abs/1703.03864, 2017. [Online]. Available: http://arxiv.org/abs/1703.03864

[171] P. H. Jin, S. Levine, and K. Keutzer, “Regret minimization for partially observable deep reinforcement learning,” CoRR, vol. abs/1710.11424, 2017. [Online]. Available: http://arxiv.org/abs/1710.11424

[172] H. Mania, A. Guy, and B. Recht, “Simple random search provides a competitive approach to reinforcement learning,” CoRR, vol. abs/1803.07055, 2018. [Online]. Available: http://arxiv.org/abs/1803.07055

[173] R. Gibson, N. Burch, M. Lanctot, and D. Szafron, “Efficient Monte Carlo counterfactual regret minimization in games with many player actions,” vol. 3, Dec. 2012.

[174] I. Adamski, R. Adamski, T. Grel, A. Jedrych, K. Kaczmarek, and H. Michalewski, “Distributed deep reinforcement learning: Learn how to play Atari games in 21 minutes,” CoRR, vol. abs/1801.02852, 2018. [Online]. Available: http://arxiv.org/abs/1801.02852

[175] D. Hafner, J. Davidson, and V. Vanhoucke, “TensorFlow agents: Efficient batched reinforcement learning in TensorFlow,” CoRR, vol. abs/1709.02878, 2017. [Online]. Available: http://arxiv.org/abs/1709.02878

[176] F. Torabi, G. Warnell, and P. Stone, “Behavioral cloning from observation,” CoRR, vol. abs/1805.01954, 2018. [Online]. Available: http://arxiv.org/abs/1805.01954

[177] A. Tavakoli, F. Pardo, and P. Kormushev, “Action branching architectures for deep reinforcement learning,” CoRR, vol. abs/1711.08946, 2017. [Online]. Available: http://arxiv.org/abs/1711.08946

[178] L. Buesing, T. Weber, Y. Zwols, N. Heess, S. Racaniere, A. Guez, and J.-B. Lespiau, “Woulda, coulda, shoulda: Counterfactually-guided policy search,” in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=BJG0voC9YQ

[179] J. N. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson, “Counterfactual multi-agent policy gradients,” CoRR, vol. abs/1705.08926, 2017. [Online]. Available: http://arxiv.org/abs/1705.08926

[180] V. François-Lavet, Y. Bengio, D. Precup, and J. Pineau, “Combined reinforcement learning via abstract representations,” CoRR, vol. abs/1809.04506, 2018. [Online]. Available: http://arxiv.org/abs/1809.04506

[181] J. N. Foerster, Y. M. Assael, N. de Freitas, and S. Whiteson, “Learning to communicate to solve riddles with deep distributed recurrent Q-networks,” CoRR, vol. abs/1602.02672, 2016. [Online]. Available: http://arxiv.org/abs/1602.02672

[182] A. Dosovitskiy and V. Koltun, “Learning to act by predicting the future,” CoRR, vol. abs/1611.01779, 2016. [Online]. Available: http://arxiv.org/abs/1611.01779

[183] M. G. Azar and H. J. Kappen, “Dynamic policy programming,” CoRR, vol. abs/1004.2027, 2010. [Online]. Available: http://arxiv.org/abs/1004.2027

[184] T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan, J. Quan, A. Sendonaris, G. Dulac-Arnold, I. Osband, J. Agapiou, J. Z. Leibo, and A. Gruslys, “Deep Q-learning from demonstrations,” 2017.

[185] M. J. Hausknecht and P. Stone, “Deep recurrent Q-learning for partially observable MDPs,” CoRR, vol. abs/1507.06527, 2015. [Online]. Available: http://arxiv.org/abs/1507.06527

[186] H. R. Maei, C. Szepesvári, S. Bhatnagar, and R. S. Sutton, “Toward off-policy learning control with function approximation,” in Proceedings of the 27th International Conference on Machine Learning, ser. ICML’10. Madison, WI, USA: Omnipress, 2010, pp. 719–726.

[187] A. Tampuu, T. Matiisen, D. Kodelja, I. Kuzovkin, K. Korjus, J. Aru, J. Aru, and R. Vicente, “Multiagent cooperation and competition with deep reinforcement learning,” CoRR, vol. abs/1511.08779, 2015. [Online]. Available: http://arxiv.org/abs/1511.08779

[188] W. Dabney, G. Ostrovski, D. Silver, and R. Munos, “Implicit quantile networks for distributional reinforcement learning,” CoRR, vol. abs/1806.06923, 2018. [Online]. Available: http://arxiv.org/abs/1806.06923

[189] J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel, T. Lillicrap, and D. Silver, “Mastering Atari, Go, chess and shogi by planning with a learned model,” 2019.

[190] P. Bacon, J. Harb, and D. Precup, “The option-critic architecture,” CoRR, vol. abs/1609.05140, 2016. [Online]. Available: http://arxiv.org/abs/1609.05140

[191] O. Nachum, M. Norouzi, K. Xu, and D. Schuurmans, “Bridging the gap between value and policy based reinforcement learning,” CoRR, vol. abs/1702.08892, 2017. [Online]. Available: http://arxiv.org/abs/1702.08892

[192] K. Chua, R. Calandra, R. McAllister, and S. Levine, “Deep reinforcement learning in a handful of trials using probabilistic dynamics models,” CoRR, vol. abs/1805.12114, 2018. [Online]. Available: http://arxiv.org/abs/1805.12114

[193] B. O’Donoghue, R. Munos, K. Kavukcuoglu, and V. Mnih, “PGQ: combining policy gradient and Q-learning,” CoRR, vol. abs/1611.01626, 2016. [Online]. Available: http://arxiv.org/abs/1611.01626

[194] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine Learning, vol. 8, no. 3, pp. 229–256, May 1992. [Online]. Available: https://doi.org/10.1007/BF00992696

[195] G. Chen, Y. Peng, and M. Zhang, “An adaptive clipping approach for proximal policy optimization,” CoRR, vol. abs/1804.06461, 2018. [Online]. Available: http://arxiv.org/abs/1804.06461

[196] S. Gu, T. P. Lillicrap, Z. Ghahramani, R. E. Turner, and S. Levine, “Q-prop: Sample-efficient policy gradient with an off-policy critic,” CoRR, vol. abs/1611.02247, 2016. [Online]. Available: http://arxiv.org/abs/1611.02247

[197] K. Son, D. Kim, W. J. Kang, D. E. Hostallero, and Y. Yi, “QTRAN: Learning to factorize with transformation for cooperative multi-agent reinforcement learning,” 2019.

[198] A. Gruslys, M. G. Azar, M. G. Bellemare, and R. Munos, “The reactor: A sample-efficient actor-critic architecture,” CoRR, vol. abs/1704.04651, 2017. [Online]. Available: http://arxiv.org/abs/1704.04651

[199] H. Liu, Y. Feng, Y. Mao, D. Zhou, J. Peng, and Q. Liu, “Action-depedent control variates for policy optimization via Stein’s identity,” 2017.

[200] O. Nachum, M. Norouzi, K. Xu, and D. Schuurmans, “Trust-PCL: An off-policy trust region method for continuous control,” CoRR, vol. abs/1707.01891, 2017. [Online]. Available: http://arxiv.org/abs/1707.01891

[201] P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V. F. Zambaldi, M. Jaderberg, M. Lanctot, N. Sonnerat, J. Z. Leibo, K. Tuyls, and T. Graepel, “Value-decomposition networks for cooperative multi-agent learning,” CoRR, vol. abs/1706.05296, 2017. [Online]. Available: http://arxiv.org/abs/1706.05296

[202] G. Dulac-Arnold, R. Evans, P. Sunehag, and B. Coppin, “Reinforcement learning in large discrete action spaces,” CoRR, vol. abs/1512.07679, 2015. [Online]. Available: http://arxiv.org/abs/1512.07679
