Distributional Offline Continuous-Time Reinforcement Learning
with Neural Physics-Informed PDEs
This paper addresses distributional offline continuous-time reinforcement learning (DOCTR-L) with stochastic policies for high-dimensional optimal control. A soft distributional version of the classical Hamilton-Jacobi-Bellman (HJB) equation is given by a semilinear partial differential equation (PDE).
This ‘soft HJB equation’ can be learned from offline data without assuming that the latter correspond to a previous optimal or near-optimal policy.
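To make the construction concrete, here is a hedged illustration of what such a soft relaxation typically looks like; the notation below is ours and may differ from the paper's exact formulation. For controlled dynamics $dX_t = f(X_t,a_t)\,dt + \sigma\,dW_t$, an entropy-regularized (temperature $1/\beta$, prior policy $\pi_0$) version replaces the hard maximum of the classical HJB equation with a log-partition over actions, which yields a semilinear PDE:

```latex
% Classical HJB (maximization form):
\frac{\partial V}{\partial t}
  + \max_{a}\Big\{ r(x,a) + f(x,a)^{\top}\nabla_x V
  + \tfrac{1}{2}\,\mathrm{Tr}\big(\sigma\sigma^{\top}\nabla_x^2 V\big) \Big\} = 0.

% Soft (entropy-regularized) relaxation:
\frac{\partial V}{\partial t}
  + \frac{1}{\beta}\,\log \int \pi_0(a\mid x)\,
    \exp\!\Big(\beta\Big[ r(x,a) + f(x,a)^{\top}\nabla_x V
  + \tfrac{1}{2}\,\mathrm{Tr}\big(\sigma\sigma^{\top}\nabla_x^2 V\big)\Big]\Big)\, da = 0.
```

In the limit $\beta \to \infty$ the log-partition term reduces to the hard maximum, recovering the classical HJB equation.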
A data-driven solution of the soft HJB equation uses methods of Neural PDEs and Physics-Informed Neural Networks developed in the field of Scientific Machine Learning (SciML).
The suggested approach, dubbed ‘SciPhy RL’, thus reduces DOCTR-L to solving neural PDEs from data.
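As a minimal sketch of the physics-informed training idea (this is an illustrative toy, not the paper's Deep DOCTR-L algorithm): a small network is trained so that the residual of a differential equation vanishes on collocation points. The toy equation $u'(x) = -u(x)$, $u(0)=1$, the network architecture, and all names are our assumptions; parameter gradients are approximated by finite differences to keep the example dependency-free, whereas real PINN implementations use automatic differentiation.

```python
import numpy as np

# Toy physics-informed fit: train a one-hidden-layer tanh network so that
# the residual of u'(x) = -u(x) with u(0) = 1 vanishes on collocation points.
rng = np.random.default_rng(0)
H = 8  # hidden width
params = rng.normal(scale=0.5, size=3 * H + 1)  # [w1 (H), b1 (H), w2 (H), b2]

def unpack(p):
    return p[:H], p[H:2 * H], p[2 * H:3 * H], p[3 * H]

def net(p, x):
    w1, b1, w2, b2 = unpack(p)
    h = np.tanh(np.outer(x, w1) + b1)   # hidden activations, shape (n, H)
    return h @ w2 + b2                  # network output u(x), shape (n,)

def net_dx(p, x):
    # Analytic derivative du/dx of the tanh network.
    w1, b1, w2, _ = unpack(p)
    h = np.tanh(np.outer(x, w1) + b1)
    return (1.0 - h**2) @ (w1 * w2)

def loss(p, x):
    res = net_dx(p, x) + net(p, x)             # PDE residual u' + u
    bc = net(p, np.array([0.0]))[0] - 1.0      # boundary condition u(0) = 1
    return np.mean(res**2) + bc**2

x = np.linspace(0.0, 1.0, 32)                  # collocation points
eps, lr = 1e-5, 0.05
for _ in range(3000):
    # Central finite-difference gradient over all parameters (toy-scale only).
    g = np.array([(loss(params + eps * e, x) - loss(params - eps * e, x)) / (2 * eps)
                  for e in np.eye(params.size)])
    params -= lr * g

# After training, net(params, x) should roughly track the solution exp(-x).
```

The same residual-minimization principle extends to the semilinear soft HJB PDE: the value function is parameterized by a neural network, and the PDE residual evaluated on offline data points serves as a supervised-learning loss.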
Our algorithm, called Deep DOCTR-L, converts offline high-dimensional data into an optimal policy in one step by reducing the problem to supervised learning, instead of relying on value iteration or policy iteration methods.
The method enables a computable approach to the quality control of obtained policies in terms of both expected returns and uncertainties about their values.
Reinforcement learning (RL) provides a framework for data-driven, learning-based approaches to problems of optimal control.
In addition to relying on data and relaxing the dependence on a model of the environment's dynamics, RL also offers new computational methods, which become especially important for many real-life problems of high-dimensional optimal control.
For such settings, classical methods based on the Bellman equation for discrete-time problems or the Hamilton-Jacobi-Bellman (HJB) equation for continuous-time problems become computationally infeasible.
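For reference, these are the standard textbook forms being contrasted (notation is ours). The discrete-time Bellman equation characterizes the optimal value function through a fixed-point condition:

```latex
V(s) = \max_{a}\Big\{ r(s,a) + \gamma\,\mathbb{E}\big[V(s') \mid s, a\big] \Big\}
```

Solving it on a grid requires enumerating the state space, whose size grows exponentially with the state dimension; the continuous-time HJB equation faces the same curse of dimensionality when discretized by classical PDE solvers.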
Tremendous success has been achieved in recent years using RL methods for many high-dimensional optimal control problems, including, e.g., super-human performance in the game of Go.
These approaches are generally known as deep RL, and are based on a combination of methods of RL with deep neural networks to provide flexible function approximations for learning.
Online vs Offline Reinforcement Learning Methods
Most existing RL or deep RL algorithms are online methods, where an agent has access to its environment and can explore different policies.
This paper addresses offline RL (also known as batch-mode RL), where the agent can only utilize previously collected offline data and cannot engage in any additional online interaction with the environment during training.