**Reinforcement Learning** : **Accenture’s Chief Data Scientist on Deep Reinforcement Learning**

**Reinforcement Learning** : Deep reinforcement learning is one of the most promising areas at the forefront of machine learning, and it aims to bring us closer to artificial general intelligence (AGI). Many leading corporations are investing in reinforcement learning for robotics, industrial automation, the Internet of Things, and the Industrial Internet of Things, as research in the field rebounds from a long AI winter. Several enterprises, such as Fetch.ai, have recently implemented swarm intelligence and bio-inspired algorithms. As for taking AI to the next level of intelligence, AI experts have predicted that the singularity, high-level machine intelligence, or AGI will be reached in 30 to 74 years.

A total of 352 researchers responded to a research survey initiative that contacted 1,634 authors (a 21% response rate). North American respondents predicted that high-level machine intelligence/AGI will be reached in 74 years.

Asian respondents predicted that the intelligence explosion driven by high-level machine intelligence and AGI will arrive in 44 years, as opposed to 74 years for North American respondents. The most advanced reinforcement learning algorithms revolutionized AI in 2019 and are igniting the movement in 2020. Demand for data scientists with reinforcement learning skills will rise, opening advanced careers in artificial intelligence and robotics. MIT Technology Review ran a review and natural language processing analysis of the abstracts of 16,625 papers on arXiv (pronounced "archive"), a repository of electronic preprints of research papers, covering 1993 through November 2018. The analysis showed significant growth in reinforcement learning. The major trends are shifting towards machine learning, neural networks, and reinforcement learning.

Reinforcement learning

OpenAI continues to make strides in reinforcement learning in 2020 and recently adopted PyTorch as its standard framework for reinforcement learning and deep learning. Although reinforcement learning algorithms are optimized through trial and error, it’s critical to avoid errors that can lead to fatalities. Safety is therefore critical in reinforcement learning environments: when landing spacecraft on the surfaces of other planets or their moons, for example, an error must not occur. As we saw in 2019, agents from DeepMind (StarCraft, AlphaGo) and OpenAI (Dota) have shown superhuman performance, and benchmarks such as the Arcade Learning Environment and the Control Suite have driven progress. Numerical and statistical methods are applied to optimize the behavior of advanced RL systems by accumulating reward over many iterations. Just as a baby is helped through its first steps and then learns to walk on its own, deep reinforcement learning is expected to revolutionize robotics by giving robots the autonomy to walk without human intervention. Reinforcement learning is expected to produce advanced AI systems in robotics for multiple industries such as transportation, healthcare, retail, manufacturing, and aerospace.

Figure 1. *Reinforcement learning architecture and algorithms.*

Figure 2. *Open AI constraint elements in safety gym reinforcement learning environment.*

I will cover the architecture of trust region policy optimization algorithm in this interview.

Trust region policy optimization algorithm

Trust Region Policy Optimization (TRPO) is an advanced reinforcement learning algorithm designed to overcome the challenges encountered with policy gradients. Policy gradient methods are model-free algorithms in which the agent searches for the optimal behavioral strategy, the one that yields the highest rewards, by updating the policy directly. The policy gradient agent observes the continuous state of the environment (s), consults its policy, and performs an action (u). Every action leads to a new state in the observation space. To find the optimal reward, the method performs gradient ascent: it updates the policy along the direction in which the expected reward increases most steeply. However, as in any continuous-space environment, the local approximation behind the gradient is poor in areas of high curvature, so a step that is too large can produce a bad policy, which in turn leads to bad assessments and bad actions by the agent. Policy gradient methods therefore have to take small steps and learn slowly to avoid catastrophes with a poorly performing policy, and they do not scale well to large nonlinear policies. TRPO was introduced to overcome these challenges: it scales to and optimizes such policies.
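The step-size problem described above can be illustrated with a minimal sketch. It assumes a three-action softmax policy and a single REINFORCE-style update; the function names and the step sizes are hypothetical choices for illustration, not part of the recipe.

```python
import numpy as np

def softmax(logits):
    # Subtract the maximum for numerical stability
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

def policy_gradient_step(logits, action, reward, step_size):
    """One gradient-ascent update on log pi(action), scaled by the reward."""
    probs = softmax(logits)
    # Gradient of log softmax(logits)[action] w.r.t. logits is onehot(action) - probs
    grad_log_prob = -probs
    grad_log_prob[action] += 1.0
    return logits + step_size * reward * grad_log_prob

logits = np.zeros(3)  # uniform starting policy
small_step = policy_gradient_step(logits, action=2, reward=1.0, step_size=0.1)
large_step = policy_gradient_step(logits, action=2, reward=1.0, step_size=50.0)

print("after small step:", softmax(small_step))
print("after large step:", softmax(large_step))
```

With the small step the policy shifts only slightly towards the rewarded action, while the large step collapses almost all probability onto it after a single sample, which is the kind of destructive update a trust region is meant to prevent.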

Reviewing Trust Region Policy Optimization algorithm framework

An algorithm that updates the policy iteratively is most useful when it guarantees monotonic improvement, that is, when the objective never gets worse from one iteration to the next. The TRPO algorithm has proved highly effective without heavy hyperparameter tuning on large neural networks. It has been applied to several tasks in simulated environments, such as walking gaits, hopping, robotic swimming, and playing Atari games with screen images as the input to the neural network. Policy optimization can be broadly classified into three categories:

- Policy iteration methods
- Policy gradient methods
- Derivative-free optimization methods

Policy iteration methods alternate between evaluating the value function of the current policy and improving the policy, and they repeat these two steps until the policy converges. This is in contrast to value-iteration algorithms, which update the value function until it converges and only then derive the policy. Policy iteration has proved effective by comparison: in many environments it needs significantly fewer iterations to converge than value iteration, because it solves the linear equations for the value of the current policy and improves the policy consistently at each state. Policy gradient methods use an estimator of the gradient of the expected return, computed from sample trajectories.
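The evaluate-then-improve loop of policy iteration can be sketched on a tiny problem. The two-state, two-action deterministic MDP below is a made-up example, and the function name is hypothetical; the evaluation step solves the linear equations for the current policy, as described above.

```python
import numpy as np

def policy_iteration(next_state, reward, gamma=0.9):
    """next_state[s, a] is the successor state, reward[s, a] the immediate reward."""
    n_states, _ = reward.shape
    states = np.arange(n_states)
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) v = r_pi
        transition = np.zeros((n_states, n_states))
        transition[states, next_state[states, policy]] = 1.0
        values = np.linalg.solve(np.eye(n_states) - gamma * transition,
                                 reward[states, policy])
        # Policy improvement: act greedily with respect to the new values
        q_values = reward + gamma * values[next_state]
        new_policy = np.argmax(q_values, axis=1)
        if np.array_equal(new_policy, policy):
            return policy, values
        policy = new_policy

# Toy MDP: action 0 moves to (or stays in) state 0, action 1 moves to state 1
next_state = np.array([[0, 1], [0, 1]])
reward = np.array([[0.0, 1.0], [0.0, 2.0]])
policy, values = policy_iteration(next_state, reward)
print(policy, values)
```

On this toy problem the loop converges after two evaluation steps to the policy that always chooses action 1.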

A surrogate function approximates the objective function but is quicker to evaluate. A surrogate-based optimizer searches for a point that minimizes the surrogate over a large number of candidate points and takes the best one as an approximation of the minimizer of the objective function, which reduces evaluation time drastically. The surrogate is built from evaluations of the objective function, and the search tries to find the global minimum while balancing two goals, exploration and speed: the former aims to find the global minimum, while the latter aims to reach a good solution in a minimum number of objective function evaluations. The trust region policy optimization (TRPO) algorithm optimizes a surrogate of the expected return to obtain a policy improvement, and it estimates that surrogate from samples. The single-path method does this in a model-free setting. The vine method is another variant of TRPO, applied in simulation because it requires the system to be restored to particular states.
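As a minimal sketch of these ideas, the snippet below estimates an importance-sampled surrogate objective and the mean KL divergence between an old and a new categorical policy. The probabilities, advantages, and the `max_kl` threshold are made-up illustrative values, and the function name is hypothetical.

```python
import numpy as np

def surrogate_and_kl(old_probs, new_probs, actions, advantages):
    """old_probs, new_probs: (T, n_actions); actions: (T,) ints; advantages: (T,)."""
    steps = np.arange(len(actions))
    # Importance ratio pi_new(a|s) / pi_old(a|s) for the actions actually taken
    ratio = new_probs[steps, actions] / old_probs[steps, actions]
    surrogate = np.mean(ratio * advantages)
    # Mean KL(old || new) over the visited states
    kl = np.mean(np.sum(old_probs * (np.log(old_probs) - np.log(new_probs)), axis=1))
    return surrogate, kl

old_probs = np.array([[0.5, 0.3, 0.2], [0.2, 0.5, 0.3]])
new_probs = np.array([[0.6, 0.25, 0.15], [0.2, 0.5, 0.3]])
actions = np.array([0, 1])
advantages = np.array([1.0, -0.5])

surrogate, kl = surrogate_and_kl(old_probs, new_probs, actions, advantages)
max_kl = 0.01  # assumed trust-region size
print("surrogate:", surrogate, "KL:", kl, "within trust region:", kl <= max_kl)
```

The trust-region idea is to increase the surrogate only as far as the KL divergence from the old policy stays below the chosen threshold.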

The below figure shows the Sampling trajectories for Trust region policy optimization:

In this diagram, the left-hand side illustrates the single-path procedure in a model-free setting: trajectories are generated by simulating the policy in the environment, where s is a state and a is an action, and all the state-action pairs (s_{n}, a_{n}) are incorporated into the objective. The right-hand side illustrates the vine procedure: a set of trunk trajectories is sampled, and the nodes mark reached states along each trajectory. From these states, rollouts are performed for the individual actions a_{1}, a_{2}.

Creating the environment for Acrobot

Acrobot is a simulated environment with a two-link pendulum in which only the joint between the two links is actuated. Initially, both links hang downwards. The goal in this reinforcement learning landscape is to swing the end-effector up to a given height above the base. The environment is set up so that the links do not collide even when they have the same angle; they can swing freely. Mathematically, the state consists of the cosine and sine of the two rotational joint angles and the joint angular velocities: [cos(θ1) sin(θ1) cos(θ2) sin(θ2) θ̇1 θ̇2]. An angle of 0 for the first link corresponds to the link pointing downwards, and the angle of the second link is always relative to the angle of the first link. A state of [1, 0, 1, 0, …] therefore means that both links are pointing downwards. The action the TRPO agent applies in the Acrobot environment is a torque of +1, 0, or -1 on the joint between the two pendulum links.
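The state encoding described above can be sketched directly from the joint angles. The helper below is hypothetical and only mirrors the layout given in the text; it is not the environment's own code.

```python
import numpy as np

def acrobot_observation(theta1, theta2, theta1_dot=0.0, theta2_dot=0.0):
    """Build [cos(θ1), sin(θ1), cos(θ2), sin(θ2), θ̇1, θ̇2] from angles in radians."""
    return np.array([
        np.cos(theta1), np.sin(theta1),
        np.cos(theta2), np.sin(theta2),
        theta1_dot, theta2_dot,
    ])

# Both links pointing downwards: angles of 0 and zero angular velocity
hanging = acrobot_observation(0.0, 0.0)
print(hanging)

# First link rotated by half a turn, second link aligned with it
upright = acrobot_observation(np.pi, 0.0)
print(upright)
```

Encoding each angle as a cosine and sine pair keeps the observation continuous when an angle wraps around a full turn.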

Getting ready

Please follow the instructions in the introduction to get familiarized and acclimatized with the framework behind the TRPO reinforcement learning algorithm for PyTorch roll-out.

How to do it

Let’s follow the below-mentioned steps to create an environment for Acrobot.

1. Let’s start by importing the packages for the TRPO reinforcement learning algorithm in PyTorch:

import os

import numpy as np

import torch

import torch.nn as nn

import torch.nn.functional as F

from torch.autograd import Variable

2. Create a new Jupyter notebook for PyTorch called Packt_Acrobot_TRPO.ipynb:

3. Create the Acrobot environment and print the Acrobot’s observational space and action space:

import gym

from time import sleep

environment = gym.make("Acrobot-v1")

environment.reset()

observation_space_shape = environment.observation_space.shape

number_of_actions = environment.action_space.n

print("Acrobot Agent's Observation Space", environment.observation_space)

print("Acrobot Agent's Action Space", environment.action_space)

print("Acrobot Agent's Number of actions", number_of_actions)

How it works

Let’s take a look at what we did in the previous section of code. In step 1, we imported the os package, which provides an interface between the notebook and your operating system, whether Windows, macOS, or Linux. os.system() executes a shell command directly on your operating system. We also imported the NumPy Python package. NumPy is a scientific computing package that’s fundamental to large-scale numerical computation and data analytics. Ever since NumPy was introduced in 2006, it has revolutionized scientific computing in the commercial and research communities with its support for large, multidimensional arrays, built on its main object, the ndarray (N-dimensional array). The size of a NumPy array is fixed when the array is created; it cannot be expanded dynamically at runtime. NumPy supports various data types such as bool_, int_, int8, int16, int32, int64, uint8, uint16, uint32, uint64, float16, float32, float64, complex_, complex64, and complex128.

The next library, torch, is the root package of PyTorch. NumPy’s multidimensional arrays cannot run on GPUs. Tensors in PyTorch are similar to the ndarrays we discussed earlier but, in addition to their multidimensional capabilities, they can be accelerated on GPUs for large-scale computing at commercial and enterprise scale in applications such as drug discovery, chemistry, genomics, supply chain management, edge computing, retail, and healthcare. The torch.nn module is the neural network library, and it is tightly integrated with another PyTorch library, torch.autograd. NumPy does not build computation graphs or track gradients, so with NumPy alone data scientists have to implement the forward and backward passes of the neural network layers manually. The modern reinforcement learning landscape, with its large-scale data, requires GPU acceleration, which NumPy cannot provide. GPU acceleration with PyTorch has been reported to deliver speedups of around 50x over NumPy operations performed on CPUs.

The torch.nn library is designed for defining and creating neural network layers, so that networks with any number of layers can be trained at scale. The autograd library in PyTorch records tensor operations in a computational graph. The nodes of the graph are tensors, and the edges are the functions that produce output tensors from input tensors. PyTorch builds this graph dynamically as the code runs, whereas TensorFlow 1.x builds a static graph by default before any values or gradients are computed. The torch.nn.functional module provides simple operations that have no configurable or trainable parameters, such as activation functions.

In step 2, on your operating system (macOS, Linux, or Windows), you create a Jupyter notebook for writing the code in PyTorch. You can name the file according to your project nomenclature. The Jupyter notebook can be created from an Anaconda environment. We will cover the installation of Anaconda and PyTorch on macOS, Linux, and Windows in the Appendix section. The Jupyter notebook is a web-based interface for writing your PyTorch code. Jupyter notebooks have recently seen a significant spike in popularity in the research, scientific, and commercial communities for sharing code together with data visualizations and documentation in an interactive environment. The Jupyter notebook supports a variety of kernels and programming languages such as Fortran, Ansible, Babel, C++, Clojure, CoffeeScript, CSharp, Elixir, Erlang, Forth, Haskell, Hy, Java, JavaScript, Julia, Jython, Kotlin, Lua, Matlab, NodeJS, OCaml, Octave, Perl, PHP, Prolog, Python, R, Redis, Ruby, Rust, Scala, Scilab, Stata, Teradata SQL, TypeScript, and Wolfram Mathematica.

In step 3, the gym library has to be installed. Gym is a collection of environments for designing, developing, and comparing reinforcement learning algorithms. It can be installed on macOS, Linux, or Windows with the command pip install gym. Once the installation is complete, it can be imported as a library with the command import gym. The "Acrobot-v1" environment is instantiated with gym.make("Acrobot-v1") and assigned to the variable environment. The reset() function returns the environment to an initial state and returns the first observation. Every environment has observation_space and action_space attributes, which are of type Space. For Acrobot, the observation space is a Box space, which represents an n-dimensional box containing the valid observations, and the action space is a Discrete space containing the valid actions. The printed output of observation_space and action_space for the Acrobot environment is shown in the below figure:

Visualizing the Acrobot environment with a plot

To show how the Acrobot environment works, with its two-link pendulum (a two-link robotic arm) swinging in a vertical plane against gravity, the environment can be rendered from the Jupyter notebook either in a separate window or as a plot using matplotlib. The Acrobot has two degrees of freedom but only a single actuator. To overcome the challenge of this underactuated system and swing up and balance it, the controller of the Acrobot must exploit the coupling between the unactuated degree of freedom and the actuated degree of freedom.

Getting ready

Please follow the instructions in creating the Acrobot environment to get familiarized and acclimatized with the Acrobot robotic environment.

How to do it

Let’s follow the below-mentioned steps to visualize the environment for Acrobot:

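1. Let’s render the Acrobot environment into a pop-up window:

environment.reset()

for i in range(1000):

    # Take a random action; the old gym API returns (observation, reward, done, info)
    _, _, done, _ = environment.step(environment.action_space.sample())

    environment.render()

    sleep(0.03)

    if done:

        # Start a new episode once the current one has ended
        environment.reset()

environment.close()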

2. Let’s render the environment by visualizing and plotting into the Jupyter notebook with matplotlib plot.

import matplotlib.pyplot as plt

%matplotlib inline

plt.imshow(environment.render('rgb_array'))

How it works

In step 1, we render the Acrobot environment in a pop-up window for up to 1000 frames while running a random Acrobot agent. We reset the environment to its initial state so that all the variables of the Acrobot environment are reset before we render it. At each of the 1000 timesteps, the loop samples a random action from the action space, applies it with step(), and renders the resulting frame. As a result, the display shows the classic Acrobot problem. The Acrobot environment leaves the Jupyter notebook screen and runs in a separate pop-up window, which confirms that the environment is up and running. You can also run and validate other environments, such as CartPole-v0, MsPacman-v0, MountainCar-v0, and Hopper-v1, to understand the framework. Some of these environments may require additional packages such as MuJoCo or Atari games. We will visit other environments in other chapters. This recipe focuses on the Acrobot environment only, for the implementation of the TRPO algorithm.

The below figure shows the Acrobot runtime reinforcement learning environment pop-up window:

In step 2, we render the Acrobot environment with matplotlib. matplotlib.pyplot is the module that collects the functions used to create figures and plots. %matplotlib inline is a magic command that displays plots inside the Jupyter notebook instead of in a separate pop-up window. The plt object offers plotting functions such as scatter() for scatter plots, with show() displaying the figure. In this case, we use the imshow() function, which takes the environment’s frame as an array of RGB values and displays it as an image on a 2D regular raster.

The below figure shows the Matplotlib plot for the Acrobot reinforcement learning environment:

Creating a neural network for AcrobotTRPO agent

A neural network is modeled loosely on the biological structure of the brain: its neurons receive multiple inputs and pass them through hidden layers to produce an output through an activation function. We will create a four-layer fully connected deep neural network, with an input layer, two hidden layers, and an output layer, by creating a class that subclasses the torch neural network module nn.Module, as shown in the below figure:

Getting ready

Please follow the instructions in creating the neural network for AcrobotTRPO agent section to get familiarized and acclimatized with the neural network framework. Creating the neural network is the core of the TRPO agent.

How to do it

1. We will create a neural network with the torch.nn package here.

class AcrobotTRPOAgent(nn.Module):

    def __init__(self, continuous_state_space, number_of_actions, hidden_layer_size=32):

        nn.Module.__init__(self)

        self.fullyconnectednn1 = nn.Linear(continuous_state_space[0], 128)

        self.fullyconnectednn2 = nn.Linear(128, hidden_layer_size)

        self.fullyconnectednn3 = nn.Linear(hidden_layer_size, number_of_actions)

2. We will compute the output for the neural network with the forward method with the inputs.

    def forward(self, states):

        x = F.relu(self.fullyconnectednn1(states))

        x = F.relu(self.fullyconnectednn2(x))

        # dim=-1 normalizes over the action dimension
        log_probabilities = F.log_softmax(self.fullyconnectednn3(x), dim=-1)

        return log_probabilities

3. Next, we will get log probabilities for training the agent.

    def get_log_probs(self, states):

        return self.forward(states)

4. Get probabilities for interacting with the Acrobot environment.

    def get_probs(self, states):

        return torch.exp(self.forward(states))

5. Now define the action method to get the actions for a given state based on the current policy.

    def act(self, obs, sample=True):

        probs = self.get_probs(Variable(torch.FloatTensor([obs]))).data.numpy()

        if sample:

            # Sample an action from the policy distribution
            action = int(np.random.choice(number_of_actions, p=probs[0]))

        else:

            # Pick the most probable action
            action = int(np.argmax(probs))

        return action, probs[0]

TRPOagent = AcrobotTRPOAgent(observation_space_shape, number_of_actions)

How it works

Let’s take a look at what we did in the previous code recipe. In step 1, we created a fully connected neural network with four layers. We pass the state of the Acrobot environment as the input to the neural network. This input is processed by two fully connected hidden layers with 128 and 32 nodes, each node using a ReLU activation function. The output layer has one node per action in the action space of the Acrobot environment. The first fully connected layer takes the state as its input and connects it to the 128 nodes of the first hidden layer (you can vary the number of nodes from 100 to 500 in your example to compare neural network performance). The second layer connects the 128 nodes to the hidden layer of size 32, and the third layer connects that last hidden layer to the output layer, whose size is the number of actions.

In step 2, the forward method takes states as its primary argument and passes that input through the first linear layer and then through the ReLU function, a rectifier that serves as the activation function of each neuron. A unit that applies this activation is dubbed a ReLU (rectified linear unit). On the output layer, we apply log-softmax to compute the log probabilities of the actions. The softmax function (also known as a smooth approximation of arg max) takes a vector of unnormalized scores and converts it into a probability distribution. We feed states into our first fully connected layer (self.fullyconnectednn1(states)) and then apply the ReLU function with F.relu(). We repeat this with x as the primary argument for every layer except the last, where we return the log probabilities obtained by normalizing the unnormalized vector into a probability distribution.
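To make the normalization concrete, here is a minimal NumPy sketch of log-softmax, independent of the PyTorch implementation used in the recipe. The function name and the example scores are illustrative assumptions.

```python
import numpy as np

def log_softmax(logits, axis=-1):
    # Subtracting the maximum keeps exp() from overflowing on large scores
    shifted = logits - np.max(logits, axis=axis, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=axis, keepdims=True))

logits = np.array([[2.0, 1.0, 0.1]])  # unnormalized scores for three actions
log_probs = log_softmax(logits)
probs = np.exp(log_probs)

print("log probabilities:", log_probs)
print("probabilities:", probs, "sum:", probs.sum(axis=-1))
```

Exponentiating the log probabilities recovers a distribution that sums to 1 along the action axis, which is the same relationship get_probs() relies on.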

In step 3, we construct the method get_log_probs(), which passes the states of the environment through the network and returns the log probabilities of the actions. These are used for training.

In step 4, we construct the method get_probs(), which passes the states of the environment through the network and returns the probability distributions over the actions, used when interacting with the environment.

In step 5, the act method takes an observation vector and returns an action together with the action probabilities. When sample is True, the action is sampled from the policy distribution. When sample is False, the method returns the most probable action.
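The two selection modes can be sketched without a neural network, assuming the action probabilities are already available. The function name, the probabilities, and the seed are hypothetical.

```python
import numpy as np

def choose_action(probs, sample=True, rng=None):
    """Sample from the distribution, or return the most probable action."""
    rng = np.random.default_rng() if rng is None else rng
    if sample:
        return int(rng.choice(len(probs), p=probs))
    return int(np.argmax(probs))

probs = np.array([0.1, 0.7, 0.2])
rng = np.random.default_rng(0)

greedy_action = choose_action(probs, sample=False)
sampled_actions = [choose_action(probs, sample=True, rng=rng) for _ in range(2000)]
frequency_of_action_1 = sampled_actions.count(1) / len(sampled_actions)

print("greedy action:", greedy_action)
print("empirical frequency of action 1:", frequency_of_action_1)
```

Sampling keeps the agent exploring in proportion to the policy, while the greedy choice is useful for evaluating what the policy currently prefers.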

Determining log probabilities for training the TRPO Agent

While creating the neural network, we supplied states as the input data and then applied the ReLU activation function to each of the hidden layers with the F.relu() function. We replaced the value of x at each pass, passing the computed data to the next layer, and then returned the log probability distribution.

We then instantiated the agent with the observation space shape and the number of actions of the Acrobot environment and assigned it to the variable TRPOagent.

Getting ready

Please follow the instructions below to print out the sample probabilities.

How to do it

1. Let’s return the log probabilities from TRPOagent and print out the results from Thompson sampling and Epsilon-greedy strategies.

log_probabilities = TRPOagent.get_log_probs(Variable(torch.FloatTensor([environment.reset()])))

assert isinstance(log_probabilities, Variable) and log_probabilities.requires_grad

assert len(log_probabilities.shape) == 2 and log_probabilities.shape[0] == 1 and log_probabilities.shape[1] == number_of_actions

sums = torch.sum(torch.exp(log_probabilities), dim=1)

assert (0.999 < sums).all() and (1.001 > sums).all()

print("Thompson Sample:", [TRPOagent.act(environment.reset()) for _ in range(5)])

print("Epsilon-Greedy:", [TRPOagent.act(environment.reset(), sample=False) for _ in range(5)])

How it works

In step 1, we return the action samples from the TRPOagent model and the probabilities associated with each action. We obtain the log probabilities from TRPOagent and assert that they have the expected shape and that the probabilities sum to 1. We then print the actions chosen under the label Thompson sample, where actions are sampled from the policy distribution, and under the label Epsilon-greedy, where sample=False selects the most probable action, to determine whether the log probabilities meet all the requirements of the Acrobot environment for the actions sampled, as shown below:

Getting and setting parameters

In this recipe, we will flatten the model parameters leveraging the torch.cat() function.

Getting ready

Please follow the instructions below to flatten out the model parameters.

How to do it

1. Let’s flatten out the neural network model leveraging torch.cat() function.

def get_flat_params_from(model):

    params = []

    for param in model.parameters():

        # Flatten each parameter tensor to one dimension
        params.append(param.data.view(-1))

    flat_params = torch.cat(params)

    return flat_params

def set_flat_params_to(model, flat_params):

    prev_ind = 0

    for param in model.parameters():

        flat_size = int(np.prod(list(param.size())))

        # Copy the matching slice back in the parameter's original shape
        param.data.copy_(

            flat_params[prev_ind:prev_ind + flat_size].view(param.size()))

        prev_ind += flat_size

How it works

In step 1, we use the torch.cat() function, which concatenates a sequence of tensors along a given dimension. Each parameter tensor is first flattened to one dimension with view(-1), so the parameters can be joined into a single vector regardless of their original shapes. The get_flat_params_from() method returns this flat vector, which we can set back later in the update step. The set_flat_params_to() method takes the model and a flat_params vector, such as the one returned from get_flat_params_from(), and copies each slice back into the corresponding parameter in its original shape.
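The flatten-and-restore round trip can be sketched with plain NumPy arrays standing in for model parameters. The helper names and the example arrays are hypothetical.

```python
import numpy as np

def flatten_params(params):
    # Reshape every array to one dimension and join them end to end
    return np.concatenate([p.reshape(-1) for p in params])

def unflatten_params(flat, shapes):
    restored, start = [], 0
    for shape in shapes:
        size = int(np.prod(shape))
        restored.append(flat[start:start + size].reshape(shape))
        start += size
    return restored

params = [np.arange(6.0).reshape(2, 3), np.array([10.0, 20.0])]  # a "weight" and a "bias"
flat = flatten_params(params)
restored = unflatten_params(flat, [p.shape for p in params])

print(flat)
print([r.shape for r in restored])
```

Working on one flat vector is convenient for optimizers such as TRPO that compute a single update direction for all parameters at once.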

Computing and printing sample cumulative returns

In this recipe, we will compute the cumulative rewards of the agent. The ultimate goal of the TRPOagent is to maximize the cumulative rewards the agent receives in the lifetime of the environment. We will compute and print out the cumulative returns of the agent.

Getting ready

Please follow the instructions to compute and print the sample cumulative returns at discount rate.

How to do it

1. Let’s import the signal toolbox from the scipy package and define the get_cummulative_returns() function, which returns the cumulative rewards.

import scipy.signal

def get_cummulative_returns(r, gamma=1):

    r = np.array(r)

    assert r.ndim >= 1

    # Filtering the reversed rewards computes G[t] = r[t] + gamma * G[t+1]
    return scipy.signal.lfilter([1], [1, -gamma], r[::-1], axis=0)[::-1]

2. Let’s print out the cumulative rewards received by the agent.

get_cummulative_returns([0,0,1,0,0,1],gamma=0.9)

How it works

In step 1, we import the signal toolbox from the scipy package to calculate the cumulative returns of the agent.

We can write the sequence of rewards received after time step st as CR_{st+1}, CR_{st+2}, CR_{st+3}, … The expected return is denoted GCR_{st}.

The sum of the rewards can be represented by the following equation:

GCR_{st} = CR_{st+1} + CR_{st+2} + CR_{st+3} + … + CR_{ST}

ST in the above equation is the final time step. The interaction of the agent can be broken into multiple episodes, each of which ends at a final time step with the reward CR_{ST}. When the agent runs through an episode, it finally reaches a special state dubbed the terminal state. The agent runs through repeated episodes with different rewards and different outcomes. Every time the agent reaches the terminal state in an episode, it begins a new episode with a reset of the previous episodic parameters. To take discounting into account, the agent tries to maximize the sum of the discounted rewards over a long arc of time, which is why we compute the cumulative discounted rewards. The sum of the discounted rewards can be represented as:

GCR_{st} = CR_{st+1} + γ CR_{st+2} + γ^{2} CR_{st+3} + … = Σ_{k=0}^{∞} γ^{k} CR_{st+k+1}

γ (gamma) is a discount rate parameter, 0 ≤ γ ≤ 1.

The discount rate γ determines the present value of expected future rewards. A reward received k time steps in the future is worth only γ^{k-1} times what it would be worth if the agent received it immediately.

We will print the cumulative returns with discounted rates for the agent as follows:
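The recursion behind these discounted returns can also be sketched as an explicit backward loop, which is easier to check by hand than the filter-based version. The function name is hypothetical; the rewards and discount rate are the ones used in step 2.

```python
import numpy as np

def discounted_returns(rewards, gamma=1.0):
    """Compute G[t] = rewards[t] + gamma * G[t+1], working backwards from the end."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

returns = discounted_returns([0, 0, 1, 0, 0, 1], gamma=0.9)
print(returns)  # approximately [1.40049, 1.5561, 1.729, 0.81, 0.9, 1.0]
```

The last entry is just the final reward, and each earlier entry adds its own reward to 0.9 times the entry after it.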

Generating policy rollout for training the TRPO agent

In this recipe, we will produce a policy for the rollout for training the TRPOagent. We will be updating the policy based on the paths generated from the rollout.

Getting ready

Please follow the instructions below to generate rollout.

How to do it

1. Let’s create a process_rollout method to generate the paths for observations, policy, actions, rewards, and cumulative returns with the discount rate.

def process_rollout(environment, TRPOagent, max_pathlength=2500, n_timesteps=50000):

    paths = []

    total_timesteps = 0

    while total_timesteps < n_timesteps:

        rollout_observations, agent_actions, rewards, action_probs = [], [], [], []

        rollout_observation = environment.reset()

        for _ in range(max_pathlength):

            action, policy = TRPOagent.act(rollout_observation)

            rollout_observations.append(rollout_observation)

            agent_actions.append(action)

            action_probs.append(policy)

            rollout_observation, reward, done, _ = environment.step(action)

            rewards.append(reward)

            total_timesteps += 1

            if done or total_timesteps == n_timesteps:

                path = {"Observations": np.array(rollout_observations),

                        "Policy": np.array(action_probs),

                        "Actions": np.array(agent_actions),

                        "Rewards": np.array(rewards),

                        "Cumulative_returns": get_cummulative_returns(rewards),

                        }

                paths.append(path)

                break

    return paths

How it works

The process_rollout() method accepts four arguments as input:

- environment: The instance of the Acrobot environment that we created earlier
- TRPOagent: The agent, which selects an action for a given observation according to the current policy
- max_pathlength: The maximum number of steps in a single path
- n_timesteps: The total number of time steps to collect in the environment

The method returns the generated paths.

In step 1, we create the process_rollout() method with the environment, the agent, the maximum path length, and the number of time steps as parameters. The maximum path length defaults to 2,500 and the number of time steps to 50,000. The number of time steps is the total number of environment steps we take across all rollouts. For each observation, the agent’s act() method returns the chosen action and the policy, that is, the action probabilities. We step the environment with each action and record the result. The method is created for training the agent with the rollouts.

We loop until the total number of time steps reaches n_timesteps, running each rollout for at most max_pathlength steps. We collect the observations, actions, action probabilities, and rewards. When an episode ends or the time step budget is exhausted, we store them as a path together with the cumulative returns.
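Each path stores the cumulative discounted returns next to the raw rewards. As a reminder of what that entry holds, here is a minimal NumPy sketch. The function name discounted_returns and the default gamma=0.99 are assumptions made for illustration; the article's own get_cummulative_returns() may differ in signature and discount rate.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every t.

    Hypothetical helper for illustration; gamma is an assumed discount rate.
    """
    returns = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    # Walk backwards so that each return reuses the return of the next step.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Three steps with a reward of -1 each and a discount rate of 0.5.
example = discounted_returns([-1.0, -1.0, -1.0], gamma=0.5)
print(example)  # values: -1.75, -1.5, -1.0
```

The backward pass keeps the computation linear in the path length, which matters when paths run to thousands of steps.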

Setting the paths for the policy rollout

In this recipe, we will set the paths for the policy rollout, which prints the sample data for verification.

Getting ready

Please follow the instructions that will print out the generated paths.

How to do it

- Let’s create the paths for the generated rollout.

paths = process_rollout(environment, TRPOagent, max_pathlength=5, n_timesteps=100)
print(paths[-1])

assert (paths[0]['Policy'].shape == (5, number_of_actions))
assert (paths[0]['Cumulative_returns'].shape == (5,))
assert (paths[0]['Rewards'].shape == (5,))
assert (paths[0]['Observations'].shape == (5,) + observation_space_shape)
assert (paths[0]['Actions'].shape == (5,))

print('No errors detected')

How it works

In step 1, we call process_rollout() with the environment, the agent, a maximum path length of 5, and 100 time steps. We print the last generated path to inspect the observations, policy, actions, rewards, and cumulative returns, and we assert that each array in the first path has the expected shape.

Defining auxiliary functions for training the TRPO agent

In this recipe, we will create an auxiliary function to compute the surrogate loss and return the importance-sampled policy gradient.

Getting ready

Please follow the instructions to compute the surrogate loss.

How to do it

- Let’s create get_loss() function to compute the surrogate loss or the importance-sampled policy gradient.

def get_loss(TRPOagent, agent_observations, agent_actions, cummulative_returns, old_probs):
    batch_size = agent_observations.shape[0]
    log_probs_all = TRPOagent.get_log_probs(agent_observations)
    probs_all = torch.exp(log_probs_all)
    probs_for_actions = probs_all[torch.arange(0, batch_size, out=torch.LongTensor()), agent_actions]
    old_probs_for_actions = old_probs[torch.arange(0, batch_size, out=torch.LongTensor()), agent_actions]
    Loss = torch.mean(((probs_for_actions / old_probs_for_actions) * cummulative_returns), dim=0, keepdim=True)
    Loss = -Loss
    assert Loss.shape == torch.Size([1])
    return Loss

How it works

In step 1, we define the surrogate objective, which TRPO maximizes subject to a constraint on the size of the policy update.

The surrogate reward can be denoted R_{surr}, and the cumulative discounted returns J_{surr}. π_{θ_{old}} denotes the TRPO policy with the parameters before the update, and π_{θ} the policy with the parameter vector after the update. The constrained optimization problem can be solved approximately with conjugate gradients. Here, we take the batch size from the agent observations, obtain the log probabilities of all actions from the get_log_probs() function, and exponentiate them to get probabilities. We then select the probabilities of the actions the agent actually took, under both the current and the old policy. The surrogate loss is the negated mean of the ratio of the new to the old action probabilities, multiplied by the cumulative discounted returns. The get_loss() function returns this importance-sampled policy gradient loss as a tensor with a single element.
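The computation in get_loss() can be written compactly as follows. This is a sketch that mirrors the code rather than a reproduction of the article's original figure. The symbols are chosen here for illustration: N stands for batch_size, (s_i, a_i) for the i-th observation and action in the batch, G_i for the i-th entry of cummulative_returns, and δ for the bound on the KL divergence.

```latex
\text{Loss}(\theta) = -\frac{1}{N} \sum_{i=1}^{N}
  \frac{\pi_{\theta}(a_i \mid s_i)}{\pi_{\theta_{old}}(a_i \mid s_i)} \, G_i
\qquad \text{minimized subject to} \qquad
D_{KL}\left(\pi_{\theta_{old}}, \pi_{\theta}\right) \le \delta
```

Minimizing the negated mean is the same as maximizing the importance-weighted returns, which is why get_loss() flips the sign before returning.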

Defining Kullback–Leibler divergence method for neural network policy

The Kullback-Leibler divergence is a concept from probability theory and information theory. Also called relative entropy, it measures how much one probability distribution differs from another. In reinforcement learning, we use it to estimate the average divergence between the old and the new policy.

Getting ready

Please follow the instructions in this recipe to compute the Kullback-Leibler divergence.

How to do it

- Let’s create get_kl() function to compute the Kullback-Leibler divergence.

def get_kl(TRPOagent, agent_observations, agent_actions, cummulative_returns, old_probs):
    batch_size = agent_observations.shape[0]
    log_probs_all = TRPOagent.get_log_probs(agent_observations)
    probs_all = torch.exp(log_probs_all)
    old_log_probs = torch.log(old_probs + 1e-10)
    kl = torch.mean(torch.sum(old_probs * (old_log_probs - log_probs_all), 1), dim=0, keepdim=True)
    #print(kl.shape)
    assert kl.shape == torch.Size([1])
    assert (kl > -0.0001).all() and (kl < 10000).all()
    return kl

How it works

In step 1, we compute the Kullback-Leibler divergence. We pass the agent, the batch of observations, the batch of actions, the cumulative discounted returns, and the action probabilities of the old network policy. Let p(k) and q(k) represent the probability distributions of the old network policy and the current network policy. The Kullback-Leibler divergence of q(k) from p(k) measures the information lost when q(k) is used to approximate p(k). We can represent it as D_{KL} (p(k), q(k)), the Kullback-Leibler divergence for the discrete random variable k. Both p(k) and q(k) must each sum to 1, with p(k) > 0 and q(k) > 0 for k in K.

Let’s represent D_{KL} in the following equation:

$$D_{KL}(p(k), q(k)) = \sum_{k \in K} p(k) \log \frac{p(k)}{q(k)}$$

In this scenario, p(k) represents the actual distribution of the old network policy over all actions for each observation, not only the actions performed by the agent. We approximate it with q(k), the distribution of the current network model. K is the set of all values that the random variable k can take.

For continuous distributions, the KL divergence can be represented as follows:

$$D_{KL}(p(x), q(x)) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx$$

We assign the number of observations in the batch to the batch_size variable and the log probabilities obtained from the get_log_probs() method to log_probs_all. We take the logarithm of the old probabilities of the network, old_probs, after adding a small constant to avoid taking the logarithm of zero. We then compute the Kullback-Leibler divergence for each observation with the discrete formula above, average over the batch, and return the estimate of the average KL divergence as a tensor with a single element.
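To make the discrete formula concrete, the following NumPy sketch computes the batch-averaged KL divergence for two small hand-written policies. The function name categorical_kl and the example probabilities are made up for illustration; it mirrors the arithmetic of get_kl() but is not the article's implementation.

```python
import numpy as np

def categorical_kl(p, q, eps=1e-10):
    """Average KL divergence of q from p over a batch of categorical distributions.

    Each row of p and q is one distribution over the same set of actions.
    eps guards against log(0), in the same spirit as get_kl().
    """
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    per_observation = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)
    return float(per_observation.mean())

# Hypothetical old and new policies for two observations and two actions.
old_policy = np.array([[0.5, 0.5], [0.9, 0.1]])
new_policy = np.array([[0.5, 0.5], [0.6, 0.4]])

kl_same = categorical_kl(old_policy, old_policy)  # identical policies: zero
kl_diff = categorical_kl(old_policy, new_policy)  # only the second row differs
print(kl_same, kl_diff)
```

The divergence is zero only when the two policies agree on every observation, and it grows as the new policy moves away from the old one.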

Defining the entropy method for the neural network policy

In the previous recipes, we computed the log probabilities not only of the actions the agent took, but of all the available actions, conditioned on the state of the environment, p(a | s), where a stands for the action and s stands for the state of the environment. We can represent the possible actions from 1 through n with the probabilities (p_{1}, p_{2}, p_{3}, …, p_{n}). The agent takes a discrete action out of the actions available in the environment, so a categorical probability distribution is used for the action chosen by the agent. For a continuous control agent, the policy would instead be a Gaussian distribution described by its mean and standard deviation. The actions rolled out by the agent are random because they are sampled from the policy. This randomness can be measured as the entropy of the probability distribution, a concept that has its origins in information theory and physics. The more random the TRPO policy, the less predictable the set of actions the agent will take. We compute the entropy of the policy in this recipe.

Getting ready

Please follow the instructions in this recipe to compute the entropy of the environment.

How to do it

Let’s create get_entropy() method to derive the entropy in the environment.

def get_entropy(TRPOagent, agent_observations):
    agent_observations = Variable(torch.FloatTensor(agent_observations))
    batch_size = agent_observations.shape[0]
    log_probs_all = TRPOagent.get_log_probs(agent_observations)
    probs_all = torch.exp(log_probs_all)
    entropy = torch.sum(-probs_all * log_probs_all) / batch_size
    entropy = entropy.unsqueeze(0)
    assert entropy.shape == torch.Size([1])
    return entropy

How it works

In step 1, we compute the entropy of the policy. The entropy of a discrete probability distribution can be represented with the following equation:

$$H(p) = -\sum_{i=1}^{n} p_{i} \log p_{i}$$

In reinforcement learning, the algorithm optimizes the sum of the cumulative discounted rewards. As the policy improves, it typically becomes more deterministic, which reduces the randomness of its actions. In the above equation, the entropy H reaches zero if exactly one of the possible actions has a probability of one and all other actions have a probability of zero. The entropy is maximized when all actions are equally likely.

We convert the agent observations to a tensor in the agent_observations variable, assign the batch size, derive the log probabilities for all of the agent observations, and exponentiate them into the probs_all variable. We then sum the entropy over all observations, divide by the batch size, and return the average entropy as a tensor with a single element.
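The two extremes of the entropy equation are easy to check numerically. The NumPy sketch below uses a hypothetical helper, categorical_entropy, and made-up probabilities over three actions; it is an illustration of the formula, not the article's get_entropy().

```python
import numpy as np

def categorical_entropy(probs, eps=1e-10):
    """Average entropy over a batch of categorical distributions (one per row).

    eps is an assumed small constant that avoids log(0) for zero probabilities.
    """
    probs = np.asarray(probs, dtype=np.float64)
    per_observation = -np.sum(probs * np.log(probs + eps), axis=1)
    return float(per_observation.mean())

# A uniform policy over three actions is as random as it can be.
uniform = categorical_entropy([[1 / 3, 1 / 3, 1 / 3]])
# A policy that always picks the first action has no randomness left.
deterministic = categorical_entropy([[1.0, 0.0, 0.0]])

print(uniform, np.log(3))  # the uniform entropy is close to log(3)
print(deterministic)       # close to zero
```

Watching this value fall during training is a quick way to see the policy committing to particular actions.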

Defining the line search method for determining the optimal parameters

In TRPO, a backtracking line search is used to find the best parameters of the neural network along the direction of the full step, constrained by the Kullback-Leibler divergence. The line search has to be handled carefully: the full step is derived from a quadratic approximation of the KL divergence, so taking it blindly can violate the constraint or degrade performance. Hence, a step is accepted only if it improves the surrogate while satisfying the Kullback-Leibler divergence constraint.

This function returns the parameter vector given by the line search.

Getting ready

Please follow the instructions in this recipe to compute the loss and determine the optimal parameters for the neural network.

How to do it

- Let’s compute the loss in the neural network with a line search.

def linesearch(f, x, fullstep, max_kl):
    max_backtracks = 10
    loss, _ = f(x)
    # Try step fractions 1, 1/2, 1/4, ... of the full step
    for stepfrac in .5**np.arange(max_backtracks):
        xnew = x + float(stepfrac) * fullstep
        new_loss, kl = f(xnew)
        actual_improve = new_loss - loss
        # Accept only steps that respect the KL constraint and lower the loss
        if kl.data.numpy() <= max_kl and actual_improve.data.numpy() < 0:
            x = xnew
            loss = new_loss
    return x

How it works

Your linesearch() method should accept four arguments as input:

- f: A function that returns the loss and the Kullback-Leibler divergence of the neural network policy for a given parameter vector
- x: The old parameters of the neural network policy
- fullstep: The direction, scaled to the full step, in which we perform the search
- max_kl: The constraint on the Kullback-Leibler divergence

The method should return x, the best new parameters found for the neural network policy.

In step 1, we backtrack along the search direction and look for an improvement in the surrogate. For each step fraction, we compute the proposed parameters and then evaluate the loss and the KL divergence of the proposed policy on the batch of trajectories. We accept the proposed parameters only when the KL divergence stays within max_kl and the loss decreases.
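The accept-or-shrink logic can be tried on a toy problem without a neural network. The sketch below is a simplified NumPy variant, not the recipe's linesearch(): it stops at the first accepted step, and the quadratic loss and the quadratic stand-in for the KL divergence are assumptions made purely for illustration.

```python
import numpy as np

def backtracking_linesearch(f, x, fullstep, max_kl, max_backtracks=10):
    """Try step fractions 1, 1/2, 1/4, ... and return the first acceptable point.

    f(x) must return a (loss, kl) pair of plain floats.
    """
    loss, _ = f(x)
    for stepfrac in 0.5 ** np.arange(max_backtracks):
        xnew = x + stepfrac * fullstep
        new_loss, kl = f(xnew)
        if kl <= max_kl and new_loss < loss:
            return xnew
    return x  # no acceptable step found: keep the old parameters

x0 = np.zeros(2)                # old parameters
target = np.array([1.0, 1.0])   # minimizer of the toy loss

def toy_loss_and_kl(x):
    loss = float(np.sum((x - target) ** 2))
    # Stand-in for the KL divergence: grows with the distance from x0.
    kl = float(0.5 * np.sum((x - x0) ** 2))
    return loss, kl

x_new = backtracking_linesearch(toy_loss_and_kl, x0, fullstep=target - x0, max_kl=0.1)
print(x_new)  # the full and half steps violate max_kl; the quarter step is accepted
```

The full step would reach the minimizer of the loss, but it moves too far from the old parameters, so the search settles for a quarter of it.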

Defining the conjugate gradients method to solve linear equation

The conjugate gradient method is used to minimize a quadratic function or, equivalently, to solve a system of linear equations.

It is an iterative method for solving large-scale systems of linear equations, and on such systems it typically converges in far fewer iterations than plain gradient descent. In reinforcement learning, TRPO applies it to compute the update direction. The system can be represented as Ax = b. In this equation, b is a known vector and x is an unknown vector. A is a known symmetric, positive-definite Hessian matrix.

Getting ready

Please follow the instructions in this recipe to solve the linear equation.

How to do it

- Let’s solve the linear equation Ax = b with the conjugate gradient method.

from numpy.linalg import inv  # used later to validate the result

def conjugate_gradient(f_Ax, b, cg_iters=10, residual_tol=1e-10):
    p = b.clone()                     # search direction
    r = b.clone()                     # residual b - Ax, with x starting at zero
    x = torch.zeros(b.size())
    rdotr = torch.sum(r*r)
    for i in range(cg_iters):
        z = f_Ax(p)
        v = rdotr / (torch.sum(p*z) + 1e-8)   # step length along p
        x += v * p
        r -= v * z
        newrdotr = torch.sum(r*r)
        mu = newrdotr / (rdotr + 1e-8)
        p = r + mu * p                # next conjugate direction
        rdotr = newrdotr
        if rdotr < residual_tol:
            break
    return x

How it works

Your conjugate_gradient() method should accept four arguments as input:

- f_Ax: A function that returns the product Ax of the symmetric positive-definite Hessian matrix A with a given vector x, so that A never needs to be formed explicitly
- b: A known vector
- cg_iters: The maximum number of iterations the conjugate gradient method will go through
- residual_tol: The residual tolerance; the iteration stops early once the squared norm of the residual falls below it

The method should solve the linear equation Ax = b. In step 1, we define the conjugate gradient iterations, passing in f_Ax for the symmetric positive-definite Hessian matrix and setting the residual tolerance. We loop through the conjugate gradient method for at most the defined number of iterations and solve the linear equation Ax = b.

Validating conjugate gradients

In this recipe, we will validate the conjugate gradients to solve the linear equation Ax = b.

Getting ready

Please follow the instructions in this recipe to validate the linear equation for the conjugate gradients.

How to do it

Let’s validate the conjugate gradients.

A = np.random.rand(8, 8)
A = np.matmul(np.transpose(A), A)

def f_Ax(x):
    return torch.matmul(torch.FloatTensor(A), x.view((-1, 1))).view(-1)

b = np.random.rand(8)

w = np.matmul(np.matmul(inv(np.matmul(np.transpose(A), A)), np.transpose(A)), b.reshape((-1, 1))).reshape(-1)
print(w)
print(conjugate_gradient(f_Ax, torch.FloatTensor(b)).numpy())

How it works

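We print the solution w obtained by direct matrix inversion, followed by the solution returned by conjugate_gradient(). The two vectors should agree closely, which validates the conjugate gradient implementation.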
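The same check can be turned into an assertion instead of a visual comparison. The sketch below is a NumPy-only variant written for illustration, not the recipe's PyTorch function. It adds a multiple of the identity to the test matrix, an assumption made here to keep the system well conditioned so that ten iterations are enough for the tolerance used.

```python
import numpy as np

def conjugate_gradient_np(matvec, b, cg_iters=10, residual_tol=1e-10):
    """Solve A x = b for symmetric positive-definite A, given only v -> A v."""
    x = np.zeros_like(b)
    r = b.copy()   # residual b - A x, with x starting at zero
    p = b.copy()   # current search direction
    rdotr = r @ r
    for _ in range(cg_iters):
        z = matvec(p)
        alpha = rdotr / (p @ z)
        x += alpha * p
        r -= alpha * z
        new_rdotr = r @ r
        if new_rdotr < residual_tol:
            break
        p = r + (new_rdotr / rdotr) * p
        rdotr = new_rdotr
    return x

rng = np.random.default_rng(0)
M = rng.random((8, 8))
A = M.T @ M + 8.0 * np.eye(8)   # symmetric positive-definite, well conditioned
b = rng.random(8)

x_cg = conjugate_gradient_np(lambda v: A @ v, b)
x_direct = np.linalg.solve(A, b)
print(np.max(np.abs(x_cg - x_direct)))  # a very small number
```

Passing the matrix-vector product as a function mirrors f_Ax in the recipe: the solver never needs the matrix itself, only its action on a vector.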

Constructing the update step method

In this recipe, we will construct the update step to prepare TRPO for training. We pass in the TRPO agent, the batch of observations, the batch of actions, the cumulative returns gathered, the old probabilities, and the maximum KL divergence allowed. Here, we adjust the policy with a policy update step on the sampled trajectories of the Acrobot agent in the reinforcement learning environment. The policy update uses entire trajectories: we do not update the policy at each time step in the environment, because large policy changes at every time step can destabilize learning and degrade sample efficiency.

Getting ready

Please follow the instructions in this recipe to construct the update step method for TRPO.

How to do it

Let’s start constructing the update step method for TRPO.

def update_step(TRPOagent, agent_observations, agent_actions, cummulative_returns, old_probs,
                max_kl):
    # Convert the batch to tensors
    agent_observations = Variable(torch.FloatTensor(agent_observations))
    agent_actions = torch.LongTensor(agent_actions)
    cummulative_returns = Variable(torch.FloatTensor(cummulative_returns))
    old_probs = Variable(torch.FloatTensor(old_probs))

    # Gradient of the surrogate loss with respect to the policy parameters
    loss = get_loss(TRPOagent, agent_observations, agent_actions, cummulative_returns, old_probs)
    grads = torch.autograd.grad(loss, TRPOagent.parameters())
    loss_grad = torch.cat([grad.view(-1) for grad in grads]).data

    def Fvp(v):
        # Product of the Hessian of the KL divergence with the vector v
        kl = get_kl(TRPOagent, agent_observations, agent_actions, cummulative_returns, old_probs)
        grads = torch.autograd.grad(kl, TRPOagent.parameters(), create_graph=True)
        flat_grad_kl = torch.cat([grad.view(-1) for grad in grads])
        kl_v = (flat_grad_kl * Variable(v)).sum()
        grads = torch.autograd.grad(kl_v, TRPOagent.parameters())
        flat_grad_grad_kl = torch.cat([grad.contiguous().view(-1) for grad in grads]).data
        return flat_grad_grad_kl + v * 0.1

    # Solve for the step direction with conjugate gradients
    stepdir = conjugate_gradient(Fvp, -loss_grad, 10)

    # Scale the direction to the full step allowed by max_kl
    shs = 0.5 * (stepdir * Fvp(stepdir)).sum(0, keepdim=True)
    lm = torch.sqrt(shs / max_kl)
    fullstep = stepdir / lm[0]
    neggdotstepdir = (-loss_grad * stepdir).sum(0, keepdim=True)

    # Starting point for the line search
    prev_params = get_flat_params_from(TRPOagent)

    def get_loss_kl(params):
        set_flat_params_to(TRPOagent, params)
        return [get_loss(TRPOagent, agent_observations, agent_actions, cummulative_returns, old_probs),
                get_kl(TRPOagent, agent_observations, agent_actions, cummulative_returns, old_probs)]

    # Find the new parameters and load them into the agent
    new_params = linesearch(get_loss_kl, prev_params, fullstep, max_kl)
    set_flat_params_to(TRPOagent, new_params)
    return get_loss_kl(new_params)

How it works

Your update step method should accept the following six parameters as input:

- TRPOagent: The agent, which selects an action for a given observation according to the current policy
- agent_observations: A batch of observations
- agent_actions: A batch of actions
- cummulative_returns: A batch of cumulative discounted returns
- old_probs: A batch of action probabilities computed by the old network policy
- max_kl: The bound on how large the KL divergence between the old and the new policy may be at every step

Your method should return the computed value of the loss function and the Kullback-Leibler divergence between the new network policy and the old network policy.
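The scaling of stepdir into fullstep is chosen so that the quadratic approximation of the KL divergence at the full step equals max_kl. The NumPy sketch below checks this on random data; the matrix H and the vector stepdir are stand-ins invented for illustration, not values produced by the agent.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((4, 4))
H = M.T @ M + np.eye(4)    # stand-in for the Hessian of the KL divergence
stepdir = rng.random(4)    # stand-in for the conjugate gradient solution
max_kl = 0.01

# Same arithmetic as in update_step(), written with NumPy.
shs = 0.5 * stepdir @ H @ stepdir
lm = np.sqrt(shs / max_kl)
fullstep = stepdir / lm

# Quadratic approximation of the KL divergence after taking the full step.
predicted_kl = 0.5 * fullstep @ H @ fullstep
print(predicted_kl)  # equals max_kl up to rounding
```

Because this holds only for the quadratic approximation, the true KL divergence can still exceed max_kl, which is why update_step() follows up with the line search.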

Training the TRPO agent and printing rewards

In this final recipe, we will train the TRPO agent and print the rewards. In each epoch, we roll out episodes in the environment, update the policy, and compute the Kullback-Leibler divergence between the old and new probability distributions. We accumulate the episodic rewards and repeat until we see convergence. We also report the surrogate loss and the entropy for each epoch.

Getting ready

Please follow the instructions in this recipe to train the TRPO agent and print out the results.

How to do it

1. Let’s start training the network with our TRPO policy agent.

import time
from itertools import count
from collections import OrderedDict

max_kl = 0.01
numeptotal = 0
start_time = time.time()

# count(1) yields 1, 2, 3, ... so the loop runs until it is interrupted
for i in count(1):
    print("\n***** Epoch %i *****" % i)
    print("Acrobot environment rollout")
    paths = process_rollout(environment, TRPOagent)

    print("Environment simulation")
    agent_observations = np.concatenate([path["Observations"] for path in paths])
    agent_actions = np.concatenate([path["Actions"] for path in paths])
    returns = np.concatenate([path["Cumulative_returns"] for path in paths])
    old_probs = np.concatenate([path["Policy"] for path in paths])

    loss, kl = update_step(TRPOagent, agent_observations, agent_actions, returns, old_probs, max_kl)

    episode_rewards = np.array([path["Rewards"].sum() for path in paths])

    stats = OrderedDict()
    numeptotal += len(episode_rewards)
    stats["Total number of episodes"] = numeptotal
    stats["Average sum of rewards per episode"] = episode_rewards.mean()
    stats["Std of rewards per episode"] = episode_rewards.std()
    stats["Time elapsed"] = "%.2f mins" % ((time.time() - start_time) / 60.)
    stats["Kullback–Leibler divergence between old and new probability distribution"] = kl.data.numpy()
    stats["Entropy"] = get_entropy(TRPOagent, agent_observations).data.numpy()
    stats["Surrogate loss"] = loss.data.numpy()

    for k, v in stats.items():
        print(k + ": " + " " * (40 - len(k)) + str(v))

How it works

In step 1, we set the hyperparameter max_kl for the TRPO algorithm. This hyperparameter bounds how large the Kullback-Leibler divergence between the old and the new policy may be at every update step. We also initialize numeptotal, which counts the total number of episodes the agent has played in the Acrobot environment. We train the TRPO agent in a loop and print the statistics of the rollouts the agent made in each epoch.

For each epoch, we print the total number of episodes, the average sum of rewards per episode, the standard deviation of rewards per episode, the time elapsed since the start of training, the KL divergence between the old and new probability distributions, the entropy of the policy, and the surrogate loss.

Unlike some other environments, Acrobot is an unsolved environment, which means it does not have a specified reward threshold at which it is considered solved. It consists of two links and two joints, where the joint between the two links is actuated. The objective of the TRPO agent is to swing the end of the lower link up to a given height. Therefore, we will not see a particular point at which the Acrobot environment is solved by the TRPO agent.
