Reinforcement Deep Learning vs. Deep Learning

What is Deep Learning?

  Both Deep Learning and Reinforcement Deep Learning build on a complication in the objective function used by Machine Learning techniques:

  In many cases, characteristics in the data have to pass through multiple ‘levels of testing’ before they contribute to a classification or regression solution. Shallow Machine Learning models struggle to represent this kind of layered dependency, which is why we have begun to move to Deep Learning for more complex topics.

  Consider a model that takes a set of predictive inputs ‘X’, with individual values ‘x’, that belong to some pattern of data ‘D’. ‘D’ can be essentially any type of data that can be described by a set of metrics.

  It’s easy to see that a single Deep Learning model definition can’t fit every use-case. Our optimal model is as accurate as possible at predicting the real value of ‘D’: 

  given an optimal set of node weights ‘W’ and an activation function ‘f(X(D))’.

  The ‘levels of testing’ mentioned above are called nodes. They dictate whether a value of ‘x’ contributes to a prediction and can be flipped on or off by the model. To maximize a given model’s performance, we need to find an optimal combination of node weights ‘W’.

  Let’s define the performance of our model given a vector of node weights as ‘F(W)’; this is a learning function. The change in performance from an old vector of weights to a new vector is traditionally thought of as a 3-axis gradient, constructed by tuning a set of parameters. The first axis is the current vector of weights together with a learning rate, where the learning rate dictates what size step to take when moving along the gradient. Taken together, these two terms form a set of parameters. The second axis is a new set of parameters, made up of weights and a learning rate that are not identical to the old set. The third and final axis is the performance, ‘F(W)’. This fully defines the axes of a model’s gradient.

  We introduce two general constraints to coax the gradient into converging to a solution:

1. We only want to consider a step if it improves model performance, i.e. ΔF > 0.

2. At any given fork, we want to take an optimal step along the gradient of weights that improves our model.
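  The two constraints can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the performance surface ‘F(W)’ is a toy concave function with a known maximum, and the names `performance`, `gradient`, `ascend`, and `TARGET` are hypothetical rather than part of any library. A candidate step is taken along the gradient (constraint 2) and kept only when ΔF > 0 (constraint 1); otherwise the learning rate is halved.

```python
TARGET = (1.0, -2.0)  # hypothetical optimum of the toy surface


def performance(w):
    """Toy F(W): a concave surface whose single maximum sits at TARGET."""
    return -sum((wi - ti) ** 2 for wi, ti in zip(w, TARGET))


def gradient(w):
    """Analytic gradient of the toy F(W) with respect to each weight."""
    return [-2.0 * (wi - ti) for wi, ti in zip(w, TARGET)]


def ascend(w, eta=0.1, steps=200):
    """Gradient ascent that only accepts steps with a positive change in F."""
    for _ in range(steps):
        g = gradient(w)
        # Constraint 2: move along the gradient, scaled by the learning rate.
        candidate = [wi + eta * gi for wi, gi in zip(w, g)]
        delta_f = performance(candidate) - performance(w)
        if delta_f > 0:
            # Constraint 1: keep the step only if performance improves.
            w = candidate
        else:
            # Otherwise reject the step and try a smaller one next time.
            eta *= 0.5
    return w


w_start = [0.0, 0.0]
w_final = ascend(w_start)
print(w_final, performance(w_final))
```

  On this toy surface the accepted steps move the weights toward `TARGET`; a real model would replace `performance` and `gradient` with its own loss and backpropagated gradients.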

  There is a wealth of well-known methods for calculating an optimal set of parameters, such as Bayesian Optimization, Grid Searching, and Orthogonal Array Tuning^1 (2019), to name a few. Each technique has its own trade-offs between pre-processing difficulty, computational cost, and level of convergence.
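  As a rough illustration of the simplest of these methods, the sketch below runs a grid search over two parameters: a learning rate and a number of training steps. The one-weight model, its performance function, and the candidate values in `ETAS` and `STEP_COUNTS` are all hypothetical choices made for this example.

```python
from itertools import product


def performance(w):
    """Toy F(W) for a single weight; the maximum is at w = 3."""
    return -(w - 3.0) ** 2


def train(eta, steps, w=0.0):
    """Plain gradient ascent on the toy F(W) for a fixed number of steps."""
    for _ in range(steps):
        w += eta * (-2.0 * (w - 3.0))
    return w


ETAS = (0.01, 0.1, 0.5, 1.1)   # candidate learning rates (assumed values)
STEP_COUNTS = (5, 20)          # candidate training lengths (assumed values)

# Evaluate every combination on the grid and record the final performance.
results = {
    (eta, steps): performance(train(eta, steps))
    for eta, steps in product(ETAS, STEP_COUNTS)
}

best_eta, best_steps = max(results, key=results.get)
print(best_eta, best_steps, results[(best_eta, best_steps)])
```

  The cost grows with the product of the grid sizes, which is the computational trade-off mentioned above. On this toy problem the largest learning rate overshoots and diverges, so more steps make its result worse rather than better.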

  Our revised function that measures how our model improves is:

 Where ‘eta’ describes a learning rate taken from our new model’s parameters. If we define a set of parameters as ‘theta’, then our generalized Deep Learning objective function is:

 If a single global maximum exists for the gradient, then this value of theta also solves

What is Reinforcement Deep Learning?

  Reinforcement Deep Learning (RDL) distinguishes itself from Supervised Deep Learning and Unsupervised Deep Learning in that an RDL model is neither given labelled data nor trained on a series of unlabelled data; instead, it recursively improves itself using reward and penalty functions based on its own trial-and-error performance.
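  A minimal sketch of this trial-and-error loop is shown below. It is not a deep model: it is a two-action toy, with a hypothetical environment in which action 1 earns a reward and action 0 incurs a penalty. The exploration rate `epsilon`, the function names, and the running-average update are assumptions chosen for illustration.

```python
import random


def reward(action):
    """Hypothetical environment: action 1 is rewarded, action 0 is penalized."""
    return 1.0 if action == 1 else -1.0


def train(episodes=100, epsilon=0.1, seed=0):
    """Learn a value estimate per action purely from observed rewards."""
    rng = random.Random(seed)
    value = [0.0, 0.0]   # running average reward for each action
    count = [0, 0]       # how many times each action has been tried
    for episode in range(episodes):
        if episode < 2:
            action = episode                 # try every action once
        elif rng.random() < epsilon:
            action = rng.randrange(2)        # occasional random trial
        else:
            action = max((0, 1), key=lambda a: value[a])  # exploit best so far
        r = reward(action)
        count[action] += 1
        # Incremental mean: nudge the estimate toward the observed reward.
        value[action] += (r - value[action]) / count[action]
    return value


values = train()
best_action = max((0, 1), key=lambda a: values[a])
print(values, best_action)
```

  No labels are supplied at any point; the model's preference for action 1 comes only from the rewards and penalties it observes.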

  Following Morgan Jones and Matthew M. Peet’s work on generalizing the Bellman Equation in optimization schemes^2 (2019), we can generalize an RDL model by solving

  Here (u in U) is an approximate-value-fitting term at t, (x in X) is a performable action of the model at t, and any value of ‘s’ represents a fixed state-of-being at t. Bertsekas’s Rollout Algorithm^3 (2005) gives us a computationally fast and simple way to approximate an initial stage from which to build a generalized RDL model:

1. Construct an Approximate-Value Function using a Rollout Algorithm:

  • Initialize a base policy using some valid value of U:
  • for every {u(t), x(t)} in [0, T]:
  • Where we recursively input:

2. Fit an RDL model using the Approximate-Value Function:

  for k in [0, …,T-1],

  Where u(k) = (u(0), …, u(T)) is an optimal input sequence and ‘phi(t)’ is a Reward/Penalty term used to guide the model into convergence.
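  The rollout idea behind step 1 can be sketched on a small deterministic problem. This is an illustration under stated assumptions, not the formulation from the cited works: the state ‘s’ is a position on a line, the actions ‘x’ are moves of -1, 0, or +1, the reward/penalty term `phi` penalizes distance from a hypothetical `GOAL`, and the base policy simply stays in place. The approximate value of an action is its immediate reward plus the reward collected by following the base policy until the horizon `T`.

```python
ACTIONS = (-1, 0, 1)  # performable actions x (assumed)
GOAL = 3              # hypothetical target state
T = 5                 # horizon


def step(s, x):
    """Deterministic state transition."""
    return s + x


def phi(s):
    """Reward/penalty term: penalize distance from the goal."""
    return -abs(s - GOAL)


def base_policy(s):
    """A deliberately weak base policy: never move."""
    return 0


def simulate_base(s, t):
    """Total reward from following the base policy from time t up to T."""
    total = 0
    for _ in range(t, T):
        s = step(s, base_policy(s))
        total += phi(s)
    return total


def approx_value(s, t, x):
    """Approximate value of action x: one real step, then the base policy."""
    s_next = step(s, x)
    return phi(s_next) + simulate_base(s_next, t + 1)


def rollout_policy(s, t):
    """Pick the action with the best approximate value."""
    return max(ACTIONS, key=lambda x: approx_value(s, t, x))


def run(policy, s=0):
    """Apply a policy for T steps; return (total reward, final state)."""
    total = 0
    for t in range(T):
        s = step(s, policy(s, t))
        total += phi(s)
    return total, s


base_total, _ = run(lambda s, t: base_policy(s))
rollout_total, final_state = run(rollout_policy)
print(base_total, rollout_total, final_state)
```

  In this toy setting the rollout policy does no worse than the base policy it is built from, which is what makes it a cheap starting point before fitting a fuller model in step 2.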

Written by Anatolie Chernyakhovsky


^1 Deep Neural Network Hyperparameter Optimization with Orthogonal Array Tuning. 05 December 2019; Xiaocong Chen, Lina Yao, Chang Ge, Manqing Dong. Published by the International Conference on Neural Information Processing. Retrieved from: (

^2 A generalization of Bellman’s equation with application to path planning, obstacle avoidance and invariant set estimation. November 4 2019; Morgan Jones, Matthew M. Peet. Published by the International Federation of Automatic Control. Retrieved from: (

^3 Dynamic Programming and Optimal Control; Volume 1, 3rd ed. 2005; Dimitri Bertsekas. Published by Athena Scientific Publishers.
