Performance of Recurrent Neural Networks Following the COVID Market Shock
This paper uses the market shock of the COVID-19 pandemic to further understand how Long Short Term Memory (LSTM) Recurrent Neural Networks (RNNs) adapt to varying degrees of volatility. Prior to the market shock, the trend of the S&P 500 was positive, with occasional market corrections.
However, the pandemic induced a large negative market force. This was followed by a mean-reverting positive trend, and in the span of just over 6 months, the S&P 500 had recovered all of its losses. This paper tests the LSTM model against this data set of quickly changing market trends.
The market shock of the COVID-19 pandemic was a large negative force, as can be seen in the following chart of the SPDR S&P 500 ETF ($SPY).
This is very interesting, especially in the domain of predictive time series modeling. Some questions that arise when looking at this stock chart are: “How can you use previous stock data, which has relatively low volatility, to react to such a drastic market shock?” and “How quickly can a predictive model ‘learn’ and ‘forget’ these market shocks?” In this paper, I will investigate the performance of an LSTM network during the COVID-19 pandemic.
For brevity’s sake, I will not discuss the theory of feed-forward neural networks, recurrent neural networks (RNNs), or long short term memory (LSTM) networks.
A common technique in modern neural networks is the use of dropout layers. Given sufficient data and enough training time (measured in epochs), a neural network can recognize and learn patterns in a data set; however, it can also overfit the data, meaning that it loses the ability to generalize these patterns to new time series. The ability to generalize is essential, as future data is unknown.
To counteract this, dropout layers simulate many different network architectures over the course of training. They do this by randomly and temporarily dropping out nodes during training, along with all of their incoming and outgoing connections. This makes the training process noisy, so that the neurons within a layer must adapt to a wide range of inputs.
From the network’s perspective, these dropout layers induce a random sparse activation, so that the network can better “understand” the importance of the activation of a specific neuron in the previous layer. This makes the model more robust, and greatly reduces overfitting.
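The dropout mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the code used in this paper: it assumes the common "inverted" variant, in which the surviving activations are rescaled so that their expected value is unchanged, and the function name `dropout`, the dropout rate of 0.5, and the array of ones are all hypothetical.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate` during training."""
    if not training or rate == 0.0:
        # At prediction time the layer passes its input through untouched.
        return activations
    keep_prob = 1.0 - rate
    # Bernoulli mask: True means the unit is kept for this training step.
    mask = rng.random(activations.shape) < keep_prob
    # Rescale the survivors so the expected activation is unchanged.
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
layer_output = np.ones((4, 8))   # hypothetical activations: 4 samples, 8 units
dropped = dropout(layer_output, rate=0.5, rng=rng)
unchanged = dropout(layer_output, rate=0.5, rng=rng, training=False)
```

Each call during training draws a fresh mask, which is what makes every training step see a slightly different, sparser network.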
A dense layer is the most standard layer in a neural network. Each neuron in a dense layer is connected to each neuron in the previous layer and to each neuron in the next layer.
Note that this does not mean that each neuron in the previous layer has an impact on the activation of the neuron, or that the neuron has an impact on the activation of each neuron in the next layer. This is because during training, the weight connecting two particular neurons is allowed to go to zero, which has the effect of deleting the connection. However, this isn’t the same as actually deleting the connection, as is the case with non-dense layers.
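The point about zero weights can be made concrete with a small NumPy sketch of a dense layer's forward pass. The weights, inputs, and the choice of tanh as the activation are hypothetical values picked for illustration; the only feature that matters is that the weight from input 0 to output 1 is exactly zero.

```python
import numpy as np

def dense_forward(x, weights, bias):
    """Fully connected layer: every input feeds every output unit."""
    return np.tanh(x @ weights + bias)

x = np.array([1.0, -2.0, 0.5])        # 3 hypothetical input activations
weights = np.array([[0.4, 0.0],       # row 0: weights leaving input 0
                    [0.1, -0.3],
                    [-0.2, 0.7]])     # shape (3 inputs, 2 outputs)
bias = np.zeros(2)

baseline = dense_forward(x, weights, bias)

# weights[0, 1] is 0.0, so changing input 0 moves output 0 but not output 1.
x_changed = x.copy()
x_changed[0] = 10.0
perturbed = dense_forward(x_changed, weights, bias)
```

The connection still exists in the weight matrix, so later training steps are free to move it away from zero again.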
Stochastic Gradient Descent
The heart of the training algorithm is gradient descent. However, due to the number of neurons in the network, this task can be very computationally intensive.
This is because for each epoch, each combination of training observation and neuron requires four gradients to be calculated. Even in moderately sized networks like the model used in this paper, the training process can be very expensive.
The remedy to this is to use stochastic gradient descent. Before training, we subdivide the training set into random batches. We scatter the batches such that a particular class of observation spreads (relatively) evenly throughout the batches. This ensures that training on an unlucky sequence of batches does not undo forward progress.
If the cost function is viewed as a surface in high-dimensional space (each weight and bias is a dimension), traditional gradient descent is like meticulously picking the steepest direction of descent at every step and proceeding in that direction.
Continuing with that analogy, stochastic gradient descent is like a drunken man walking downhill; the general direction is correct, but he stumbles around the most direct path.
While this process increases the number of epochs necessary for convergence (or at least up to an accuracy threshold), it considerably decreases the computation required per epoch. This results in much faster training.
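The batching procedure described in this section can be sketched on a toy problem. The example below is a minimal illustration, not the training loop used in this paper: it fits a straight line instead of an LSTM, and the function name `minibatch_sgd`, the batch size, the learning rate, and the number of passes are all assumptions chosen so that the example converges.

```python
import numpy as np

def minibatch_sgd(x, y, batch_size, passes, lr, rng):
    """Fit y ~ w*x + b by stochastic gradient descent on the batch MSE."""
    w, b = 0.0, 0.0
    # Subdivide the training set into random batches once, before training.
    order = rng.permutation(len(x))
    batches = [order[i:i + batch_size] for i in range(0, len(x), batch_size)]
    for _ in range(passes):
        for idx in batches:
            err = w * x[idx] + b - y[idx]
            # Gradient of this batch's MSE with respect to w and b.
            grad_w = 2.0 * np.mean(err * x[idx])
            grad_b = 2.0 * np.mean(err)
            # Each update uses only one batch, so it is cheap but noisy.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

rng = np.random.default_rng(42)
x = rng.uniform(-1.0, 1.0, size=256)
y = 3.0 * x + 1.0                     # hypothetical noise-free target
w, b = minibatch_sgd(x, y, batch_size=32, passes=200, lr=0.1, rng=rng)
```

Every update here touches 32 observations instead of all 256, which is the trade described above: more updates are needed, but each one is far cheaper.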
The first part of any project is to procure the data. For this paper, the data was sourced from Yahoo Finance’s Historical Data service. The data that was used was the adjusted OHLC price action for SPDR S&P 500 ETF Trust ($SPY) from January 1, 2016 to January 1, 2021.
Note: In the project proposal, I proposed using January 1, 2020 to January 1, 2021 as the complete data set, with the first three quarters designated as the training data and the last quarter designated as the testing data. However, this amount of data was not sufficient: after 100 epochs of training, the loss function showed no real convergence.
The data up until March 11, 2019 was used as the training set, and the remainder of the data was used as the testing data.
Before feeding this data into the neural network, the data underwent (0, 1)-MinMax scaling. This process rescales the data so that the minimum maps to 0 and the maximum maps to 1, placing every value in the interval [0, 1].
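A minimal sketch of MinMax scaling on a chronological split is shown below. The helper names and the five prices are hypothetical, and fitting the minimum and maximum on the training portion only is an assumption made for this illustration; the text above does not state which portion of the data the scaling parameters were computed from.

```python
import numpy as np

def fit_minmax(train):
    """Return the per-column minimum and maximum of the training data."""
    return train.min(axis=0), train.max(axis=0)

def apply_minmax(values, lo, hi):
    """Rescale so that lo maps to 0 and hi maps to 1."""
    return (values - lo) / (hi - lo)

def invert_minmax(scaled, lo, hi):
    """Map scaled values back to the original price units."""
    return scaled * (hi - lo) + lo

# Hypothetical closing prices, oldest first (not the paper's data).
prices = np.array([[200.0], [210.0], [250.0], [240.0], [300.0]])
train, test = prices[:3], prices[3:]   # chronological split, no shuffling

lo, hi = fit_minmax(train)
train_scaled = apply_minmax(train, lo, hi)
test_scaled = apply_minmax(test, lo, hi)
restored = invert_minmax(test_scaled, lo, hi)
```

Under this assumption, test prices above the training maximum scale to values greater than 1, which is worth keeping in mind when the test period contains prices the training period never reached.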
We employed the following network architecture:
After each LSTM layer, a dropout layer was used to combat overfitting. Finally, after 4 LSTM-Dropout pairs, a dense layer was used to act as the output layer. This is to ensure that all activations from the last dropout layer are being taken into consideration (at least during training).
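The stack of four LSTM-Dropout pairs followed by a dense output layer can be sketched as follows. Only the layer ordering comes from the description above; the number of units (50), the dropout rate (0.2), the input window (60 time steps), the single input feature, and the helper names `layer_plan` and `build_model` are hypothetical placeholders, not the values used in this paper.

```python
def layer_plan(n_pairs=4, units=50, rate=0.2):
    """Describe the stack as (layer_type, kwargs) pairs, ending in Dense(1)."""
    plan = []
    for i in range(n_pairs):
        # Every LSTM except the last must return its full sequence,
        # because the next LSTM expects a sequence as input.
        plan.append(("LSTM", {"units": units,
                              "return_sequences": i < n_pairs - 1}))
        plan.append(("Dropout", {"rate": rate}))
    # A single output unit for the predicted (scaled) price.
    plan.append(("Dense", {"units": 1}))
    return plan

def build_model(plan, window=60, n_features=1):
    """Turn the plan into a Keras Sequential model (requires TensorFlow)."""
    # Imported lazily so the plan can be inspected without TensorFlow.
    from tensorflow import keras
    model = keras.Sequential()
    model.add(keras.Input(shape=(window, n_features)))
    for layer_type, kwargs in plan:
        model.add(getattr(keras.layers, layer_type)(**kwargs))
    return model

plan = layer_plan()
```

Calling `build_model(plan)` would give a model that still needs to be compiled with an optimizer and a loss before training; neither is specified in the text, so they are left out of the sketch.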
Note: The frameworks used to train and execute the neural networks were scikit-learn and Keras. The computation was run on Google Colab, so that I could leverage Google TPUs for faster training.
The training was done for 100 epochs, each epoch being run on a randomized batch of training data (see Stochastic Gradient Descent).
The following graph compares the actual stock price action with the predicted stock price action. Overall, the LSTM network performed better than expected. Since the training set did not contain these levels of volatility, the model was not expected to track the downward price action this well. However, during the upward trend following the crash, the LSTM network became dissociated from the stock price action, and it did not recover from this dissociation; rather, the difference between the prediction and the actual price movement increased.
The mean squared error (MSE) of this LSTM model was 180.926. However, much of this error is a result of the second half of the testing data set; namely, the bulk of the error comes from the price action following the trough. This can be made evident by taking the MSE of the first half of the testing data set alone, which is only 22.749.
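The comparison between the overall MSE and the MSE of the first half of the test set can be sketched as below. The two series are hypothetical numbers invented for this illustration (they are not the paper's data and do not reproduce the figures above); they are constructed so that the prediction tracks the first half closely and drifts away in the second half.

```python
import numpy as np

def mse(actual, predicted):
    """Mean squared error between two equally long series."""
    diff = np.asarray(actual) - np.asarray(predicted)
    return float(np.mean(diff ** 2))

# Hypothetical test series: a fall followed by a recovery.
actual = np.array([300.0, 280.0, 250.0, 230.0, 260.0, 290.0, 320.0, 350.0])
# Hypothetical predictions: off by 1 in the first half, then falling behind.
predicted = np.array([301.0, 279.0, 251.0, 229.0, 255.0, 280.0, 305.0, 330.0])

half = len(actual) // 2
mse_all = mse(actual, predicted)
mse_first_half = mse(actual[:half], predicted[:half])
mse_second_half = mse(actual[half:], predicted[half:])
```

Because the two halves have equal length, the overall MSE is the average of the two half MSEs, so a small first-half value implies that the second half carries most of the error.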
In conclusion, the performance of the LSTM model was pleasing. For a network as small as this, with an architecture chosen as arbitrarily as it was, this model performed well.
Now, while I cannot make blanket statements as to the capabilities of all LSTM models, I will make a few statements about the capabilities of this particular network architecture in the application of this time series.
It seems to be the case that following a market event, the model can lose connection with the actual price action. This is apparent during the few months before the crash and during the last few months of the test data, where we can see the gap between the actual price action and the predicted price action increase. I think that in a production environment, this model would need a correcting agent, one that pushes the predicted price action closer to the current actual price action. I believe that this would decrease the gaps we see.
Another remark about this network is that it was very quick to adapt to the market crash. There was very little lag between the crash in the actual price action and the crash in the predicted price action.
While this is good behavior, more work would have to be done to ensure that it is not a result of recency bias (similar to an Exponential Moving Average, where newer data is weighted more heavily than older data, at an exponential rate). This would ensure that the model is not too sensitive to market corrections that take the form of sharp mean-reverting jolts.
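The recency bias of an Exponential Moving Average can be illustrated with a short sketch. The function names, the smoothing factors, and the four prices are hypothetical; the example only shows how a larger smoothing factor makes the average react faster to a sudden drop, which is the kind of sensitivity discussed above.

```python
def ema(values, alpha):
    """Exponential moving average with smoothing factor alpha in (0, 1]."""
    smoothed = [values[0]]
    for v in values[1:]:
        # The newest value gets weight alpha; all history shares 1 - alpha.
        smoothed.append(alpha * v + (1.0 - alpha) * smoothed[-1])
    return smoothed

def ema_weights(n, alpha):
    """Weight on each of the last n observations, newest first
    (the residual weight left on the seed value is ignored)."""
    return [alpha * (1.0 - alpha) ** k for k in range(n)]

series = [100.0, 100.0, 100.0, 70.0]   # hypothetical prices with a sudden drop
fast = ema(series, alpha=0.8)          # strong recency bias
slow = ema(series, alpha=0.2)          # weak recency bias
weights = ema_weights(3, alpha=0.5)
```

With these inputs the fast average ends much closer to the dropped price than the slow one, and the weights halve with each step back in time.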
Finally, a good follow-up project would be to test this exact same data set with a Gated Recurrent Unit (GRU). GRUs are a class of RNNs that, similar to the LSTM architecture, remedy the vanishing gradient problem. Comparing the performance of these two network types would expose the pros and cons of using one particular network over the other for highly volatile time series data.