Ai Financial : Constructing Equity Portfolios With Machine Learning
Our approach is to build a machine learning model on features extracted from certain SEC filings, specifically manager data from Form 13F. We use this data to predict the direction of each stock's quarterly price movement, then construct a portfolio from the stocks with a positive predicted return. The resulting portfolio outperforms the S&P 500, with an average quarterly return of 12% in simulation.
Keywords: SEC 13F form, machine learning, quantitative model, simulation
We use 13F data from 2012 to 2018: data before 2012 is less reliable, and six years of filings are sufficient to train a convincing model and run a simulation. We also use daily stock prices from 2012 to 2018 to compute quarterly returns.
Each 13F record contains the record ID of the form, the CIK (Central Index Key) of the filing institution, the quarter end date, whether the filing is a restatement or an amendment, the filing date, and, for each security held, its CUSIP number, holding quantity, fair market value, and the investment discretion of the long position.
Because a 13F may be filed up to 45 days after the end of a quarter, we cannot obtain the complete 13F data for a quarter until 45 days after it ends, so our quarterly rebalanced portfolio begins 45 days after each quarter end. Since stock price movements are very time sensitive and we are already at least 45 days behind the "fresh" data, we discard filings marked as restatements or amendments, which are typically reported more than a quarter after the original filing deadline. This choice has a cost: amendments are corrections to the original forms, so discarding them may leave significant errors in the data.
For each quarter, we aggregate the 13F filings: for each stock we sum the market value, the holding quantity, and the investment discretion (which we refer to as the long fraction), then combine these features with the quarterly return beginning 45 days after the quarter end date. Repeating these steps from the first quarter of 2012 to the last quarter of 2015 yields a dataset of roughly 10,000 rows.
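As a sketch of this aggregation step, the per-quarter, per-stock roll-up can be done with pandas. The column names and the choice of averaging the long fraction are our assumptions, not the actual 13F schema:

```python
import pandas as pd

# Toy 13F extract; column names are assumptions, not the actual schema.
filings = pd.DataFrame({
    "quarter_end": ["2012-03-31"] * 3 + ["2012-06-30"] * 2,
    "cusip": ["037833100", "037833100", "594918104", "037833100", "594918104"],
    "market_value": [1200.0, 800.0, 500.0, 1500.0, 650.0],
    "qty": [10_000, 7_000, 4_000, 11_000, 5_000],
    "long_fraction": [0.9, 1.0, 0.8, 0.95, 0.85],
})

# Aggregate every manager's position in each stock within a quarter.
agg = (filings
       .groupby(["quarter_end", "cusip"], as_index=False)
       .agg(market_value=("market_value", "sum"),
            qty=("qty", "sum"),
            long_fraction=("long_fraction", "mean")))  # mean is our assumption

# `agg` would then be merged with the quarterly return starting 45 days
# after each quarter end to form the training rows.
print(agg)
```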
A glimpse of the data we use (market values are divided by 1,000 for readability).
Supervised Machine Learning Strategy
The next step is to fit a model to our dataset. Our goal is to optimize the portfolio's return, and our approach is to predict each stock's return and select those with the highest expected return. Since we already have a target value for every data point, the quarterly return starting 45 days after the quarter end, a supervised machine learning model is an appropriate choice.
To make a high-accuracy prediction, we split the target variable (quarterly return) into two classes: negative return and positive return. Our objective is to predict the direction of the stock price movement over the following quarter, so we train a classification model.
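The binarization itself is a one-liner; for instance, with made-up return values:

```python
import numpy as np

returns = np.array([0.05, -0.02, 0.10, -0.15, 0.07])  # hypothetical quarterly returns
labels = (returns > 0).astype(int)  # 1 = positive return, 0 = negative
print(labels)  # [1 0 1 0 1]
```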
It is always important to preprocess the data before fitting a model: preprocessing accelerates the learning algorithms and helps train a more accurate model.
Expand the Features
For now, we have very limited estimators (features) for predicting price movement: QTY (holding quantity), market value, and long fraction. To give the model more to work with, we expand these features. Polynomial expansion is a useful way to add complexity and information by generating higher-order and interaction terms. After several adjustments, we expand our features to second order, which adds useful information without being too computationally expensive.
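A minimal sketch of a second-order expansion for our three features (the function is ours for illustration; scikit-learn's `PolynomialFeatures` does the same job):

```python
import numpy as np
from itertools import combinations_with_replacement

def poly_expand(X, degree=2):
    """Append all second-order power and interaction terms to the feature matrix."""
    cols = [X]
    n = X.shape[1]
    for d in range(2, degree + 1):
        for idx in combinations_with_replacement(range(n), d):
            # Product of the selected columns, e.g. x1*x2 or x1**2.
            cols.append(np.prod(X[:, list(idx)], axis=1, keepdims=True))
    return np.hstack(cols)

# One toy row of (qty, market value, long fraction) on a toy scale.
X = np.array([[2.0, 3.0, 0.5]])
print(poly_expand(X).shape)  # (1, 9): 3 original + 6 second-order terms
```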
Feature normalization can speed up learning considerably for our dataset because the features have very different scales: holding quantity and market value can vary from 10 to 100,000,000, while the long fraction is usually no larger than 1. Since there is no evidence that our features are normally distributed, we apply feature scaling and mean normalization so that all features end up on a similar scale. The transformation is x' = (x − μ) / s, where μ is the feature's mean and s its scale (standard deviation or range).
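A sketch of this transformation, using the standard deviation as the scale factor s:

```python
import numpy as np

def normalize(X):
    """Mean-normalize and scale each feature column: x' = (x - mean) / std."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)  # guard constant features
    return (X - mu) / sigma, mu, sigma

# Toy rows of (qty, market value, long fraction) on wildly different scales.
X = np.array([[10.0, 1e6, 0.90],
              [100.0, 5e7, 0.80],
              [1000.0, 1e8, 1.00]])
Xn, mu, sigma = normalize(X)
# Every column of Xn now has mean 0 and standard deviation 1.
```

At prediction time the same mu and sigma learned from the training set should be reused, rather than recomputed on new data.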
Model Selection and Evaluation
There are many classification models we could apply to this problem: logistic regression, neural network classification, and decision tree classifiers, along with support vector machine classification.
Cross-Validation Model Selection
To choose the best model, we split our data into three parts: a training set, a cross-validation set, and a test set, in proportions of 60%, 20%, and 20% respectively. We use the cross-validation set to pre-test each candidate model and base model selection and adjustment on those results. We then make predictions on the test set, so each model is evaluated on completely unseen data and the result reflects the "true" accuracy of its predictions.
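A simple way to produce the 60/20/20 split by shuffled indices (the seed is arbitrary):

```python
import numpy as np

n = 10_000                        # roughly the size of our dataset
rng = np.random.default_rng(42)
idx = rng.permutation(n)          # shuffle row indices once

# Cut the shuffled indices at 60% and 80% to get the three disjoint sets.
train, cv, test = np.split(idx, [int(0.6 * n), int(0.8 * n)])
```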
To evaluate each model's performance we use the F1 score, since it reflects not only precision but also recall; we also examine the confusion matrix. For each model, we determine the optimal hyperparameters with a grid search.
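For reference, the F1 score follows directly from the confusion-matrix counts:

```python
def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall, from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# e.g. 70 true positives, 30 false positives, 30 false negatives
print(round(f1_score(70, 30, 30), 2))  # 0.7
```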
A classification model's output is actually the estimated probability that a data point belongs to each class. In our project we have two classes (positive return and negative return), and the default probability threshold is 0.5: if the estimated probability of a positive return exceeds 0.5, the model outputs positive. However, our goal is to choose up to 50 stocks for the portfolio, and each quarter we choose from more than 7,000 stocks, so we raise the prediction threshold above 0.5. In other words, only stocks with a sufficiently high probability of a positive return are predicted as positive. Adjusting the threshold this way lets us tune the model for better performance.
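Raising the threshold is a one-line change on top of the predicted probabilities (the values below are made up):

```python
import numpy as np

probs = np.array([0.40, 0.51, 0.53, 0.60, 0.72])  # hypothetical P(positive return)

default_preds = probs > 0.50   # default cut-off: 4 stocks predicted positive
strict_preds = probs > 0.53    # raised cut-off: only 2 positives survive
```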
Taking all of the above into account, along with computational complexity, our final choice is a neural network classifier with one hidden layer of 100 neurons and an initial probability threshold of 0.53. The model achieves an F1 score of 0.70.
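A sketch of that architecture with scikit-learn, on synthetic stand-in data (the features, labels, and max_iter setting here are ours, not the paper's):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data: 9 expanded features, binary return-direction labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 9))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# One hidden layer of 100 neurons, matching the architecture described above.
clf = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=0)
clf.fit(X, y)

proba = clf.predict_proba(X)[:, 1]   # P(positive return) per stock
preds = proba > 0.53                 # our raised decision threshold
```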
Constructing Portfolios and Simulation
After building the model, the final step is to run a simulation: predict the price movements, construct the portfolio, and compute its quarterly return. The simulation data runs from the first quarter of 2016 to the first quarter of 2018. The portfolio holds up to 50 stocks; if more than 50 stocks have a positive predicted return, we choose the 50 with the highest probability. To gauge how the portfolio performs, we compare it against the benchmark S&P 500 over the same period. Here is our portfolio performance alongside the S&P 500 quarterly return:
The result is that our quarterly rebalanced portfolio outperforms the S&P 500, with an average quarterly return of 12% from 2016 to 2018.
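The selection rule described above, keep at most 50 names ranked by predicted probability, can be sketched as follows (the probabilities are randomly generated placeholders):

```python
import numpy as np

rng = np.random.default_rng(7)
probs = rng.uniform(size=7000)   # hypothetical P(positive) for each candidate stock
threshold = 0.53

# Indices of stocks predicted positive at the raised threshold.
candidates = np.flatnonzero(probs > threshold)

# Rank the positive predictions by probability, descending, and keep the top 50.
portfolio = candidates[np.argsort(probs[candidates])[::-1]][:50]
```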
In conclusion, the SEC Form 13F gives the public a glimpse of what the top asset managers held in the previous quarter. Using that information, we developed a neural network classifier to predict quarterly returns, and a portfolio constructed from those predictions does indeed earn a considerable profit relative to the S&P 500.
However, there are some issues with our strategy and model. First, a 13F may be filed up to 45 days after the end of a quarter, and most managers submit theirs as late as possible. Because the stock market is very time sensitive and we can only obtain a quarter's complete 13F information 45 days after it ends, we must rely on long-position information that may be more than four months old. The 13F itself is also controversial: studies have found a "widespread presence of significant reporting errors" in 13F filings (Anderson and Brockman, 2016), and many consider the form unreliable.
In addition, the 13F covers only long positions, and this incomplete picture may be misleading, since short selling plays an important role in asset management. Finally, our model's F1 score is not very high: it suffers from underfitting. Although we expanded the features we extract from the 13F data, these features are clearly not sufficient to train a high-performance model. The limited information available from the 13F is another significant issue that deserves further attention, and introducing features beyond 13F data may be a promising research direction.
References
Anderson, Anne M., and Paul Brockman, 2016, Form 13F (Mis)Filings, Working paper, Lehigh University.
Aragon, George O., Michael Hertzel, and Zhen Shi, 2013, Why do hedge funds avoid disclosure? Evidence from confidential 13F filings, Journal of Financial and Quantitative Analysis 48, 1499–1518.
Aragon, George O., and Vikram K. Nanda, 2014, Strategic delays and clustering in hedge fund reported returns, Working paper, Arizona State University.
Brown, Stephen J., and Christopher Schwarz, 2013, Do market participants care about portfolio disclosure? Evidence from hedge funds' 13F filings, Working paper, New York University.
Collin-Dufresne, Pierre, and Vyacheslav Fos, 2015, Do prices reveal the presence of informed trading?, Journal of Finance 70.
Edmans, Alex, 2009, Blockholder trading, market efficiency, and managerial myopia, The Journal of Finance 64, 2481–2513.
Lemke, Thomas P., and Gerald T. Lins, 1987, Disclosure of equity holdings by institutional investment managers: An analysis of Section 13(f) of the Securities Exchange Act of 1934, The Business Lawyer, 91–119.
Christoffersen, Susan E. K., Erfan Danesh, and David Musto, 2016, Why do institutions delay reporting their shareholdings? Evidence from Form 13F, Working paper.
Verbeek, Marno, and Yu Wang, 2013, Better than the original? The relative success of copycat funds, Journal of Banking & Finance 37, 3454–3471.