Machine Learning Challenges

Machine learning combines computer science, mathematics and statistics: programs learn from data automatically and infer the relationships within it.

Although machine learning is very popular in financial markets these days, applying its techniques to financial data raises many challenges.

In my experience, the most challenging part is that financial data is very hard to handle.

First, financial data often has missing values, extra data points or outliers. For example, a data set might contain values dated on market holidays, when no trading could have taken place. So it is necessary to clean the data before continuing with the analysis. Different data sets have different cleaning issues, which increases the time cost of the analysis.
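As a minimal sketch of such a cleaning step, the snippet below drops holiday rows, missing values and extreme outliers from a small price series. The prices, the holiday calendar, the function name `clean` and the `max_dev` threshold are all hypothetical, and the median-based outlier rule is only one of several reasonable choices.

```python
from datetime import date
from statistics import median

# Hypothetical daily closes. One row is dated on a market holiday,
# one value is missing, and 7500.0 looks like a data-entry error.
raw = [
    (date(2023, 12, 21), 73.9),
    (date(2023, 12, 22), 73.6),
    (date(2023, 12, 25), 73.6),    # holiday: should not exist
    (date(2023, 12, 26), None),    # missing value
    (date(2023, 12, 27), 7500.0),  # outlier
    (date(2023, 12, 28), 71.8),
    (date(2023, 12, 29), 71.7),
]
holidays = {date(2023, 12, 25)}  # assumed holiday calendar


def clean(rows, holidays, max_dev=5.0):
    """Drop holiday rows, missing values, and points far from the median."""
    rows = [(d, p) for d, p in rows if d not in holidays and p is not None]
    med = median(p for _, p in rows)
    # Median absolute deviation: a spread measure that outliers barely move.
    mad = median(abs(p - med) for _, p in rows)
    return [(d, p) for d, p in rows if abs(p - med) <= max_dev * mad]


cleaned = clean(raw, holidays)
```

The median and the median absolute deviation are used instead of the mean and standard deviation because a single extreme value would distort the latter two.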

Second, variables can be on very different scales, which affects the results of many algorithms, so normalizing the data is sometimes necessary as well.
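One common way to normalize is the z-score, which rescales each variable to mean 0 and standard deviation 1. The sketch below assumes two hypothetical features, a price and a trading volume; the function name `zscore` and the numbers are for illustration only.

```python
from statistics import mean, pstdev


def zscore(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical features on very different scales.
price = [71.7, 71.8, 73.6, 73.9]               # dollars per barrel
volume = [310_000, 295_000, 402_000, 388_000]  # contracts traded

price_z, volume_z = zscore(price), zscore(volume)
```

When the data is split into training and test sets, the mean and standard deviation would normally be estimated on the training set only and then reused on the test set, so that no information from the test period leaks into the model.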

Third, financial data is non-stationary, which means that its statistical properties, such as the mean and variance, change over time. It also tends to be noisy, so each observation is less informative.
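A common response to non-stationarity is to model returns rather than price levels. The sketch below uses a hypothetical, noise-free price path that grows 1% per step: the average price level differs sharply between the first and second half of the sample, while the average log return does not. The helper `half_means` is only a crude illustration, not a formal stationarity test.

```python
import math
from statistics import mean

# Hypothetical price path growing 1% per step: its level drifts over time.
prices = [50.0 * 1.01 ** t for t in range(200)]

# Log returns: r_t = log(P_t / P_{t-1}).
log_returns = [math.log(b / a) for a, b in zip(prices, prices[1:])]


def half_means(series):
    """Mean of the first and second half: a crude check for a drifting level."""
    mid = len(series) // 2
    return mean(series[:mid]), mean(series[mid:])


price_halves = half_means(prices)        # clearly different
return_halves = half_means(log_returns)  # essentially identical
```

Real return series are noisy and their variance also changes over time, so differencing removes the drifting level but not every form of non-stationarity.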

In my own experience, the WTI crude oil futures price is quite noisy, and the noise increases model variance. Even in a data-rich situation, or with techniques to reduce the noise, the noise that remains in the data still affects the performance of machine learning algorithms.

Besides the three challenges above, I have met other issues with data. In linear regression, the explanatory variables might be collinear. We could apply PCA to reduce the dimension, but the resulting variables lose their intuitive interpretation.
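Before reaching for PCA, it can help to measure how severe the collinearity is. The sketch below computes the variance inflation factor (VIF) for the special case of exactly two predictors, where it reduces to 1 / (1 - r^2) with r the Pearson correlation. The predictors `x1`, `x2`, `x3` and the function names are hypothetical.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r * r)


# Hypothetical predictors: x2 is almost a rescaled copy of x1,
# while x3 is only moderately correlated with x1.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
x3 = [3.0, -1.0, 2.0, -2.0, 1.0, -3.0]

vif_collinear = vif_two_predictors(x1, x2)  # very large
vif_moderate = vif_two_predictors(x1, x3)   # close to 1
```

A large VIF means the coefficient estimate for that predictor is unstable, because the regression cannot tell the two variables apart. With more than two predictors, the r^2 in the formula is replaced by the R^2 from regressing one predictor on all the others.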

With time series data, it is very likely that the data we have is lagged, which makes it impossible to generate a one-step-ahead prediction that could be used as a trading signal. We might also lack data altogether, or the data might be of too poor quality to analyze.

These data problems make it harder to apply machine learning algorithms to financial data.

Machine learning algorithms are designed to achieve specific goals.

Each algorithm has its own advantages and disadvantages, but the key is always the bias-variance tradeoff.

A simple model might have high bias and low variance. It is likely to underfit, and we might miss some relevant relationships between the predictors and the response variable.

An overly complex model, such as a 1000-degree polynomial regression, will have low bias and high variance. It will overfit, and we might see almost zero training error but extremely high test error.
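The two extremes can be reproduced in a few lines. The sketch below uses k-nearest-neighbor regression rather than the polynomial regression mentioned above, because it needs no linear algebra library; the data is a hypothetical noisy sine curve and all names are for illustration. With k = 1 the model memorizes the training set, and with k equal to the sample size it predicts the overall mean everywhere.

```python
import math
import random


def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mse(train, data, k):
    """Mean squared error of the k-NN model fitted on train, scored on data."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)


rng = random.Random(0)  # fixed seed so the sketch is repeatable


def sample(n, offset):
    """Noisy draws from y = sin(x) on a grid shifted by offset."""
    return [(0.1 * i + offset, math.sin(0.1 * i + offset) + rng.gauss(0, 0.3))
            for i in range(n)]


train, test = sample(60, 0.0), sample(60, 0.05)

# For each k: (training error, test error).
results = {k: (mse(train, train, k), mse(train, test, k)) for k in (1, 7, 60)}
```

The k = 1 model has exactly zero training error but still makes errors on the test set, which is the overfitting pattern described above. The k = 60 model ignores x entirely, which is the underfitting pattern. An intermediate k usually sits between the two.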

Thus, it is important to achieve a balance between bias and variance.

Different models might have different bias and variance on the same data set.

This raises a challenge: which model is better to use?

There are many techniques for evaluating and checking models. Cross-validation is a good way to estimate prediction error, which helps us decide which model to use.
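As a minimal sketch of cross-validation, the snippet below compares two candidate models, a constant (mean) model and a straight line, by their average held-out squared error over k folds. The data and the function names `fit_mean`, `fit_line` and `kfold_mse` are hypothetical.

```python
from statistics import mean


def fit_mean(xs, ys):
    """Constant model: always predict the training mean."""
    m = mean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Least-squares straight line through the training points."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def kfold_mse(fit, xs, ys, k=5):
    """Average held-out squared error over k interleaved folds."""
    errors = []
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        held = [i for i in range(len(xs)) if i % k == fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors.append(mean((model(xs[i]) - ys[i]) ** 2 for i in held))
    return mean(errors)


# Hypothetical data: a linear trend plus a small alternating disturbance.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(xs)]

cv_mean = kfold_mse(fit_mean, xs, ys)
cv_line = kfold_mse(fit_line, xs, ys)
```

The model with the lower cross-validated error is preferred. Note that interleaved folds let the model train on observations that come after the held-out ones, so for time-ordered financial data a split that only trains on the past is usually the safer choice.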

If we are dealing with time series models that have different numbers of parameters, we could also compare the log likelihood and AIC, or apply a statistical test to check which model is better.

For example, I did a project that asked me to estimate the parameters of several GARCH-type models to predict volatility, and I used the Schwarz Bayesian Criterion, which penalizes the log likelihood for the number of parameters, to choose the best models.
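Both criteria are simple functions of the maximized log likelihood: AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, where k is the number of parameters and n the number of observations. The sketch below applies them to two candidate models. The log likelihoods, parameter counts and sample size are made-up numbers for illustration, not results from the project described above.

```python
import math


def aic(log_lik, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_lik


def bic(log_lik, n_params, n_obs):
    """Schwarz Bayesian criterion: lower is better."""
    return n_params * math.log(n_obs) - 2 * log_lik


# Hypothetical fitted results: (maximized log likelihood, parameter count).
candidates = {
    "GARCH(1,1)": (-1402.3, 3),
    "GARCH(2,2)": (-1400.9, 5),
}
n_obs = 1000  # assumed sample size

best = min(candidates, key=lambda name: bic(*candidates[name], n_obs))
```

In this made-up case the larger model fits slightly better, but not by enough to pay for its two extra parameters, so the criterion picks the smaller one. Because the BIC penalty grows with ln(n), it favors small models more strongly than AIC once the sample has more than about 8 observations.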

In practice, many people tend to use simple models, such as linear models, to make sure that they do not overfit and do not lose intuition in the analysis.

Linear models are also a key step in other areas. For example, in an active portfolio management project I wanted to compute a signal called the residual reversion signal, so I ran a linear regression and used its residuals to design the signal.
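The text does not spell out how the signal was built, so the following is only one plausible sketch, under the assumption that a stock's returns are regressed on market returns and that the signal bets against the standardized residual. The return series and the names `ols_residuals` and `signal` are hypothetical.

```python
from statistics import mean, pstdev


def ols_residuals(x, y):
    """Residuals of a least-squares regression of y on x with an intercept."""
    mx, my = mean(x), mean(y)
    beta = (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))
    alpha = my - beta * mx
    return [b - (alpha + beta * a) for a, b in zip(x, y)]


# Hypothetical daily returns for the market and for one stock.
market = [0.010, -0.004, 0.006, -0.012, 0.003, 0.008]
stock = [0.014, -0.002, 0.005, -0.020, 0.004, 0.011]

resid = ols_residuals(market, stock)

# Assumed reversion rule: a positive residual (the stock outperformed what
# the market explains) gives a negative signal, and vice versa.
sigma = pstdev(resid)
signal = [-r / sigma for r in resid]
```

This sketch fits the regression on the full sample for brevity. A signal meant for trading would have to be computed from data available at each point in time.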

However, the challenge with linear models is that they are generally weak predictors. To achieve more precise predictions with linear models or other simple models, we might need other techniques.

I know two such techniques: bagging and boosting. In bagging, we use the bootstrap to draw many training sets, say k of them, build one model on each, and then average the results of the k models to get the final result.

Bagging reduces variance, which helps guard against overfitting.
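The bagging procedure can be sketched as follows. The base learner here is a 1-nearest-neighbor regressor, chosen only because it is short and has high variance; the data and the names `fit_1nn` and `bagged_fit` are hypothetical.

```python
import random
from statistics import mean


def fit_1nn(sample):
    """High-variance base learner: predict the target of the nearest point."""
    return lambda x: min(sample, key=lambda p: abs(p[0] - x))[1]


def bagged_fit(data, k, rng):
    """Fit one base learner per bootstrap resample and average their outputs."""
    models = []
    for _ in range(k):
        # Bootstrap: draw len(data) points from data with replacement.
        resample = [rng.choice(data) for _ in data]
        models.append(fit_1nn(resample))
    return lambda x: mean(m(x) for m in models)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical data: y = x plus noise.
data = [(0.1 * i, 0.1 * i + rng.gauss(0, 0.2)) for i in range(50)]

single = fit_1nn(data)                    # one model on the original sample
bagged = bagged_fit(data, k=25, rng=rng)  # average of 25 bootstrap models
```

The single model reproduces every noisy training target exactly, while the bagged model averages over neighbors that differ from resample to resample, which smooths the prediction.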

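Random forest is an extension of bagging that also considers only a random subset of the predictors at each split. This keeps a few very strong predictors from dominating every tree, which would otherwise make the trees highly correlated, although restricting the predictors can increase the bias of each individual tree.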

Boosting also combines many models, but unlike bagging it uses a single training set instead of k bootstrap sets, and the regressors or classifiers are generated sequentially, each one concentrating on the errors of the previous ones. The final result is a weighted combination of them. Boosting can reduce bias, but it might overfit.
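A minimal sketch of the sequential idea is given below. It implements gradient boosting with squared loss and one-split regression stumps, which is one member of the boosting family (AdaBoost, for instance, reweights observations instead of fitting residuals). The data, the function names and the `rounds` and `rate` settings are hypothetical.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Best single-split predictor of the residuals, by least squares."""
    best = None
    for split in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < split]
        right = [r for x, r in zip(xs, residuals) if x >= split]
        lm, rm = mean(left), mean(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, split, lm, rm)
    _, split, lm, rm = best
    return lambda x: lm if x < split else rm


def boost(xs, ys, rounds=50, rate=0.1):
    """Sequentially fit stumps to the residuals of the current ensemble."""
    base = mean(ys)
    stumps = []

    def predict(x):
        # Weighted sum: every stump contributes with the same small weight.
        return base + rate * sum(s(x) for s in stumps)

    for _ in range(rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


# Hypothetical data: a curve that a single stump cannot capture.
xs = [0.1 * i for i in range(40)]
ys = [x ** 2 for x in xs]

model = boost(xs, ys)
```

Each stump on its own is a weak, high-bias model, and the bias falls as more rounds are added. Training error keeps falling with every round, so on noisy data the number of rounds and the rate are typically chosen on held-out data to limit overfitting.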

In conclusion, I think the challenges of using machine learning for investment management fall into two classes: problems with financial data and challenges with the algorithms. The two also influence each other.

Written by Yifan Wei & Edited by Calvin Ma & Alexander Fleiss
