Machine Learning Automation Of Loan Performance Prediction

Can we predict loan performance with machine learning automation? Many banks around the world have lowered their headcount, leaving pools of outstanding loans with fewer employees overseeing them. The banks therefore need an alternative way to monitor their potentially bad loans. The mortgage market is a gigantic environment with good data to process.

Based on 200,000 observations with 35 features, we predict NMONTHS, the number of months until the mortgage is taken off the books, and FORECLOSED, a Boolean variable indicating whether the mortgage was foreclosed, for the test dataset.

The project separates into three main sections: data preprocessing, regression for NMONTHS, and classification for FORECLOSED. The regression section includes three models: OLS linear regression, ridge regression, and random forest. The classification section uses logistic regression.

2 Data Preprocessing

The original training set has 35 features. They separate into numerical and categorical variables. Categorical: CHNL, SELLER, ORIGDATE, FSTPAY, MATDT, FIRST FLAG, PURPOSE, PROP, NO_UNITS, OCCSTAT, STATE, MSA, ZIP, IO, MODFLAG, MITYPE, RELMORTGIND, ACTPER_MO 


It is worth noting that NO_UNITS, MSA, ZIP, and ACTPER_MO were given as numerical variables. Because the number of units describes a type of house, and after comparing regression results, we treat NO_UNITS as categorical. MSA and ZIP are codes representing areas and are therefore treated as categorical variables. Similarly, ACTPER_MO takes only three distinct values, representing months, and is also treated as categorical to provide more information.

2.1 Missing Values 

Variables with more than 20 percent of their data missing are dropped. For categorical variables, missing values are filled as UNKNOWN; for numerical variables, missing values are filled with the median.
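A minimal sketch of this imputation step in pandas, using a toy frame with hypothetical column names (the real dataset and its exact columns are not reproduced here):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the loan data: one categorical column, one numerical
# column, and one column with more than 20% of its values missing.
df = pd.DataFrame({
    "STATE": ["CA", None, "NY", "CA", "TX"],       # categorical
    "OLTV": [80.0, np.nan, 75.0, 90.0, 60.0],      # numerical
    "MITYPE": [np.nan, np.nan, np.nan, np.nan, 1.0],  # 80% missing -> drop
})

# Drop columns with more than 20 percent missing values.
thresh = 0.20
df = df.loc[:, df.isna().mean() <= thresh]

# Fill categorical NaNs with the sentinel "UNKNOWN",
# and numerical NaNs with the column median.
for col in df.columns:
    if df[col].dtype == object:
        df[col] = df[col].fillna("UNKNOWN")
    else:
        df[col] = df[col].fillna(df[col].median())
```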

2.2 High Correlation Variables 

The correlation matrix is calculated for all numerical variables. The heat map in Figure 1 shows that ORIGTERM, REMMNTHS, ADJRMTHS, and MATDT_MO are strongly correlated, so we keep only ORIGTERM. ORIGDATE_MO and FSTPAY_MO also have a high correlation; only FSTPAY_MO is kept.

Figure 1: Correlation Matrix 
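The pruning step can be sketched as follows; the 0.9 threshold, the synthetic columns, and the drop rule (discard the later column of each highly correlated pair) are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
origterm = rng.normal(360, 30, n)
df = pd.DataFrame({
    "ORIGTERM": origterm,
    "REMMNTHS": origterm - rng.normal(24, 2, n),  # nearly collinear stand-in
    "OLTV": rng.uniform(50, 95, n),               # independent stand-in
})

# Absolute correlation matrix; keep only the upper triangle so each
# pair is considered once.
corr = df.corr().abs()
threshold = 0.9
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the later column of every pair above the threshold.
to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
reduced = df.drop(columns=to_drop)
```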

2.3 More Data Exploration 

This leaves 21 features, 9 of them numerical and the rest categorical. Further data exploration appears under the Regression and Classification sections, since the different goals require the variables to be handled differently.

2.4 Validation Set 

For each model, 1/3 of the observations are randomly set aside for validation, used to estimate the Mean Absolute Error (MAE) for the NMONTHS prediction and the TPR and FPR for the FORECLOSED classification.
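The hold-out split can be sketched with a random permutation; the size matches the 200,000-observation training set, but no real data are involved:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000                      # size of the training set

# Randomly hold out one third of the observations for validation;
# the remaining two thirds are used for fitting.
idx = rng.permutation(n)
n_val = n // 3
val_idx, train_idx = idx[:n_val], idx[n_val:]
```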

3 Regression for NMONTHS 

OLS linear regression and ridge regression both look for a linear relationship between the response and the predictors. The advantage of ridge regression is that it adds a penalty coefficient α that shrinks large coefficients. Even though the OLS estimator has the minimal variance among unbiased linear estimators, the ridge estimator, though biased, can achieve a lower mean squared error than OLS by tuning α. In other words, ridge regression tends to give a more stable prediction, because it can choose the optimal model among models with similar MSE. To capture non-linear relationships between the response and predictors, we also include random forest. However, the validation results from the three models do not differ significantly, so, considering time efficiency, we choose ridge regression to predict NMONTHS in the end.

Machine Learning Automation Of Loan Performance Prediction Written by Yiyi Xu 

3.1 More Data Transformation 

The histogram in Figure 2a presents the distribution of NMONTHS. The dataset clearly has a heavy tail; after a log transformation the data become distributed as in Figure 2b, much closer to a normal distribution. The transformed NMONTHS is then plotted against each of the remaining 9 numerical variables to get insight into feature importance. As shown in Figure 3, the integer variables LOANAGE, NUMBO, and DLQSTATUS do not show strong trends, so we transform them into categorical variables to provide more information. In fact, this transformation improved all models.

(a) Original Distribution (b) After Log Transformation 

Figure 2: NMONTHS Distribution 
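The effect of the log transform on a heavy tail can be illustrated with synthetic, lognormally distributed stand-in data (the real NMONTHS values are not reproduced here); sample skewness drops to near zero after the transform:

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in for NMONTHS: heavy-tailed, strictly positive months-on-book.
nmonths = rng.lognormal(mean=3.0, sigma=0.8, size=10_000)

# Log-transform to pull in the heavy right tail.
log_nmonths = np.log(nmonths)

def skew(x):
    """Sample skewness: third standardized central moment."""
    c = x - x.mean()
    return (c**3).mean() / (c**2).mean() ** 1.5

skew_raw = skew(nmonths)       # strongly right-skewed
skew_log = skew(log_nmonths)   # approximately symmetric
```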

3.2 Linear Regression(OLS) 

The first attempt used the 9 remaining numerical variables and the other 12 categorical variables. We remove the following variables from the model by p-value: ORIGDATE_MO, FSTPAY_MO, NUMBO, DLQSTATUS, ORIGTERM, MATDT_MO. This model returns a Mean Absolute Error (MAE) of about 0.45173 on the log-transformed data and 13.7957 on the original scale, which is 46.7744% of the mean. When the transformed variables are added, the MAE improves to 13.5543. The predictions are plotted as true response against predicted response in Figure 4. In addition, when plotted against the mean grouped by each variable, the square root of ORIGUPB and the natural log of OLTV and DTI show more linearity.
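A sketch of fitting OLS on the log-transformed response and reporting MAE on both scales, using synthetic data and hypothetical coefficients (not the paper's fitted model):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000
X = rng.normal(size=(n, 3))
# Hypothetical log-linear relation standing in for log(NMONTHS).
log_y = 3.0 + X @ np.array([0.3, -0.2, 0.1]) + rng.normal(0, 0.4, n)
y = np.exp(log_y)

# OLS on the log-transformed response via least squares.
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, log_y, rcond=None)

# MAE on the log scale, and on the original scale after back-transform.
pred_log = A @ beta
mae_log = np.abs(pred_log - log_y).mean()
mae_orig = np.abs(np.exp(pred_log) - y).mean()
```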

3.3 Ridge Regression 

Essentially, ridge regression is an improved version of OLS, so we adopt the same predictors and tune the penalty coefficient α. Figure 8 shows that the minimal MAE of 13.553 is obtained at α = 14.213.

An interesting phenomenon here is that, theoretically, ridge regression, unlike OLS, is not scale invariant. However, comparing the model before and after standardization, the unstandardized version performs better.
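The α sweep can be sketched with scikit-learn's `Ridge`, assuming it is available; the grid and synthetic data are illustrative, not the values behind Figure 8:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n = 4_000
X = rng.normal(size=(n, 10))
y = X @ rng.normal(size=10) + rng.normal(0, 1.0, n)

# Hold out one third for validation, mirroring the split in Section 2.4.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=1/3,
                                            random_state=0)

# Sweep the penalty coefficient alpha and keep the best validation MAE.
alphas = np.logspace(-2, 3, 30)
maes = []
for a in alphas:
    model = Ridge(alpha=a).fit(X_tr, y_tr)
    maes.append(mean_absolute_error(y_val, model.predict(X_val)))

best_alpha = alphas[int(np.argmin(maes))]
best_mae = min(maes)
```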

Figure 3: NMONTHS vs. Numerical Variables 

Figure 4: OLS True vs. Predicted 

(a) Original (b) After Square Root Transformation 

Figure 5: ORIGUPB 

(a) Original (b) After Log Transformation 

Figure 6: OLTV 

3.4 Random Forest 

As an ensemble of decision trees, random forest is robust to non-linear relationships. The variables are standardized before modeling, although tree-based models are largely insensitive to variable scales. The first attempt used all numerical and categorical data. Tuning the number of trees, the minimal MAE of 13.9 is obtained with 30 trees. We then rank the features by importance; however, fitting a random forest on only the top features did not improve the MAE.

Another reason for not choosing random forest is that its predictions can never exceed the maximum target value seen in the training set. Since our data has a heavy tail, random forest is not adopted.
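This no-extrapolation property is easy to demonstrate: a forest trained on targets in [0, 50] cannot predict beyond that range, however extreme the inputs. A sketch with scikit-learn and synthetic data, assuming scikit-learn is available:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
X_tr = rng.uniform(0, 10, size=(2_000, 1))
y_tr = 5.0 * X_tr[:, 0]          # target ranges over [0, 50]

rf = RandomForestRegressor(n_estimators=30, random_state=0).fit(X_tr, y_tr)

# Ask for predictions well outside the training range.
X_far = np.array([[20.0], [100.0]])
preds = rf.predict(X_far)

# Tree ensembles average training-leaf targets, so predictions stay
# inside the range of y seen during training.
```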

4 Classification for FORECLOSED 

The training set has severely unbalanced classes: only 2,309 out of 200,000 observations have FORECLOSED=True. By bootstrapping, the True class is brought almost equal in size to the False class; fed these samples, logistic regression produces a more accurate model.

(a) Original (b) After Log Transformation 

Figure 7: DTI 

Figure 8: MAE vs. Alpha

4.1 Bootstrap Resampling 

The True class is too small to train models directly on the original training set, so we bootstrap. First, 1/3 of the original training set is randomly set aside; this ensures the validation set has a distribution similar to the real test set. The rest becomes the new training set. We then pick out all (roughly 1,300) True-class observations in the new training set and randomly draw 120,000 samples from them with replacement, which balances the two classes.
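A sketch of the oversampling step with a synthetic, roughly 1%-positive label vector; the real counts (roughly 1,300 positives drawn up to 120,000) are replaced by toy sizes:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10_000
y = (rng.random(n) < 0.01).astype(int)      # ~1% positive class
X = rng.normal(size=(n, 4))

pos_idx = np.flatnonzero(y == 1)
neg_idx = np.flatnonzero(y == 0)

# Draw positives with replacement until the classes are balanced,
# then stitch the balanced training set together.
boot_pos = rng.choice(pos_idx, size=len(neg_idx), replace=True)
bal_idx = np.concatenate([neg_idx, boot_pos])
X_bal, y_bal = X[bal_idx], y[bal_idx]
```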

(a) Logistic-ROC Curve 

(b) Confusion Matrix 

Figure 9: Logistic Regression 

5 Prediction and Estimation 

The estimated TPR is 0.89 and the FPR is 0.51.
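For reference, TPR and FPR are computed from the confusion-matrix counts as TP/(TP+FN) and FP/(FP+TN); a tiny worked example with hypothetical labels, not the paper's validation results:

```python
import numpy as np

# Hypothetical validation labels and predictions, for illustration only.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

tp = ((y_true == 1) & (y_pred == 1)).sum()
fn = ((y_true == 1) & (y_pred == 0)).sum()
fp = ((y_true == 0) & (y_pred == 1)).sum()
tn = ((y_true == 0) & (y_pred == 0)).sum()

tpr = tp / (tp + fn)   # sensitivity / recall on the True class
fpr = fp / (fp + tn)   # share of False-class observations misflagged
```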
