How is AI used in lending?
Let’s take a look at Lending Club!
A borrower applies by filling in personal information on the platform. Based on this information, the platform evaluates the applicant and decides whether to grant a loan and how much to lend. We obtained loan data from this platform for the first quarter of 2017 from the Internet. With this data, we aim to build a model that learns from historical data to predict whether to lend to new applicants, while maximizing the company's profits. Since this is a binary classification problem, I decided to build some SKlearn models based on logistic regression.
Description:
This report contains two main parts. The first is data pre-processing: since real data is messy, we need to clean it and select some valuable predictors. The second part covers model fitting. I built three models in total, and their advantages and disadvantages are explained in detail in that part.
Principle:
We first define the four possible outcomes of a prediction, written as (actual, predicted).
The first class is called a False positive (0, 1): the borrower will not pay back (marked as 0), but we predict to lend (marked as 1).
The second class is called a True positive (1, 1): the borrower will pay back (marked as 1), and we predict to lend (marked as 1).
The third class is called a True negative (0, 0): the borrower will not pay back (marked as 0), and we do not lend (marked as 0).
The fourth class is called a False negative (1, 0): the borrower will pay back (marked as 1), and we do not lend (marked as 0).
The first class is a bad loan: the company does not collect the loan receivable and loses money. The second class means the company collects the loan receivable and makes a profit. In the third class the company does not lend, so there is no profit and no loss. The fourth class is a potentially profitable opportunity that was not seized; on the books there is again no profit and no loss. For the company to be profitable, the proportion of money-making decisions must exceed the proportion of money-losing ones. I therefore define the fall-out = False positives / (False positives + True negatives), which is the proportion of bad borrowers we wrongly lend to, i.e. the money-losing predictions; we want the fall-out to be as small as possible. I also define the recall = True positives / (True positives + False negatives), which is the proportion of good borrowers we correctly lend to, i.e. the money-making predictions; we want the recall to be as large as possible. We use these two indicators to measure the profitability of the company and how good the model is.
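For concreteness, here is a minimal sketch of the two metrics computed in Python from hypothetical confusion-matrix counts (the numbers below are made up purely for illustration):

# Hypothetical confusion-matrix counts, for illustration only
tp, fp, tn, fn = 900, 300, 700, 100

recall = tp / (tp + fn)      # proportion of good borrowers we correctly lend to
fall_out = fp / (fp + tn)    # proportion of bad borrowers we wrongly lend to

print("recall (true positive rate) =", recall)       # 0.9
print("fall-out (false positive rate) =", fall_out)  # 0.3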
Part 1. Exploratory data analysis (data pre-processing) with predictor selection
import pandas as pd
loans_2017 = pd.read_csv('LoanStats_2017Q1.csv', skiprows=1)
half_data = len(loans_2017) // 2
loans_2017 = loans_2017.dropna(thresh=half_data, axis=1)
loans_2017 = loans_2017.drop_duplicates()
print(loans_2017.iloc[0])
print(loans_2017.shape[1])
funded_amnt_inv 3600
term 36 months
int_rate 7.49%
installment 111.97
grade A
sub_grade A4
emp_title Code/Compliance Inspector
emp_length 10+ years
home_ownership MORTGAGE
annual_inc 120000
verification_status Not Verified
issue_d Mar-2017
loan_status Issued
pymnt_plan n
purpose other
title Other
zip_code 467xx
addr_state IN
dti 18.9
delinq_2yrs 0
#Import pandas and load the data set.
#Reduce the data size: drop columns that are mostly NA and drop duplicate rows.
#Show the first row of the data and count the number of columns.
loans_2017 = loans_2017[["loan_amnt", "term", "int_rate", "installment", "emp_length",
                         "home_ownership", "annual_inc", "verification_status", "loan_status",
                         "pymnt_plan", "purpose", "title", "addr_state", "dti", "delinq_2yrs",
                         "earliest_cr_line", "inq_last_6mths", "open_acc", "pub_rec", "revol_bal",
                         "revol_util", "total_acc", "last_credit_pull_d", "policy_code",
                         "pub_rec_bankruptcies"]]
print(loans_2017.iloc[0])
print(loans_2017.shape[1])
loan_amnt 3600
term 36 months
int_rate 7.49%
installment 111.97
emp_length 10+ years
home_ownership MORTGAGE
annual_inc 120000
verification_status Not Verified
loan_status Issued
pymnt_plan n
purpose other
title Other
addr_state IN
dti 18.9
delinq_2yrs 0
earliest_cr_line Aug-1992
inq_last_6mths 1
open_acc 18
pub_rec 1
revol_bal 5658
#Based on the data dictionary and common sense, select some potential predictors.
#Remove variables that are highly correlated with others, since they represent almost the same feature.
#Remove variables that are only known after the loan is funded, such as "funded_amnt_inv".
#Remove variables that have no relationship with the prediction, such as "zip_code".
print(loans_2017['loan_status'].value_counts())
Current 78897
Issued 15071
Fully Paid 2251
In Grace Period 330
Late (31-120 days) 126
Late (16-30 days) 104
Name: loan_status, dtype: int64
#Show the loan status values in order to create binary labels.
#Count the number of occurrences of each status category.
loans_2017 = loans_2017[(loans_2017['loan_status'] == "Fully Paid") |
                        (loans_2017['loan_status'] == "Late (31-120 days)") |
                        (loans_2017['loan_status'] == "In Grace Period") |
                        (loans_2017['loan_status'] == "Late (16-30 days)")]
status_replace = {
    "loan_status": {
        "Fully Paid": 1,
        "Late (31-120 days)": 0,
        "In Grace Period": 0,
        "Late (16-30 days)": 0,
    }
}
loans_2017 = loans_2017.replace(status_replace)
#Use "Fully Paid" as our label that represents lending, marked as 1.
#Use "Late (31-120 days)", "In Grace Period", and "Late (16-30 days)" as our label for refusing to lend, marked as 0.
orig_columns = loans_2017.columns
drop_columns = []
for col in orig_columns:
col_series = loans_2017[col].dropna().unique()
if len(col_series) == 1:
drop_columns.append(col)
loans_2017 = loans_2017.drop(drop_columns, axis=1)
print(drop_columns)
print(loans_2017.shape)
loans = loans_2017
['pymnt_plan', 'policy_code']
(2811, 23)
#Drop some columns that contain the same value in each row.
print(loans)
(partial view of the cleaned dataframe; columns shown: annual_inc, verification_status, loan_status, purpose)
96351 73000.0 Verified 0 debt_consolidation
96353 37000.0 Verified 1 debt_consolidation
96390 50000.0 Verified 1 debt_consolidation
96393 121000.0 Not Verified 0 debt_consolidation
96415 48000.0 Not Verified 1 other
96459 40000.0 Verified 0 debt_consolidation
96484 60000.0 Source Verified 1 debt_consolidation
96498 125000.0 Not Verified 1 small_business
96500 92000.0 Verified 0 debt_consolidation
96502 80000.0 Source Verified 1 debt_consolidation
96504 38560.0 Not Verified 1 home_improvement
96508 285000.0 Source Verified 1 credit_card
96523 85000.0 Source Verified 1 major_purchase
96538 200000.0 Source Verified 1 credit_card
96553 50000.0 Not Verified 1 debt_consolidation
96555 81000.0 Verified 1 moving
96573 75000.0 Source Verified 1 credit_card
96635 36000.0 Not Verified 1 debt_consolidation
96671 90000.0 Source Verified 1 debt_consolidation
#Check the cleaned dataframe
null_counts = loans.isnull().sum()
print(null_counts)
emp_length 166
home_ownership 0
annual_inc 0
verification_status 0
loan_status 0
purpose 0
title 0
addr_state 0
dti 0
delinq_2yrs 0
earliest_cr_line 0
inq_last_6mths 0
open_acc 0
pub_rec 0
revol_bal 0
revol_util 1
total_acc 0
last_credit_pull_d 0
pub_rec_bankruptcies 0
dtype: int64
#Check for missing data and decide whether to drop those rows or columns. There are 166 missing values in the "emp_length" column and 1 in the "revol_util" column. The loss is small and these features are important to the model, so we keep the columns and drop the affected rows instead.
loans = loans.dropna(axis=0)
print(loans.dtypes.value_counts())
float64 11
object 11
int64 1
dtype: int64
#Drop the rows containing NAs, then show the data types in the dataframe.
object_columns_df = loans.select_dtypes(include=["object"])
print(object_columns_df.iloc[0])
term 36 months
int_rate 14.99%
emp_length 3 years
home_ownership OWN
verification_status Source Verified
purpose home_improvement
title Home improvement
addr_state AZ
earliest_cr_line Sep-2007
revol_util 9.2%
last_credit_pull_d Apr-2017
Name: 124, dtype: object
#Select the columns with categorical variables.
#We need to map some selected terms to numeric variables later, since SKlearn only accepts numeric variables.
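(As an aside, an alternative to the integer mappings used in the following cells would be one-hot encoding the categorical columns. The sketch below only illustrates that option, assuming the cleaned loans dataframe from above; it is not what this report uses.)

# Sketch (not used in this report): one-hot encode some categorical columns
# instead of mapping them to integers.
categorical_cols = ["home_ownership", "verification_status", "purpose"]
encoded = pd.get_dummies(loans, columns=categorical_cols)
print(encoded.shape)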
columns = ['term', 'emp_length', 'home_ownership', 'verification_status', 'addr_state']
for i in columns:
    print(loans[i].value_counts())
print(loans["purpose"].value_counts())
print(loans["title"].value_counts())
36 months 2011
60 months 633
Name: term, dtype: int64
10+ years 1033
2 years 275
3 years 215
< 1 year 185
1 year 182
5 years 170
4 years 163
8 years 115
9 years 107
6 years 105
7 years 94
Name: emp_length, dtype: int64
MORTGAGE 1287
RENT 970
OWN 373
ANY 14
Name: home_ownership, dtype: int64
Source Verified 1154
Not Verified 756
Verified 734
Name: verification_status, dtype: int64
CA 418
TX 243
NY 209
FL 173
NJ 112
IL 106
NC 89
GA 83
MD 78
OH 72
PA 71
VA 71
WA 65
CO 62
MI 53
NV 52
MA 52
AZ 48
MN 44
MO 42
WI 40
CT 38
IN 34
TN 33
SC 33
AL 31
LA 31
KY 24
OK 24
OR 22
NM 21
UT 21
NH 20
HI 18
AR 18
KS 14
NE 13
DE 10
RI 10
MS 8
ND 7
MT 5
SD 5
ME 5
AK 5
WY 4
VT 3
DC 2
ID 2
Name: addr_state, dtype: int64
debt_consolidation 1379
credit_card 467
other 251
home_improvement 213
major_purchase 82
car 53
medical 45
small_business 41
house 40
moving 39
vacation 32
renewable_energy 2
Name: purpose, dtype: int64
Debt consolidation 1378
Credit card refinancing 468
Other 251
Home improvement 213
Major purchase 82
Car financing 53
Medical expenses 45
Business 41
Home buying 40
Moving and relocation 39
Vacation 32
Green loan 2
Name: title, dtype: int64
#Show the value counts for each selected categorical column. The "purpose" column and the "title" column encode almost the same information, so I decided to drop the "title" column (a quick check is sketched below).
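A quick way to confirm that overlap is to cross-tabulate the two columns; a short sketch, assuming the loans dataframe from above:

# Sketch: check that each "purpose" value maps to (almost) one "title" value.
overlap = pd.crosstab(loans["purpose"], loans["title"])
print((overlap > 0).sum(axis=1))   # number of distinct titles per purpose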
mapping_replace = {
    "emp_length": {
        "n/a": 0,
        "< 1 year": 0,
        "1 year": 1,
        "2 years": 2,
        "3 years": 3,
        "4 years": 4,
        "5 years": 5,
        "6 years": 6,
        "7 years": 7,
        "8 years": 8,
        "9 years": 9,
        "10+ years": 10,
    }
}
loans = loans.drop(["last_credit_pull_d", "earliest_cr_line", "addr_state", "title"], axis=1)
loans["int_rate"] = loans["int_rate"].str.rstrip("%").astype("float")
loans["revol_util"] = loans["revol_util"].str.rstrip("%").astype("float")
loans = loans.replace(mapping_replace)
#Remove some columns that are not useful for prediction and strip the percent sign (%) from some variables.
#Convert the categorical "emp_length" column to a numerical variable.
mapping_replace2 = {
    "term": {
        " 36 months": 1,
        " 60 months": 2,
    }
}
loans = loans.replace(mapping_replace2)
#Convert the categorical "term" column to a numerical variable.
mapping_replace3 = {
    "home_ownership": {
        "MORTGAGE": 1,
        "RENT": 2,
        "OWN": 3,
        "ANY": 4,
    }
}
loans = loans.replace(mapping_replace3)
#Convert the categorical "home_ownership" column to a numerical variable.
mapping_replace4 = {
    "verification_status": {
        "Source Verified": 1,
        "Not Verified": 0,
        "Verified": 2,
    }
}
loans = loans.replace(mapping_replace4)
#Convert the categorical "verification_status" column to a numerical variable.
mapping_replace5 = {
    "purpose": {
        "debt_consolidation": 1,
        "credit_card": 2,
        "other": 3,
        "home_improvement": 4,
        "major_purchase": 5,
        "car": 6,
        "medical": 7,
        "small_business": 8,
        "house": 9,
        "moving": 10,
        "vacation": 11,
        "renewable_energy": 12,
    }
}
loans = loans.replace(mapping_replace5)
#Convert the categorical "purpose" column to a numerical variable.
print(loans.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2644 entries, 124 to 96742
Data columns (total 19 columns):
loan_amnt 2644 non-null float64
term 2644 non-null int64
int_rate 2644 non-null float64
installment 2644 non-null float64
emp_length 2644 non-null int64
home_ownership 2644 non-null int64
annual_inc 2644 non-null float64
verification_status 2644 non-null int64
loan_status 2644 non-null int64
purpose 2644 non-null object
dti 2644 non-null float64
delinq_2yrs 2644 non-null float64
inq_last_6mths 2644 non-null float64
open_acc 2644 non-null float64
pub_rec 2644 non-null float64
revol_bal 2644 non-null float64
revol_util 2644 non-null float64
total_acc 2644 non-null float64
pub_rec_bankruptcies 2644 non-null float64
dtypes: float64(13), int64(5), object(1)
memory usage: 413.1+ KB
None
#Now these predictors have been converted into numerical variables, so we can use SKlearn to build machine learning models.
loans.to_csv('cleaned_loans2017.csv', index=False)
#Save cleaned data in a new csv document
Part 2. Model fitting and analysis
import pandas as pd
loans = pd.read_csv("cleaned_loans2017.csv")
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
cols = loans.columns
train_cols = cols.drop("loan_status")
features = loans[train_cols]
target = loans["loan_status"]
lr.fit(features, target)
predictions = lr.predict(features)
#First, use logistic regression to test the performance of the model. Logistic regression is a standard method for two-class problems, and ours is a two-class problem: we have to decide whether to lend money. Note that we are doing classification with logistic regression, not regression. We prepare the features and the labels (the training and testing sets come from cross-validation below), and then we can train the model.
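For reference, here is a minimal sketch of the same idea with an explicit held-out test set, written against the newer sklearn.model_selection API (the cells below use the older sklearn.cross_validation module this notebook was built with); the split sizes are illustrative:

# Sketch: train on one split, evaluate on a held-out split.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.3, random_state=1, stratify=target)
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
test_predictions = lr.predict(X_test)
print("held-out accuracy =", (test_predictions == y_test).mean())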
### Model 1
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import cross_val_predict, KFold
lr = LogisticRegression()
Kf = KFold(features.shape[0], random_state=1)
predictions = cross_val_predict(lr, features, target, cv=Kf)
predictions = pd.Series(predictions)
# 4 Types results:
false_positives_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp_counts = len(predictions[false_positives_filter])
true_positives_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp_counts = len(predictions[true_positives_filter])
false_negative_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn_counts = len(predictions[false_negative_filter])
true_negative_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn_counts = len(predictions[true_negative_filter])
high_recall = tp_counts / float((fn_counts+tp_counts))
low_fallout = fp_counts / float((tn_counts+fp_counts))
print("true positives rate =", high_recall)
print("false positives rate =", low_fallout)
print(predictions[:20])
# high_recall = true positives rate
# low_fallout = false positives rate
### not good
true positives rate = 1.0
false positives rate = 1.0
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 1
10 1
11 1
12 1
13 1
14 1
15 1
16 1
17 1
18 1
19 1
dtype: int64
#Using the SKlearn cross-validation library, we generate predictions for every sample. We then compare the predicted values with the actual values and calculate the true positive rate and the false positive rate. The true positive rate is 100%: every borrower who will pay back is given a loan, so we profit from all of those loans, which is what we want to see.
However, the false positive rate is also 100%, which is what we do not want to see: every borrower who will not pay back is also given a loan. Those loans go bad and we lose money. With both rates at 100%, the model simply lends to everyone regardless of who they are, which is meaningless, so this model cannot make useful predictions.
One likely reason for this problem is class imbalance: the vast majority of samples in our data are labeled 1 and very few are labeled 0, which can lead the classifier to predict 1 for everything.
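A quick sanity check of that suspicion is to look at the label proportions directly:

# Check how imbalanced the labels are.
print(loans["loan_status"].value_counts())
print(loans["loan_status"].value_counts(normalize=True))  # class proportions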
Next, we improve the model and propose some solutions.
### Model 2
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import cross_val_predict, KFold
lr = LogisticRegression(class_weight="balanced")
Kf = KFold(features.shape[0], random_state=1)
predictions = cross_val_predict(lr, features, target, cv=Kf)
predictions = pd.Series(predictions)
# 4 Types results:
false_positives_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp_counts = len(predictions[false_positives_filter])
true_positives_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp_counts = len(predictions[true_positives_filter])
false_negative_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn_counts = len(predictions[false_negative_filter])
true_negative_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn_counts = len(predictions[true_negative_filter])
high_recall = tp_counts / float((fn_counts+tp_counts))
low_fallout = fp_counts / float((tn_counts+fp_counts))
print("true positives rate =", high_recall)
print("false positives rate =", low_fallout)
print(predictions[:20])
# high_recall = true positives rate
# low_fallout = false positives rate
### better
true positives rate = 0.6221804511278195
false positives rate = 0.43992248062015504
0 1
1 0
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 0
10 1
11 1
12 0
13 1
14 0
15 1
16 1
17 1
18 1
19 1
dtype: int64
#One way to improve this model is to assign class weights to the samples (penalties). Increasing the weight of the negative samples weakens the influence of the positive samples on the model and strengthens the influence of the negative samples. We change the logistic regression parameter to class_weight="balanced", which adjusts the weights of the positive and negative classes automatically. The true positive rate is now about 62.2% and the false positive rate about 44%, so the model has improved and the predictions are meaningful. However, a true positive rate of 62.2% is still relatively low; to maximize profit, we want a higher true positive rate and a lower false positive rate.
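For reference, the weights implied by class_weight="balanced" can be inspected directly with scikit-learn's compute_class_weight utility; a small sketch, assuming the features and target defined above:

# Sketch: "balanced" assigns each class the weight n_samples / (n_classes * class_count),
# so the minority class (0) receives the larger weight.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

classes = np.unique(target)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=target)
print(dict(zip(classes, weights)))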
### Model 3 : Customized weights
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import cross_val_predict, KFold
Customized_weights = {
0: 4,
1: 1,
}
lr = LogisticRegression(class_weight=Customized_weights)
Kf = KFold(features.shape[0], random_state=1)
predictions = cross_val_predict(lr, features, target, cv=Kf)
predictions = pd.Series(predictions)
# 4 Types results:
false_positives_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp_counts = len(predictions[false_positives_filter])
true_positives_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp_counts = len(predictions[true_positives_filter])
false_negative_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn_counts = len(predictions[false_negative_filter])
true_negative_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn_counts = len(predictions[true_negative_filter])
high_recall = tp_counts / float((fn_counts+tp_counts))
low_fallout = fp_counts / float((tn_counts+fp_counts))
print("true positives rate =", high_recall)
print("false positives rate =", low_fallout)
print(predictions[:20])
# high_recall = true positives rate
# low_fallout = false positives rate
true positives rate = 0.6282894736842105
false positives rate = 0.4689922480620155
0 1
1 0
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 0
10 1
11 1
12 0
13 1
14 0
15 1
16 1
17 1
18 1
19 1
dtype: int64
#We can also set the class weights manually as needed, which is very flexible (manual penalties). Suitable weights can achieve better results, and the best model can be chosen according to different lending preferences. For example, assigning weights of positive : negative = 1 : 4 gives a true positive rate of about 63% and a false positive rate of about 47%. The true positive rate represents profit and the false positive rate represents loss; the exact trade-off we aim for depends on the company's requirements.
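To choose the weights less by hand, one could sweep a few candidate penalties and compare the resulting rates. The sketch below follows the same bookkeeping as the cells above, but is written against the newer sklearn.model_selection imports and uses illustrative weight values:

# Sketch: try several negative-class weights and report recall / fall-out for each.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, KFold

kf = KFold(n_splits=3, shuffle=True, random_state=1)
for neg_weight in [2, 4, 6, 10]:
    lr = LogisticRegression(class_weight={0: neg_weight, 1: 1}, max_iter=1000)
    preds = pd.Series(cross_val_predict(lr, features, target, cv=kf))
    tp = ((preds == 1) & (target == 1)).sum()
    fp = ((preds == 1) & (target == 0)).sum()
    fn = ((preds == 0) & (target == 1)).sum()
    tn = ((preds == 0) & (target == 0)).sum()
    print(neg_weight, "recall =", tp / (tp + fn), "fall-out =", fp / (fp + tn))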