# How is AI used in lending?

How is AI used in lending?

#### Let’s take a look at Lending Club!

The lender needs to fill in personal information on the platform. Based on this information, the platform will evaluate lenders and decide whether to give them the loans and the amount of them. We found the statistics data of lenders of this platform in the first quarter of 2017 from the Internet. With this data, we aim to build a model that learns from

historical data to predict whether to lend to new lenders, while maximizing the company’s profits. Matching this binary classification problem, I decided to build some SKlearn models with Logistic regression.

#### Description：

This report mainly contains two parts, the first of which is the pre-processing of the data. Since the real data is complex, we need to clean the data and filter some valuable predictors. The second part is about model fitting. I have established three models in total. The advantages and disadvantages of these models are explained in detail in the second part.

Principle:

We first define some types of results.

The first class is called False positive (0,1), which means that he will not pay back(marked as 0) but we predict to lend (marked as 1).

The second class is called True positive (1,1), which means that he will pay back(marked as 1) and we predict to lend (marked as 1).

The third class is called True negative (0,0), which means that he will not pay back(marked as 0) and we do not predict to lend (marked as 0).

The fourth class is called False negative (1,0), which means that he will pay back(marked as 1) and we do not predict to lend (marked as 0).

The first class indicates a bad loan,the company did not get the loan receivable and lost money. The second class indicates that the company gets loans receivable and makes profits. Company does not lend in the third class, which gains no profit and no loss. The fourth class indicates it has a potentially profitable opportunity but not seized, the accounts show no profit and no loss. In our model, we need the proportion of making money to be greater than the proportion of losses in order to ensure that the company is profitable. Thus, I define low fall-out = False positive / (False positive + True negative), it represents the proportion of losing money, the wrong prediction. We hope for a low fall-out as small as possible. Then, I define high recall,

… 1/18

5/15/22, 3:44 PM SKLearn in Loan analysis

In [66]:

which is equal to True positive / (True positive + False negative), it represents the proportion of earning money, the right prediction. We hope to recall as big as possible. We want to use these two indicators to measure the profitability of the company and how good the model is.

Part1. Exploratory data analysis(data preprocessing) with predictors selection

import pandas as pd

loans_2017 = pd.read_csv(‘LoanStats_2017Q1.csv’, skiprows=1)

half_data = len(loans_2017) /

loans_2017 = loans_2017.dropna(thresh=half_data, axis=1)

loans_2017.drop_duplicates()

print(loans_2017.iloc[0])

print(loans_2017.shape[1])

funded_amnt_inv 3600

term 36 months

int_rate 7.49%

installment 111.97

emp_title Code/Compliance Inspector

emp_length 10+ years

home_ownership MORTGAGE

annual_inc 120000

verification_status Not Verified

issue_d Mar-2017

loan_status Issued

pymnt_plan n

purpose other

title Other

zip_code 467xx

dti 18.9

delinq_2yrs 0

#Import the pandas and load the data set.

#Decrease the data size, drop some Na columns and some duplicates.

#Show the first row of the data and count the number of the columns.

localhost:8888/notebooks/Desktop/ML in loan analysis/SKLearn in Loan analysis.ipynb#An-application-in-ML:-Loan-Forecasting,-and-Maximizing-Pro… 2/18

5/15/22, 3:44 PM SKLearn in Loan analysis

In [52]: In [53]:

loans_2017 = loans_2017[[“loan_amnt”,”term”,”int_rate”,”installment”,”emp_length”,”home_o print(loans_2017.iloc[0])

print(loans_2017.shape[1])

loan_amnt 3600

term 36 months

int_rate 7.49%

installment 111.97

emp_length 10+ years

home_ownership MORTGAGE

annual_inc 120000

verification_status Not Verified

loan_status Issued

pymnt_plan n

purpose other

title Other

dti 18.9

delinq_2yrs 0

earliest_cr_line Aug-1992

inq_last_6mths 1

open_acc 18

pub_rec 1

l b l 5658

#Based on the data dictionary and common sense, select some potential predictors. #Remove some variables with high correlation to others, since they almost represent the same feature. #Remove some variables obtained after prediction, such as “funded_amnt_inv”.

#Remove some variables that have no relationship with the prediction, such as “zip_code”.

print(loans_2017[‘loan_status’].value_counts())

Current 78897

Issued 15071

Fully Paid 2251

In Grace Period 330

Late (31-120 days) 126

Late (16-30 days) 104

Name: loan_status, dtype: int64

#Show the status of the loan in order to make binary labels.

#Count the number of occurrences of the value for each category

localhost:8888/notebooks/Desktop/ML in loan analysis/SKLearn in Loan analysis.ipynb#An-application-in-ML:-Loan-Forecasting,-and-Maximizing-Pro… 3/18

5/15/22, 3:44 PM SKLearn in Loan analysis

In [54]: In [55]:

loans_2017 = loans_2017[(loans_2017[‘loan_status’] == “Fully Paid”) | (loans_2017[‘loan_s

status_replace =

“loan_status”: {

“Fully Paid”: 1,

“Late (31-120 days)”: 0,

“In Grace Period”: 0,

“Late (16-30 days)”: 0,

}

loans_2017 = loans_2017.replace(status_replace)

#Use “Fully Paid” as our label that represents to lend, which is marked as 1.

#Use “Late (31-120 days)”, “In Grace Period”,”Late (16-30 days)” as our label to refuse to lend, which is marked as 0.

orig_columns = loans_2017.columns

drop_columns = []

for col in orig_columns:

col_series = loans_2017[col].dropna().unique()

if len(col_series) == 1:

drop_columns.append(col)

loans_2017 = loans_2017.drop(drop_columns, axis=1)

print(drop_columns)

print(loans_2017.shape)

loans = loans_2017

[‘pymnt_plan’, ‘policy_code’]

(2811, 23)

#Drop some columns that contain the same value in each row.

localhost:8888/notebooks/Desktop/ML in loan analysis/SKLearn in Loan analysis.ipynb#An-application-in-ML:-Loan-Forecasting,-and-Maximizing-Pro… 4/18

5/15/22, 3:44 PM SKLearn in Loan analysis

In [56]:

print(loans)

In [57]:

96351 73000.0 Verified 0 debt_consolidation  96353 37000.0 Verified 1 debt_consolidation  96390 50000.0 Verified 1 debt_consolidation  96393 121000.0 Not Verified 0 debt_consolidation  96415 48000.0 Not Verified 1 other  96459 40000.0 Verified 0 debt_consolidation  96484 60000.0 Source Verified 1 debt_consolidation  96498 125000.0 Not Verified 1 small_business  96500 92000.0 Verified 0 debt_consolidation  96502 80000.0 Source Verified 1 debt_consolidation  96504 38560.0 Not Verified 1 home_improvement

96508 285000.0 Source Verified 1 credit_card  96523 85000.0 Source Verified 1 major_purchase  96538 200000.0 Source Verified 1 credit_card  96553 50000.0 Not Verified 1 debt_consolidation  96555 81000.0 Verified 1 moving  96573 75000.0 Source Verified 1 credit_card  96635 36000.0 Not Verified 1 debt_consolidation  96671 90000.0 Source Verified 1 debt_consolidation

#Check the cleaned dataframe

null_counts = loans.isnull().sum()

print(null_counts)

emp_length 166

home_ownership 0

annual_inc 0

verification_status 0

loan_status 0

purpose 0

title 0

dti 0

delinq_2yrs 0

earliest_cr_line 0

inq_last_6mths 0

open_acc 0

pub_rec 0

revol_bal 0

revol_util 1

total_acc 0

last_credit_pull_d 0

pub_rec_bankruptcies 0

dtype: int64

#Show some missing data and decide if you need to remove these samples or columns. We find there is 166 data loss in the column “emp_length” and 1 data loss in the column “revol_util”.The size of the loss is not large, and this feature is important to model, so we do not drop these columns.

localhost:8888/notebooks/Desktop/ML in loan analysis/SKLearn in Loan analysis.ipynb#An-application-in-ML:-Loan-Forecasting,-and-Maximizing-Pro… 5/18

5/15/22, 3:44 PM SKLearn in Loan analysis

In [58]: In [59]:

loans = loans.dropna(axis=0)

print(loans.dtypes.value_counts())

float64 11

object 11

int64 1

dtype: int64

#Drop some NAs along the rows, show the data types of the dataframe.

object_colums_df = loans.select_dtypes(include=[“object”])

print(object_colums_df.iloc[0])

term 36 months

int_rate 14.99%

emp_length 3 years

home_ownership OWN

verification_status Source Verified

purpose home_improvement

title Home improvement

earliest_cr_line Sep-2007

revol_util 9.2%

last_credit_pull_d Apr-2017

Name: 124, dtype: object

#Select the columns with categorical variables.

#We need to map some selected terms to numeric variables later, since SKlearn only accepts numeric variables.

localhost:8888/notebooks/Desktop/ML in loan analysis/SKLearn in Loan analysis.ipynb#An-application-in-ML:-Loan-Forecasting,-and-Maximizing-Pro… 6/18

5/15/22, 3:44 PM SKLearn in Loan analysis

In [60]:

columns = [‘term’,’emp_length’,’home_ownership’,’verification_status’,’addr_state’] for i in column:

print(loans[i].value_counts())

print(loans[“purpose”].value_counts())

print(loans[“title”].value_counts())

36 months 2011

60 months 633

Name: term, dtype: int64

10+ years 1033

2 years 275

3 years 215

< 1 year 185

1 year 182

5 years 170

4 years 163

8 years 115

9 years 107

6 years 105

7 years 94

Name: emp_length, dtype: int64

MORTGAGE 1287

RENT 970

OWN 373

ANY 14

Name: home_ownership, dtype: int64

Source Verified 1154

Not Verified 756

Verified 734

Name: verification_status, dtype: int64

CA 418

TX 243

NY 209

FL 173

NJ 112

IL 106

NC 89

GA 83

MD 78

OH 72

PA 71

VA 71

WA 65

CO 62

MI 53

NV 52

MA 52

AZ 48

MN 44

MO 42

WI 40

CT 38

IN 34

TN 33

SC 33

AL 31

LA 31

localhost:8888/notebooks/Desktop/ML in loan analysis/SKLearn in Loan analysis.ipynb#An-application-in-ML:-Loan-Forecasting,-and-Maximizing-Pro… 7/18

5/15/22, 3:44 PM SKLearn in Loan analysis

KY 24

OK 24

OR 22

NM 21

UT 21

NH 20

HI 18

AR 18

KS 14

NE 13

DE 10

RI 10

MS 8

ND 7

MT 5

SD 5

ME 5

AK 5

WY 4

VT 3

DC 2

ID 2

Name: addr_state, dtype: int64

debt_consolidation 1379

credit_card 467

other 251

home_improvement 213

major_purchase 82

car 53

medical 45

house 40

moving 39

vacation 32

renewable_energy 2

Name: purpose, dtype: int64

Debt consolidation 1378

Credit card refinancing 468

Other 251

Home improvement 213

Major purchase 82

Car financing 53

Medical expenses 45

Moving and relocation 39

Vacation 32

Green loan 2

Name: title, dtype: int64

#Show the counts of the value in different categories of selected columns. Column “purpose” and Column “title” almost feature the same item, therefore, I decided to drop the “title” column.

localhost:8888/notebooks/Desktop/ML in loan analysis/SKLearn in Loan analysis.ipynb#An-application-in-ML:-Loan-Forecasting,-and-Maximizing-Pro… 8/18

5/15/22, 3:44 PM SKLearn in Loan analysis

In [61]:

In [62]: In [63]:

mapping_replace =

“emp_length”: {

“n/a”: 0,

“< 1 year”: 0,

“1 year”: 1,

“2 years”: 2,

“3 years”: 3,

“4 years”: 4,

“5 years”: 5,

“6 years”: 6,

“7 years”: 7,

“8 years”: 8,

“9 years”: 9,

“10+ years”: 10,

}

loans = loans.drop([“last_credit_pull_d”,”earliest_cr_line”,”addr_state”,”title”,], axis= loans[“int_rate”] = loans[“int_rate”].str.rstrip(“%”).astype(“float”) loans[“revol_util”] = loans[“revol_util”].str.rstrip(“%”).astype(“float”) loans = loans.replace(mapping_replace)

#remove some useless columns and remove the percent sign(%) in some variables #convert categorical variable to numerical variables in column: “emp_length”

mapping_replace2 =

“term”: {

” 36 months”: 1,

” 60 months”: 2,

}

loans = loans.replace(mapping_replace2)

#convert categorical variable to numerical variables in column: “term”

mapping_replace3 =

“home_ownership”: {

“MORTGAGE”: 1,

“RENT”: 2,

“OWN”: 3,

“ANY”: 4,

}

loans = loans.replace(mapping_replace3)

#convert categorical variable to numerical variables in column: “home_ownership”

localhost:8888/notebooks/Desktop/ML in loan analysis/SKLearn in Loan analysis.ipynb#An-application-in-ML:-Loan-Forecasting,-and-Maximizing-Pro… 9/18

5/15/22, 3:44 PM SKLearn in Loan analysis

In [64]: In [782]:

mapping_replace4 =

“verification_status”: {

“Source Verified”: 1,

“Not Verified”: 0,

“Verified”: 2,

}

loans = loans.replace(mapping_replace4)

#convert categorical variable to numerical variables in column: “verification_status”

mapping_replace5 =

“purpose”: {

“debt_consolidation”: 1,

“credit_card”: 2,

“other”: 3,

“home_improvement”: 4,

“major_purchase”: 5,

“car”: 6,

“medical”: 7,

“house”: 9,

“moving”: 10,

“vacation”: 11,

“renewable_energy”: 12,

}

loans = loans.replace(mapping_replace5)

#convert categorical variable to numerical variables in column: “purpose”

localhost:8888/notebooks/Desktop/ML in loan analysis/SKLearn in Loan analysis.ipynb#An-application-in-ML:-Loan-Forecasting,-and-Maximizing-Pr… 10/18

5/15/22, 3:44 PM SKLearn in Loan analysis

In [65]:

In [784]:

In [15]: In [17]:

print(loans.info())

<class ‘pandas.core.frame.DataFrame’>

Int64Index: 2644 entries, 124 to 96742

Data columns (total 19 columns):

loan_amnt 2644 non-null float64

term 2644 non-null int64

int_rate 2644 non-null float64

installment 2644 non-null float64

emp_length 2644 non-null int64

home_ownership 2644 non-null int64

annual_inc 2644 non-null float64

verification_status 2644 non-null int64

loan_status 2644 non-null int64

purpose 2644 non-null object

dti 2644 non-null float64

delinq_2yrs 2644 non-null float64

inq_last_6mths 2644 non-null float64

open_acc 2644 non-null float64

pub_rec 2644 non-null float64

revol_bal 2644 non-null float64

revol_util 2644 non-null float64

total_acc 2644 non-null float64

pub_rec_bankruptcies 2644 non-null float64

dtypes: float64(13), int64(5), object(1)

memory usage: 413.1+ KB

None

#Now these predictors are converted into numerical variables, so that we can use SKlearn to build machine learning models

loans.to_csv(‘cleaned_loans2017.csv’, index=False

#Save cleaned data in a new csv document

Part2. Model fitting and Analysis

import pandas as pd

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()

cols = loans.columns

train_cols = cols.drop(“loan_status”)

features = loans[train_cols]

target = loans[“loan_status”]

lr.fit(features, target)

predictions = lr.predict(features)

#First, use logistic regression to test the performance of the model. Logistic regression is a method for dealing with two-class problems. Our current problem is a two-class problem. We have to decide whether to

… 11/18

5/15/22, 3:44 PM SKLearn in Loan analysis

lend them money. Note that we are doing classification with logistic regression, not regression. We prepared a training set, testing set, and labels, then we can train the model.

… 12/18

5/15/22, 3:44 PM SKLearn in Loan analysis

In [18]:

### Model 1

from sklearn.linear_model import LogisticRegression

from sklearn.cross_validation import cross_val_predict, KFold lr = LogisticRegression()

Kf = KFold(features.shape[0], random_state=1)

predictions = cross_val_predict(lr, features, target, cv=Kf) predictions = pd.Series(predictions)

# 4 Types results:

false_positives_filter = (predictions == 1) & (loans[“loan_status”] == 0) fp_counts = len(predictions[false_positives_filter])

true_positives_filter = (predictions == 1) & (loans[“loan_status”] == 1) tp_counts = len(predictions[true_positives_filter])

false_negative_filter = (predictions == 0) & (loans[“loan_status”] == 1) fn_counts = len(predictions[false_negative_filter])

true_negative_filter = (predictions == 0) & (loans[“loan_status”] == 0) tn_counts = len(predictions[true_negative_filter])

high_recall = tp_counts / float((fn_counts+tp_counts))

low_fallout = fp_counts / float((tn_counts+fp_counts))

print(“true positives rate =”,high_recall)

print(“false positives rate =”,low_fallout)

print (predictions[:20])

# high_recall = true positives rate

# low_fallout = false positives rate

### not good

true positives rate = 1.0

false positives rate = 1.0

0 1

1 1

2 1

3 1

4 1

5 1

6 1

7 1

8 1

9 1

10 1

11 1

12 1

13 1

14 1

15 1

16 1

17 1

18 1

localhost:8888/notebooks/Desktop/ML in loan analysis/SKLearn in Loan analysis.ipynb#An-application-in-ML:-Loan-Forecasting,-and-Maximizing-Pr… 13/18

5/15/22, 3:44 PM SKLearn in Loan analysis

19 1

dtype: int64

#Using the SKlearn cross-validation library to make predictions, we have projections called predictions. Next we compare the predicted value with the actual value, and then calculate true positives rate and false positives rate. We get a true positive rate of 100%. This value is high means these people can pay back the money, and we gave them a loan. Therefore, we can gain a lot from these loans, which is what we want to see.

However, the false positives rate is also very high(100%), this is what we do not want to see, since it means that these people will not pay back, but we still lent them all the money. Furthermore, this will cause bad loans and we will lose money. In addition, these two values are both 100%, it shows that the problem with this model is that borrows money no matter who he is. It is meaningless, so this model cannot do the prediction.

One of the reasons for this problem may be sample imbalance. It is likely that the vast majority of people in our data sample are marked with a 1 and very few are marked with a 0. This may lead to errors in the classifier, predicting all categories as 1.

We next improve the model and provide some solutions

###### How is AI used in lending?

localhost:8888/notebooks/Desktop/ML in loan analysis/SKLearn in Loan analysis.ipynb#An-application-in-ML:-Loan-Forecasting,-and-Maximizing-Pr… 14/18

5/15/22, 3:44 PM SKLearn in Loan analysis

In [19]:

### Model 2

from sklearn.linear_model import LogisticRegression

from sklearn.cross_validation import cross_val_predict

lr = LogisticRegression(class_weight=“balanced”)

Kf = KFold(features.shape[0], random_state=1)

predictions = cross_val_predict(lr, features, target, cv=Kf) predictions = pd.Series(predictions)

# 4 Types results:

false_positives_filter = (predictions == 1) & (loans[“loan_status”] == 0) fp_counts = len(predictions[false_positives_filter])

true_positives_filter = (predictions == 1) & (loans[“loan_status”] == 1) tp_counts = len(predictions[true_positives_filter])

false_negative_filter = (predictions == 0) & (loans[“loan_status”] == 1) fn_counts = len(predictions[false_negative_filter])

true_negative_filter = (predictions == 0) & (loans[“loan_status”] == 0) tn_counts = len(predictions[true_negative_filter])

high_recall = tp_counts / float((fn_counts+tp_counts))

low_fallout = fp_counts / float((tn_counts+fp_counts))

print(“true positives rate =”,high_recall)

print(“false positives rate =”,low_fallout)

print (predictions[:20])

# high_recall = true positives rate

# low_fallout = false positives rate

### better

true positives rate = 0.6221804511278195

false positives rate = 0.43992248062015504

0 1

1 0

2 1

3 1

4 1

5 1

6 1

7 1

8 1

9 0

10 1

11 1

12 0

13 1

14 0

15 1

16 1

17 1

18 1

localhost:8888/notebooks/Desktop/ML in loan analysis/SKLearn in Loan analysis.ipynb#An-application-in-ML:-Loan-Forecasting,-and-Maximizing-Pr… 15/18

5/15/22, 3:44 PM SKLearn in Loan analysis

19 1

dtype: int64

#One way to improve this model is to assign weights to the samples (with Penalties). Increase the weight of negative samples thus weakening the influence of positive samples on the model and increasing the influence of negative samples on the model. We change the default parameter for logistic regression to class_weight=”balanced”. class_weight parameters can help adjust the weight of positive samples and negative samples. Therefore, the true positives rate now is about 62.2%, and the false positives rate now is about 44%. The model has been improved. It has an effect on predictions, but the true positives rate of 62.2% is still relatively low. In order to maximize profits, we want a higher true positives rate and lower false positives rate。

localhost:8888/notebooks/Desktop/ML in loan analysis/SKLearn in Loan analysis.ipynb#An-application-in-ML:-Loan-Forecasting,-and-Maximizing-Pr… 16/18

5/15/22, 3:44 PM SKLearn in Loan analysis

In [48]:

### Model 3 : Customized weights

from sklearn.linear_model import LogisticRegression

from sklearn.cross_validation import cross_val_predict

Customized_weights =

0: 4,

1: 1,

lr = LogisticRegression(class_weight= Customized_weights) Kf = KFold(features.shape[0], random_state=1)

predictions = cross_val_predict(lr, features, target, cv=Kf) predictions = pd.Series(predictions)

# 4 Types results:

false_positives_filter = (predictions == 1) & (loans[“loan_status”] == 0) fp_counts = len(predictions[false_positives_filter])

true_positives_filter = (predictions == 1) & (loans[“loan_status”] == 1) tp_counts = len(predictions[true_positives_filter])

false_negative_filter = (predictions == 0) & (loans[“loan_status”] == 1) fn_counts = len(predictions[false_negative_filter])

true_negative_filter = (predictions == 0) & (loans[“loan_status”] == 0) tn_counts = len(predictions[true_negative_filter])

high_recall = tp_counts / float((fn_counts+tp_counts))

low_fallout = fp_counts / float((tn_counts+fp_counts))

print(“true positives rate =”,high_recall)

print(“false positives rate =”,low_fallout)

print (predictions[:20])

# high_recall = true positives rate

# low_fallout = false positives rate

true positives rate = 0.6282894736842105

false positives rate = 0.4689922480620155

0 1

1 0

2 1

3 1

4 1

5 1

6 1

7 1

8 1

9 0

10 1

11 1

12 0

13 1

14 0

localhost:8888/notebooks/Desktop/ML in loan analysis/SKLearn in Loan analysis.ipynb#An-application-in-ML:-Loan-Forecasting,-and-Maximizing-Pr… 17/18

5/15/22, 3:44 PM SKLearn in Loan analysis

15 1

16 1

17 1

18 1

19 1

dtype: int64

#We can also manually adjust the sample weights as needed, which is very flexible (manual penalties). The proper weights can achieve better results and the best model can be found according to different loan preferences. For example, I assign a positive sample : negative sample = 1:4, then the true positives rate is about 63%, and the false positives rate is about 47%. True positives rate represents profit, false positives rate represents loss. The specific effect we want to achieve depends on the company’s requirements.