Identifying risky bank loans using Decision Trees


In this section, we will develop a simple credit approval model using C5.0 decision trees. We will also see how the results of the model can be tuned to minimize errors that result in a financial loss for the institution.

Step 1 – Collecting Data 

The idea behind our credit model is to identify factors that are predictive of higher risk of default. Therefore, we need to obtain data on a large number of past bank loans and whether the loan went into default, as well as information on the applicant.
Data with these characteristics is available in a dataset donated to the UCI Machine Learning Data Repository (http://archive.ics.uci.edu/ml) by Hans Hofmann of the University of Hamburg. The dataset contains information on loans obtained from a credit agency in Germany.
The credit dataset includes 1,000 examples on loans, plus a set of numeric and nominal features indicating the characteristics of the loan and the loan applicant. A class variable indicates whether the loan went into default.

Step 2 – Exploring and Preparing the Data

credit <- read.csv("credit.csv")

str(credit)
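One caveat applies depending on your R version. The following is a minimal sketch (using a temporary toy CSV as a stand-in, since it must run without credit.csv): on R 4.0 and later, read.csv() no longer converts character columns to factors by default, so stringsAsFactors = TRUE may be needed for C5.0() to treat them as nominal features.

```r
# Assumption: R >= 4.0, where read.csv() defaults to stringsAsFactors = FALSE.
# A toy two-column CSV stands in for credit.csv here.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("checking_balance,default",
             "< 0 DM,no",
             "unknown,yes"), csv_path)

credit_demo <- read.csv(csv_path, stringsAsFactors = TRUE)
str(credit_demo)  # both columns load as factors, as C5.0() expects
```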

table(credit$checking_balance)
    < 0 DM   > 200 DM 1 - 200 DM    unknown
       274         63        269        394

table(credit$savings_balance)
The output shows a similar breakdown of the applicants' savings account balances across several categories.

The checking and savings account balances may prove to be important predictors of loan default status. Note that since the loan data was obtained from Germany, the currency is recorded in Deutsche Marks (DM). Some of the loan's features are numeric, such as its duration and the amount of credit requested:

summary(credit$months_loan_duration)
 Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.0    12.0    18.0    20.9    24.0    72.0

summary(credit$amount)
 Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  250    1366    2320    3271    3972   18420

The loan amounts ranged from 250 DM to 18,420 DM across terms of 4 to 72 months, with a median duration of 18 months and a median amount of 2,320 DM.

The default vector indicates whether the loan applicant was unable to meet the agreed payment terms and went into default.

A total of 30 percent of the loans in this dataset went into default.

table(credit$default)
no     yes
700    300

A high rate of default is undesirable for a bank, because it means that the bank is unlikely to fully recover its investment.
If we are successful, our model will identify applicants that are at high risk of default, allowing the bank to refuse such credit requests.

Data Preparation

Creating random training and test datasets
As we have done in the previous chapters, we will split our data into two portions: a training dataset to build the decision tree and a test dataset to evaluate the performance of the model on new data.
We will use 90 percent of the data for training and 10 percent for testing, which will provide us with 100 records to simulate new applicants.

Suppose that the bank had sorted the data by the loan amount, with the largest loans at the end of the file. If we used the first 90 percent for training and the remaining 10 percent for testing, we would be training a model on only the small loans and testing the model on the big loans. Obviously, this could be problematic. We'll solve this problem by using a random sample of the credit data for training.
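Before splitting, it is worth noting what set.seed() buys us. A quick sketch with toy numbers, independent of the credit data:

```r
# The same seed produces the same random draw on every run, so a
# train/test split can be reproduced exactly. (The exact indices drawn
# for a given seed changed in R 3.6.0, when the default sample()
# algorithm was updated.)
set.seed(123)
a <- sample(1000, 900)
set.seed(123)
b <- sample(1000, 900)
identical(a, b)  # TRUE: the two draws are the same 900 indices
```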

set.seed(123)
train_sample <- sample(1000, 900)

credit_train <- credit[train_sample, ]
credit_test <- credit[-train_sample, ]

The dash operator used in the selection of the test records tells R to exclude the specified rows; in other words, the test data includes only the rows that are not in the training sample.
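The behavior of negative subscripts can be seen on a toy vector, unrelated to the credit data:

```r
# Negative indices drop the listed positions rather than select them.
ids <- 1:5
keep <- c(1, 3, 5)

ids[keep]   # 1 3 5
ids[-keep]  # 2 4: everything not in `keep`
```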

If all went well, we should have about 30 percent of defaulted loans in each of the datasets:

prop.table(table(credit_train$default))
       no       yes
0.7033333 0.2966667

prop.table(table(credit_test$default))
  no  yes
0.67 0.33

This appears to be a fairly even split, so we can now build our decision tree.

Step 3 – Training a Model on the Data

We will use the C5.0 algorithm in the C50 package to train our decision tree model. If you have not done so already, install the package with install.packages("C50") and load it to your R session, using library(C50).

credit_model <- C5.0(credit_train[-17], credit_train$default)

summary(credit_model)

Step 4 – Evaluating Model Performance

To apply our decision tree to the test dataset, we use the predict() function, as shown in the following line of code:

credit_pred <- predict(credit_model, credit_test)

This creates a vector of predicted class values, which we can compare to the actual class values using the CrossTable() function in the gmodels package.

library(gmodels)

CrossTable(credit_test$default, credit_pred, prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE, dnn = c('actual default', 'predicted default'))

Out of the 100 test loan application records, our model correctly predicted that 59 did not default and 14 did default, resulting in an accuracy of 73 percent and an error rate of 27 percent. This is somewhat worse than its performance on the training data, but not unexpected, given that a model's performance is often worse on unseen data.
Also note that the model only correctly predicted 14 of the 33 actual loan defaults in the test data, or 42 percent. Unfortunately, this type of error is a potentially very costly mistake, as the bank loses money on each default. Let's see if we can improve the result with a bit more effort.
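These percentages follow directly from the counts in the cross table; a quick check using the figures quoted above:

```r
# Confusion-matrix counts taken from the cross table above:
# 59 non-defaults and 14 defaults were predicted correctly,
# out of 100 test records with 33 actual defaults.
correct_no  <- 59
correct_yes <- 14
n_test      <- 100
n_defaults  <- 33

accuracy    <- (correct_no + correct_yes) / n_test  # 0.73
error_rate  <- 1 - accuracy                         # 0.27
sensitivity <- correct_yes / n_defaults             # ~0.42
```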

Step 5 – Improving Model Performance

The C5.0() function makes it easy to add boosting to our C5.0 decision tree. We simply need to add an additional trials parameter indicating the number of separate decision trees to use in the boosted team.
The trials parameter sets an upper limit; the algorithm will stop adding trees if it recognizes that additional trials do not seem to be improving the accuracy. We'll start with 10 trials, a number that has become the de facto standard, as research suggests that this reduces error rates on test data by about 25 percent.


credit_boost10 <- C5.0(credit_train[-17], credit_train$default, trials = 10)

credit_boost_pred10 <- predict(credit_boost10, credit_test)

CrossTable(credit_test$default, credit_boost_pred10, prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE, dnn = c('actual default', 'predicted default'))

Here, we reduced the total error rate from 27 percent prior to boosting down to 18 percent in the boosted model.
It may not seem like a large gain, but it is in fact greater than the 25 percent reduction we expected. On the other hand, the model is still not doing well at predicting defaults, getting only 20 of 33 (about 61 percent) correct.

The lack of an even greater improvement may be a function of our relatively small training dataset, or it may just be a very difficult problem to solve.

Making some mistakes costlier than others

Giving a loan to an applicant who is likely to default can be an expensive mistake. One solution to reduce the number of false negatives may be to reject a larger number of borderline applicants, under the assumption that the interest the bank would earn from a risky loan is far outweighed by the massive loss it would incur if the money is not paid back at all.

The C5.0 algorithm allows us to assign a penalty to different types of errors in order to discourage a tree from making more costly mistakes. The penalties are designated in a cost matrix, which specifies how much costlier each error is relative to any other prediction. To begin constructing the cost matrix, we need to start by specifying the dimensions.

Since the predicted and actual values can both take two values, yes or no, we need to describe a 2 x 2 matrix using a list of two vectors, each with two values. At the same time, we'll also name the matrix dimensions to avoid confusion later on:

matrix_dimensions <- list(c("no", "yes"), c("no", "yes"))

names(matrix_dimensions) <- c("predicted", "actual")

Examining the new object shows that our dimensions have been set up correctly:

matrix_dimensions

$predicted
[1] "no" "yes"

$actual
[1] "no" "yes"

Next, we need to assign the penalty for the various types of errors by supplying four values to fill the matrix. Since R fills a matrix column by column, from top to bottom, we need to supply the values in a specific order:

• Predicted no, actual no
• Predicted yes, actual no
• Predicted no, actual yes
• Predicted yes, actual yes
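The column-first fill order can be confirmed with a quick matrix() call on placeholder values 1 through 4:

```r
# matrix() fills column by column: 1 and 2 become column one,
# 3 and 4 become column two.
m <- matrix(c(1, 2, 3, 4), nrow = 2)
m
#      [,1] [,2]
# [1,]    1    3
# [2,]    2    4
```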

Suppose we believe that a loan default costs the bank four times as much as a missed opportunity. Our penalty values could then be defined as:

error_cost <- matrix(c(0, 1, 4, 0), nrow = 2, dimnames = matrix_dimensions)

This creates the following matrix:
error_cost
         actual
predicted no yes
      no   0   4
      yes  1   0

credit_cost <- C5.0(credit_train[-17], credit_train$default, costs = error_cost)

credit_cost_pred <- predict(credit_cost, credit_test)

CrossTable(credit_test$default, credit_cost_pred, prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE, dnn = c('actual default', 'predicted default'))

Compared to our boosted model, this version makes more mistakes overall: 37 percent error here versus 18 percent in the boosted case. However, the types of mistakes are very different.

Where the previous models classified only 42 and 61 percent of defaults correctly, in this model, 79 percent of the actual defaults were correctly predicted to be defaults. This trade, reducing false negatives at the expense of increasing false positives, may be acceptable if our cost estimates were accurate.
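One way to weigh that trade is to price each model's mistakes with the 4:1 penalty itself. The sketch below uses counts derived from the percentages reported above, so treat the figures as approximate:

```r
# Back-of-the-envelope comparison using confusion counts implied by the
# percentages reported above (derived figures, not direct model output):
#   boosted model:        33 - 20 = 13 missed defaults, 18 - 13 =  5 false alarms
#   cost-sensitive model: 33 - 26 =  7 missed defaults, 37 -  7 = 30 false alarms
cost_fn <- 4  # assumed: a missed default costs four times a missed opportunity
cost_fp <- 1

cost_boosted  <- 13 * cost_fn +  5 * cost_fp  # 57
cost_weighted <-  7 * cost_fn + 30 * cost_fp  # 58
```

Under the assumed 4:1 penalty, the two models' total costs come out nearly identical, which underscores how much this trade-off hinges on the accuracy of the cost estimates.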
