Credit Risk Prediction - Supervised Learning for Classification

Author

Deri Siswara

Case

  • You are employed as a data scientist in a risk analysis team within the financial sector.
  • Your company’s profit is derived from providing loans to customers.
  • However, there is a risk of financial loss if customers default on their loans.
  • To mitigate potential losses, it is essential to prevent high-risk applicants (who may default) from being approved for loans.
  • As a data scientist, your objective is to develop a classification model to distinguish between low-risk and high-risk applicants using customer data, thereby reducing the likelihood of financial loss.

Dataset Description

Detailed description of the Credit Risk dataset:

Feature Name                 Description
person_age                   Age of the applicant
person_income                Annual income
person_home_ownership        Home ownership status
person_emp_length            Employment length (in years)
loan_intent                  Loan intent
loan_grade                   Loan grade
loan_amnt                    Loan amount
loan_int_rate                Interest rate
loan_status                  Loan status (0 = non-default, 1 = default)
loan_percent_income          Loan amount as a percentage of annual income
cb_person_default_on_file    Historical default on file (Y/N)
cb_person_cred_hist_length   Credit history length

Modeling Workflow

1. Import data to Python
2. Data Preprocessing
3. Training Machine Learning Models
4. Threshold Tuning & Test Prediction
5. Model Evaluation & Financial Impact

1. Import data to Python

# Import Numpy and Pandas library
import numpy as np
import pandas as pd
# Function to read the data
def read_data(fname):
    data = pd.read_csv(fname)
    print('Data shape:', data.shape)
    return data
# Read the risk data
data = read_data(fname='credit_risk_dataset.csv')
Data shape: (32581, 12)
data.head()
person_age person_income person_home_ownership person_emp_length loan_intent loan_grade loan_amnt loan_int_rate loan_status loan_percent_income cb_person_default_on_file cb_person_cred_hist_length
0 22 59000 RENT 123.0 PERSONAL D 35000 16.02 1 0.59 Y 3
1 21 9600 OWN 5.0 EDUCATION B 1000 11.14 0 0.10 N 2
2 25 9600 MORTGAGE 1.0 MEDICAL C 5500 12.87 1 0.57 N 3
3 23 65500 RENT 4.0 MEDICAL C 35000 15.23 1 0.53 N 2
4 24 54400 RENT 8.0 MEDICAL C 35000 14.27 1 0.55 Y 4
# Extract all columns name
data.columns
Index(['person_age', 'person_income', 'person_home_ownership',
       'person_emp_length', 'loan_intent', 'loan_grade', 'loan_amnt',
       'loan_int_rate', 'loan_status', 'loan_percent_income',
       'cb_person_default_on_file', 'cb_person_cred_hist_length'],
      dtype='object')

2. Data Preprocessing

The processing pipeline

2.1 Input-Output Split
2.2 Train-Valid-Test Split
2.3 Remove & Preprocess Anomalous Data
2.4 Numerical Imputation
2.5 Feature Engineering the Data
2.6 Create a Preprocessing Function

2.1. Input-Output Split

  • We’re going to split input & output according to the modeling objective.
  • Create a function to split the input & output
# Function to split the data into features and target
def split_input_output(data, target_col):
    X = data.drop(columns=target_col)
    y = data[target_col]
    print('X shape:', X.shape)
    print('y shape:', y.shape)
    return X, y
# Load the train data only
X, y = split_input_output(data=data,
                          target_col='loan_status')
X shape: (32581, 11)
y shape: (32581,)
X.head()
person_age person_income person_home_ownership person_emp_length loan_intent loan_grade loan_amnt loan_int_rate loan_percent_income cb_person_default_on_file cb_person_cred_hist_length
0 22 59000 RENT 123.0 PERSONAL D 35000 16.02 0.59 Y 3
1 21 9600 OWN 5.0 EDUCATION B 1000 11.14 0.10 N 2
2 25 9600 MORTGAGE 1.0 MEDICAL C 5500 12.87 0.57 N 3
3 23 65500 RENT 4.0 MEDICAL C 35000 15.23 0.53 N 2
4 24 54400 RENT 8.0 MEDICAL C 35000 14.27 0.55 Y 4
y.head()
0    1
1    0
2    1
3    1
4    1
Name: loan_status, dtype: int64

2.2. Train-Valid-Test Split

  • Now, we want to split the data before modeling.
  • Split the data into three sets:
    • Train, for training the model
    • Validation, for choosing the best model
    • Test, for estimating the generalization error
  • Use a splitting proportion of train (80%), validation (10%), and test (10%)
# Function to split the data into train and test
from sklearn.model_selection import train_test_split
def split_train_test(X, y, test_size=0.2, seed=0): # 0.2 rule of thumb
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)
    print('X_train shape:', X_train.shape)
    print('y_train shape:', y_train.shape)
    print('X_test shape:', X_test.shape)
    print('y_test shape:', y_test.shape)
    return X_train, X_test, y_train, y_test
# Split the data
# First, split off the train set (80%), then split the remainder equally into test and valid
X_train, X_not, y_train, y_not = split_train_test(X, y, test_size=0.2, seed=0)
X_test, X_valid, y_test, y_valid = split_train_test(X_not, y_not, test_size=0.5, seed=0)
X_train shape: (26064, 11)
y_train shape: (26064,)
X_test shape: (6517, 11)
y_test shape: (6517,)
X_train shape: (3258, 11)
y_train shape: (3258,)
X_test shape: (3259, 11)
y_test shape: (3259,)
# Validate
print(len(X_train)/len(X))  # should be 0.8
print(len(X_test)/len(X))   # should be 0.1
print(len(X_valid)/len(X))  # should be 0.1
0.7999754458119763
0.09999693072649704
0.10002762346152666

The target variable is relatively imbalanced (roughly 78% non-default vs. 22% default); a quick check across the three splits is sketched below.
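One way to verify this is to look at the class proportions in each split; the short check below is only a sketch using the splits created above. If stricter balance were required, train_test_split also accepts a stratify argument.

# Sketch: check the class proportions in each split
for name, target in [('train', y_train), ('valid', y_valid), ('test', y_test)]:
    print(name, target.value_counts(normalize=True).round(3).to_dict())
# A stratified split would preserve the class ratio exactly, e.g.:
# X_train, X_not, y_train, y_not = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)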

X_train.head()
person_age person_income person_home_ownership person_emp_length loan_intent loan_grade loan_amnt loan_int_rate loan_percent_income cb_person_default_on_file cb_person_cred_hist_length
2162 24 30000 MORTGAGE 4.0 PERSONAL B 5000 10.71 0.17 N 2
7670 24 54000 OWN 8.0 EDUCATION C 1200 13.11 0.02 Y 3
24007 27 29000 RENT 2.0 PERSONAL B 10000 12.69 0.34 N 10
25230 29 75840 OWN 5.0 HOMEIMPROVEMENT G 25000 21.27 0.33 N 6
4897 22 39000 RENT 4.0 MEDICAL A 5000 6.99 0.13 N 2

EDA before Preprocessing

  • Find the number of missing values
# Check missing value
100 * (X_train.isna().sum(0) / len(X_train))
person_age                    0.000000
person_income                 0.000000
person_home_ownership         0.000000
person_emp_length             2.693370
loan_intent                   0.000000
loan_grade                    0.000000
loan_amnt                     0.000000
loan_int_rate                 9.706875
loan_percent_income           0.000000
cb_person_default_on_file     0.000000
cb_person_cred_hist_length    0.000000
dtype: float64
  • We will impute the missing values in these variables later

  • First, check the feature distributions

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Identify numeric variables
numeric_vars = X_train.select_dtypes(include=['number']).columns.tolist()
numeric_vars
['person_age',
 'person_income',
 'person_emp_length',
 'loan_amnt',
 'loan_int_rate',
 'loan_percent_income',
 'cb_person_cred_hist_length']
# Plot histogram
fig, ax = plt.subplots(nrows=3, ncols=3, figsize=(12, 8))
axes = ax.flatten()

# Suppress FutureWarning
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Loop through each numeric variable and plot its KDE (Kernel Density Estimate)
for i, col in enumerate(X_train[numeric_vars].columns):
    sns.kdeplot(X_train[col], ax=axes[i], fill=True)  # fill=True replaces the deprecated shade=True
    axes[i].set_title(f'Distribution of {col}')
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Density')

# Adjust layout to prevent overlap
plt.tight_layout()

# Display the plot
plt.show()

Summary:

The KDE plots show significant skewness and potential outliers:

  • person_age, person_income, and person_emp_length contain unusually high values.
  • The loan_amnt and loan_percent_income distributions suggest outliers or extreme values affecting the overall shape of the data.
  • These anomalies indicate the need for data cleaning and preprocessing before further analysis.
  • Given the skewed distributions of most numerical variables, median imputation is a more robust choice than mean imputation for handling missing values, since the median is less affected by outliers (a minimal sketch follows below).
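As an aside, if one wanted to follow the median-imputation suggestion literally, a minimal sketch with scikit-learn's SimpleImputer could look like this (the notebook itself uses a KNNImputer later, so this is only illustrative):

# Sketch: median imputation of the two columns with missing values
from sklearn.impute import SimpleImputer

cols_with_na = ['person_emp_length', 'loan_int_rate']
median_imputer = SimpleImputer(strategy='median')
median_imputer.fit(X_train[cols_with_na])
# transform returns a NumPy array; wrap it back into a DataFrame to keep the column names
X_train_median = pd.DataFrame(median_imputer.transform(X_train[cols_with_na]),
                              columns=cols_with_na, index=X_train.index)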

# Check numerical summary
X_train.describe()
person_age person_income person_emp_length loan_amnt loan_int_rate loan_percent_income cb_person_cred_hist_length
count 26064.000000 2.606400e+04 25362.000000 26064.000000 23534.000000 26064.000000 26064.000000
mean 27.723181 6.631991e+04 4.808690 9596.278392 11.002234 0.170543 5.797000
std 6.308543 6.581172e+04 4.173959 6313.570925 3.238871 0.107044 4.039502
min 20.000000 4.000000e+03 0.000000 500.000000 5.420000 0.000000 2.000000
25% 23.000000 3.849900e+04 2.000000 5000.000000 7.900000 0.090000 3.000000
50% 26.000000 5.500000e+04 4.000000 8000.000000 10.990000 0.150000 4.000000
75% 30.000000 7.905000e+04 7.000000 12250.000000 13.470000 0.230000 8.000000
max 144.000000 6.000000e+06 123.000000 35000.000000 23.220000 0.830000 30.000000
  • Let’s find the cut-off value of each feature
# 'person_age' has outliers
X_train[X_train['person_age']>90]
person_age person_income person_home_ownership person_emp_length loan_intent loan_grade loan_amnt loan_int_rate loan_percent_income cb_person_default_on_file cb_person_cred_hist_length
183 144 200000 MORTGAGE 4.0 EDUCATION B 6000 11.86 0.03 N 2
32297 144 6000000 MORTGAGE 12.0 PERSONAL C 5000 12.73 0.00 N 25
81 144 250000 RENT 4.0 VENTURE C 4800 13.57 0.02 N 3
# person_income has outliers
X_train[X_train['person_income']>3000000]
person_age person_income person_home_ownership person_emp_length loan_intent loan_grade loan_amnt loan_int_rate loan_percent_income cb_person_default_on_file cb_person_cred_hist_length
32297 144 6000000 MORTGAGE 12.0 PERSONAL C 5000 12.73 0.0 N 25
# person_emp_length has outliers
X_train[X_train['person_emp_length']>50]
person_age person_income person_home_ownership person_emp_length loan_intent loan_grade loan_amnt loan_int_rate loan_percent_income cb_person_default_on_file cb_person_cred_hist_length
0 22 59000 RENT 123.0 PERSONAL D 35000 16.02 0.59 Y 3
210 21 192000 MORTGAGE 123.0 VENTURE A 20000 6.54 0.10 N 4
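The cut-offs above (age > 90, income > 3,000,000, employment length > 50) were picked by inspecting the data; an IQR-based fence is a more systematic alternative. A minimal sketch for one feature (note it can be much stricter than the manual cut-offs used here):

# Sketch: IQR-based upper fence for one feature
q1, q3 = X_train['person_income'].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
print(f'IQR upper fence for person_income: {upper_fence:,.0f}')
print('Rows above the fence:', (X_train['person_income'] > upper_fence).sum())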
# Identify categorical variables
cat_vars = X_train.select_dtypes(exclude=['number']).columns.tolist()
cat_vars
['person_home_ownership',
 'loan_intent',
 'loan_grade',
 'cb_person_default_on_file']
# Loop through each column and print value counts
for col in cat_vars:
    print(f"Value counts for {col}:\n")
    print(X_train[col].value_counts())
    print("\n" + "-"*40 + "\n")
Value counts for person_home_ownership:

person_home_ownership
RENT        13158
MORTGAGE    10755
OWN          2064
OTHER          87
Name: count, dtype: int64

----------------------------------------

Value counts for loan_intent:

loan_intent
EDUCATION            5145
MEDICAL              4869
VENTURE              4556
PERSONAL             4432
DEBTCONSOLIDATION    4156
HOMEIMPROVEMENT      2906
Name: count, dtype: int64

----------------------------------------

Value counts for loan_grade:

loan_grade
A    8641
B    8347
C    5171
D    2874
E     787
F     192
G      52
Name: count, dtype: int64

----------------------------------------

Value counts for cb_person_default_on_file:

cb_person_default_on_file
N    21435
Y     4629
Name: count, dtype: int64

----------------------------------------
  • Next, explore the loan_status
y_train.value_counts()
loan_status
0    20294
1     5770
Name: count, dtype: int64
  • Explore the relation between features and loan_status
# Concat the data first
train_data = pd.concat((X_train, y_train), axis=1)
train_data.head()
person_age person_income person_home_ownership person_emp_length loan_intent loan_grade loan_amnt loan_int_rate loan_percent_income cb_person_default_on_file cb_person_cred_hist_length loan_status
2162 24 30000 MORTGAGE 4.0 PERSONAL B 5000 10.71 0.17 N 2 0
7670 24 54000 OWN 8.0 EDUCATION C 1200 13.11 0.02 Y 3 0
24007 27 29000 RENT 2.0 PERSONAL B 10000 12.69 0.34 N 10 1
25230 29 75840 OWN 5.0 HOMEIMPROVEMENT G 25000 21.27 0.33 N 6 1
4897 22 39000 RENT 4.0 MEDICAL A 5000 6.99 0.13 N 2 0
# Create a heatmap
# Get the correlation matrix (numeric vs numeric)
corr_matrix = train_data[numeric_vars + ['loan_status']].corr()
corr_with_loan_status = corr_matrix['loan_status'].sort_values(ascending=False)
print(corr_with_loan_status)
loan_status                   1.000000
loan_percent_income           0.381282
loan_int_rate                 0.333720
loan_amnt                     0.103017
cb_person_cred_hist_length   -0.016076
person_age                   -0.023025
person_emp_length            -0.082141
person_income                -0.140683
Name: loan_status, dtype: float64
# Plot the heatmap
sns.heatmap(corr_matrix, annot=True)
plt.show()

# Create a barplot for categorical variables vs loan_status

# Set up the figure size and layout
fig, ax = plt.subplots(nrows=len(cat_vars), ncols=1, figsize=(10, 5 * len(cat_vars)))

# Flatten ax in case there's only one categorical variable
if len(cat_vars) == 1:
    ax = [ax]

# Loop through each categorical variable and create a bar plot
for i, col in enumerate(cat_vars):
    sns.countplot(data=train_data, x=col, hue='loan_status', ax=ax[i])
    ax[i].set_title(f'{col} vs loan_status')
    ax[i].set_xlabel(col)
    ax[i].set_ylabel('Count')
    ax[i].legend(title='loan_status')
    plt.xticks(rotation=45)

# Adjust layout to prevent overlap
plt.tight_layout()

# Display the plots
plt.show()

2.3. Remove & Preprocess Anomalous Data

  • Let’s remove the anomalous observations.
  • Use the EDA above to decide which rows to drop
# Find the data indices to drop based on multiple conditions
idx_to_drop = X_train[(X_train['person_age'] > 90) | 
                      (X_train['person_income'] > 3000000) | 
                      (X_train['person_emp_length'] > 50)].index.tolist()
# Check the indices
print('Number of indices to drop:', len(idx_to_drop))
idx_to_drop
Number of indices to drop: 5
[183, 0, 32297, 210, 81]
  • Now, let’s drop these rows from X_train and y_train
X_train_dropped = X_train.drop(index=idx_to_drop)
y_train_dropped = y_train.drop(index=idx_to_drop)
# Validate
print('Shape of X train after dropped:', X_train_dropped.shape)
X_train_dropped.head()
Shape of X train after dropped: (26059, 11)
person_age person_income person_home_ownership person_emp_length loan_intent loan_grade loan_amnt loan_int_rate loan_percent_income cb_person_default_on_file cb_person_cred_hist_length
2162 24 30000 MORTGAGE 4.0 PERSONAL B 5000 10.71 0.17 N 2
7670 24 54000 OWN 8.0 EDUCATION C 1200 13.11 0.02 Y 3
24007 27 29000 RENT 2.0 PERSONAL B 10000 12.69 0.34 N 10
25230 29 75840 OWN 5.0 HOMEIMPROVEMENT G 25000 21.27 0.33 N 6
4897 22 39000 RENT 4.0 MEDICAL A 5000 6.99 0.13 N 2
# Validate
print('Shape of y train after dropped:', y_train_dropped.shape)
y_train_dropped.head()
Shape of y train after dropped: (26059,)
2162     0
7670     0
24007    1
25230    1
4897     0
Name: loan_status, dtype: int64
# Validate
X_train_dropped.describe()
person_age person_income person_emp_length loan_amnt loan_int_rate loan_percent_income cb_person_cred_hist_length
count 26059.000000 2.605900e+04 25357.000000 26059.000000 23529.000000 26059.000000 26059.000000
mean 27.710273 6.607549e+04 4.799148 9595.402740 11.001991 0.170548 5.796692
std 6.184305 5.457292e+04 4.039969 6311.712618 3.238852 0.107009 4.037979
min 20.000000 4.000000e+03 0.000000 500.000000 5.420000 0.000000 2.000000
25% 23.000000 3.849600e+04 2.000000 5000.000000 7.900000 0.090000 3.000000
50% 26.000000 5.500000e+04 4.000000 8000.000000 10.990000 0.150000 4.000000
75% 30.000000 7.900000e+04 7.000000 12250.000000 13.470000 0.230000 8.000000
max 84.000000 2.039784e+06 41.000000 35000.000000 23.220000 0.830000 30.000000
# Plot histogram
fig, ax = plt.subplots(nrows=3, ncols=3, figsize=(12, 8))
axes = ax.flatten()

for i, col in enumerate(X_train_dropped[numeric_vars].columns):
    sns.kdeplot(X_train_dropped[col], ax=axes[i])
    axes[i].set_title(f'Distribution of {col}')

plt.tight_layout()
plt.show()

# Create a heatmap
# Get the correlation matrix (numeric vs numeric)
corr_matrix = train_data[numeric_vars + ['loan_status']].corr()
corr_with_loan_status = corr_matrix['loan_status'].sort_values(ascending=False)
print(corr_with_loan_status)
loan_status                   1.000000
loan_percent_income           0.381282
loan_int_rate                 0.333720
loan_amnt                     0.103017
cb_person_cred_hist_length   -0.016076
person_age                   -0.023025
person_emp_length            -0.082141
person_income                -0.140683
Name: loan_status, dtype: float64

2.4. Numerical Imputation

  • Now, let’s perform numerical imputation (the missing values occur only in numerical features: person_emp_length and loan_int_rate)
  • First, check the missing values in the numerical data
# Check missing value
X_train_dropped.isna().sum(0)
person_age                       0
person_income                    0
person_home_ownership            0
person_emp_length              702
loan_intent                      0
loan_grade                       0
loan_amnt                        0
loan_int_rate                 2530
loan_percent_income              0
cb_person_default_on_file        0
cb_person_cred_hist_length       0
dtype: int64
  • Create a function to fit an imputer for the numerical features
from sklearn.impute import KNNImputer
import pandas as pd
import numpy as np

# Function to fit the KNN imputer
def num_imputer_fit(data, n_neighbors=5):
    imputer = KNNImputer(n_neighbors=n_neighbors, missing_values=np.nan)
    imputer.fit(data)
    return imputer

# Function to transform the data using the fitted KNN imputer
def num_imputer_transform(data, imputer):
    imputed_data = imputer.transform(data)
    return pd.DataFrame(imputed_data, columns=data.columns)
  • Perform imputation
# Get the numerical imputer
num_imputer = num_imputer_fit(X_train_dropped[numeric_vars])

# Transform the data
X_train_imputed = num_imputer_transform(X_train_dropped[numeric_vars], num_imputer)

# Reset the indices
X_train_imputed = X_train_imputed.reset_index(drop=True)
X_train_dropped_cat = X_train_dropped[cat_vars].reset_index(drop=True)

# Concatenate the DataFrames
X_train_imputed = pd.concat([X_train_imputed, X_train_dropped_cat], axis=1)
# Validate
X_train_imputed.isna().sum(0)
person_age                    0
person_income                 0
person_emp_length             0
loan_amnt                     0
loan_int_rate                 0
loan_percent_income           0
cb_person_cred_hist_length    0
person_home_ownership         0
loan_intent                   0
loan_grade                    0
cb_person_default_on_file     0
dtype: int64

Great!

2.5. Feature Engineering the Data

  • We standardize the numerical features so that they are on a comparable scale, which helps during model optimization.
  • We encode the categorical features (an ordinal mapping for loan_grade, a binary flag for cb_person_default_on_file, and one-hot encoding for the rest) so that the models can use them.
# Create two functions to perform scaling & transform scaling
from sklearn.preprocessing import StandardScaler
def fit_scaler(data):
    scaler = StandardScaler()
    scaler.fit(data)
    return scaler
def transform_scaler(data, scaler):
    scaled_data = scaler.transform(data)
    return pd.DataFrame(scaled_data, columns=data.columns)
# Fit the scaler
scaler = fit_scaler(X_train_imputed[numeric_vars])

# Transform the scaler
X_train_clean =  transform_scaler(X_train_imputed[numeric_vars], scaler)
X_train_clean = pd.concat([X_train_clean, X_train_imputed[cat_vars]], axis=1)
X_train_clean["loan_grade"].unique()
array(['B', 'C', 'G', 'A', 'D', 'F', 'E'], dtype=object)
from sklearn.preprocessing import OneHotEncoder
def encode_and_one_hot(data, cat_vars, loan_grade_col='loan_grade', default_col='cb_person_default_on_file'):
    # Ordinal encoding for loan_grade
    loan_grade_mapping = {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6, 'G': 7}
    data[loan_grade_col] = data[loan_grade_col].map(loan_grade_mapping)
    
    # Convert 'Y'/'N' in cb_person_default_on_file to 1/0
    default_mapping = {'Y': 1, 'N': 0}
    data[default_col] = data[default_col].map(default_mapping)
    
    # Remove columns that are specifically encoded from cat_vars list
    cat_vars = [col for col in cat_vars if col not in [loan_grade_col, default_col]]
    
    # Apply OneHotEncoder to the remaining categorical variables
    # (note: the encoder is refit on whatever data is passed in; on scikit-learn >= 1.2 use sparse_output=False)
    encoder = OneHotEncoder(drop='first', sparse=False)  # drop='first' to avoid multicollinearity
    encoded_data = pd.DataFrame(
        encoder.fit_transform(data[cat_vars]),
        columns=encoder.get_feature_names_out(cat_vars)
    )

    # Combine the encoded data with the original data, excluding original categorical columns
    result = pd.concat([data, encoded_data], axis=1)
    result = result.drop(cat_vars, axis=1)
    
    return result
X_train_clean = encode_and_one_hot(X_train_clean, cat_vars)
X_train_clean.describe()
person_age person_income person_emp_length loan_amnt loan_int_rate loan_percent_income cb_person_cred_hist_length loan_grade cb_person_default_on_file person_home_ownership_OTHER person_home_ownership_OWN person_home_ownership_RENT loan_intent_EDUCATION loan_intent_HOMEIMPROVEMENT loan_intent_MEDICAL loan_intent_PERSONAL loan_intent_VENTURE
count 2.605900e+04 2.605900e+04 2.605900e+04 2.605900e+04 2.605900e+04 2.605900e+04 2.605900e+04 26059.000000 26059.000000 26059.000000 26059.000000 26059.000000 26059.000000 26059.000000 26059.000000 26059.000000 26059.000000
mean -2.276769e-17 8.098208e-17 6.843940e-17 -2.481269e-17 2.464909e-16 7.416540e-17 4.512638e-17 2.217353 0.177597 0.003339 0.079205 0.504854 0.197398 0.111516 0.186845 0.169999 0.174757
std 1.000019e+00 1.000019e+00 1.000019e+00 1.000019e+00 1.000019e+00 1.000019e+00 1.000019e+00 1.167605 0.382180 0.057685 0.270063 0.499986 0.398043 0.314776 0.389795 0.375639 0.379767
min -1.246772e+00 -1.137500e+00 -1.195040e+00 -1.441063e+00 -1.789949e+00 -1.593792e+00 -9.402636e-01 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% -7.616641e-01 -5.053791e-01 -6.949759e-01 -7.280894e-01 -8.600444e-01 -7.527291e-01 -6.926103e-01 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% -2.765558e-01 -2.029523e-01 -1.949117e-01 -2.527734e-01 -3.891071e-03 -1.920204e-01 -4.449569e-01 2.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 3.702552e-01 2.368347e-01 5.551848e-01 4.205908e-01 7.111733e-01 5.555913e-01 5.456566e-01 3.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000
max 9.102204e+00 3.616714e+01 9.056277e+00 4.025070e+00 3.917740e+00 6.162679e+00 5.994031e+00 7.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
X_train_clean.columns
Index(['person_age', 'person_income', 'person_emp_length', 'loan_amnt',
       'loan_int_rate', 'loan_percent_income', 'cb_person_cred_hist_length',
       'loan_grade', 'cb_person_default_on_file',
       'person_home_ownership_OTHER', 'person_home_ownership_OWN',
       'person_home_ownership_RENT', 'loan_intent_EDUCATION',
       'loan_intent_HOMEIMPROVEMENT', 'loan_intent_MEDICAL',
       'loan_intent_PERSONAL', 'loan_intent_VENTURE'],
      dtype='object')

2.6. Create a Preprocessing Function

  • Now, let’s create a function to preprocess the other sets of data (validation & test) so that we can generate predictions on them
  • Note: for simplicity, the function below refits the imputer, scaler, and encoder on the data it receives; a leakage-free alternative that reuses the transformers fitted on the training data is sketched after the outputs below
# Create a function to preprocess a dataset
def preprocess_data(data, num_imputer, scaler, cat_vars, numeric_vars):
    
    # Impute missing numerical values
    # (note: the imputer is refit on the data passed in; the num_imputer argument is not reused)
    num_imputer = num_imputer_fit(data[numeric_vars])
    data_imputed = num_imputer_transform(data[numeric_vars], num_imputer)
   
    # Reset the indices
    data_imputed = data_imputed.reset_index(drop=True)
    data_cat = data[cat_vars].reset_index(drop=True)
    
    # Concatenate the DataFrames
    data_imputed = pd.concat([data_imputed, data_cat], axis=1)
    
    # Scale the numerical features
    # (note: the scaler is also refit here; the scaler argument is not reused)
    scaler = fit_scaler(data_imputed[numeric_vars])
    data_scaled = transform_scaler(data_imputed[numeric_vars], scaler)
    data_scaled = pd.concat([data_scaled, data_imputed[cat_vars]], axis=1)
    
    # Encode the categorical features (ordinal mapping + one-hot encoding)
    clean_data = encode_and_one_hot(data_scaled, cat_vars)
    
    # Output the shape of the original and cleaned data
    print(f"Original data shape: {data.shape}")
    print(f"Cleaned data shape : {clean_data.shape}")
    
    return clean_data
# Preprocess the training data again
X_train_clean = preprocess_data(data=X_train_dropped, num_imputer=num_imputer, scaler=scaler, cat_vars=cat_vars, numeric_vars=numeric_vars)
Original data shape: (26059, 11)
Cleaned data shape : (26059, 17)
# Validate
X_train_clean.head()
person_age person_income person_emp_length loan_amnt loan_int_rate loan_percent_income cb_person_cred_hist_length loan_grade cb_person_default_on_file person_home_ownership_OTHER person_home_ownership_OWN person_home_ownership_RENT loan_intent_EDUCATION loan_intent_HOMEIMPROVEMENT loan_intent_MEDICAL loan_intent_PERSONAL loan_intent_VENTURE
0 -0.599961 -0.661064 -0.194912 -0.728089 -0.093675 -0.005117 -0.940264 2 0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
1 -0.599961 -0.221277 0.805217 -1.330156 0.675901 -1.406889 -0.692610 3 1 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0
2 -0.114853 -0.679388 -0.694976 0.064104 0.541225 1.583557 1.040963 2 0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0
3 0.208552 0.178929 0.055120 2.440683 3.292460 1.490106 0.050350 7 0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0
4 -0.923367 -0.496144 -0.194912 -0.728089 -1.286518 -0.378923 -0.940264 1 0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0
# Transform other set of data
X_valid_clean = preprocess_data(data=X_valid, num_imputer=num_imputer, scaler=scaler, cat_vars=cat_vars, numeric_vars=numeric_vars)
X_test_clean = preprocess_data(data=X_test, num_imputer=num_imputer, scaler=scaler, cat_vars=cat_vars, numeric_vars=numeric_vars)
Original data shape: (3259, 11)
Cleaned data shape : (3259, 17)
Original data shape: (3258, 11)
Cleaned data shape : (3258, 17)
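As noted above, preprocess_data refits the imputer, scaler, and one-hot encoder on every split it receives, which risks data leakage and column mismatches. A minimal sketch of an alternative that reuses the transformers fitted on the training data (num_imputer and scaler from above) and aligns the one-hot columns with the training set:

# Sketch: preprocess a split with the transformers fitted on the training data
def preprocess_with_fitted(data, num_imputer, scaler, cat_vars, numeric_vars, train_columns):
    # Impute and scale using the train-fitted transformers (no refitting)
    num = num_imputer_transform(data[numeric_vars], num_imputer)
    num = transform_scaler(num, scaler).reset_index(drop=True)
    cat = data[cat_vars].reset_index(drop=True)
    combined = pd.concat([num, cat], axis=1)
    # encode_and_one_hot still one-hot encodes per split; the reindex below aligns its columns
    clean = encode_and_one_hot(combined, cat_vars)
    # Columns unseen in training are dropped, missing ones are filled with 0
    return clean.reindex(columns=train_columns, fill_value=0)

# Usage (sketch):
# X_valid_clean = preprocess_with_fitted(X_valid, num_imputer, scaler, cat_vars, numeric_vars, X_train_clean.columns)
# X_test_clean  = preprocess_with_fitted(X_test,  num_imputer, scaler, cat_vars, numeric_vars, X_train_clean.columns)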

3. Training Machine Learning Models

3.1 Prepare model evaluation functions
3.2 Train & cross-validate several models
3.3 Choose the best model (carried out in Section 4 using the validation and test sets)

3.1. Prepare model evaluation functions

  • Before modeling, let’s prepare two functions
    • extract_cv_results: to return the train/validation scores and the best parameters from the hyperparameter search
    • binary_classification_metrics: to return the performance metrics of a model
# Function to evaluate the model and tuning hyperparameters

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.metrics import confusion_matrix

def extract_cv_results(cv_obj):
    best_score_train = cv_obj.cv_results_['mean_train_score'][cv_obj.best_index_]
    best_score_valid = cv_obj.cv_results_['mean_test_score'][cv_obj.best_index_]
    best_params = cv_obj.best_params_
    return best_score_train, best_score_valid, best_params

def binary_classification_metrics(y_actual, y_pred):
    accuracy = accuracy_score(y_actual, y_pred)
    f1 = f1_score(y_actual, y_pred)
    precision = precision_score(y_actual, y_pred)
    recall = recall_score(y_actual, y_pred)
    
    print(f"Classification Metrics:")
    print(f"-----------------------")
    print(f"Accuracy : {accuracy:.4f}")
    print(f"F1 Score : {f1:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall   : {recall:.4f}")
    
    return accuracy, f1, precision, recall

3.2. Train and Cross Validate Several Models

# Import the candidate models + GridSearchCV
from sklearn.dummy import DummyClassifier
# from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
# Perform GridSearchCV for Baseline model
# Set up the DummyClassifier for a baseline model in a classification task
clf_base = GridSearchCV(cv=5, 
                        estimator=DummyClassifier(), 
                        param_grid={'strategy': ['most_frequent', 'stratified', 'prior', 'uniform']}, 
                        return_train_score=True, 
                        scoring='balanced_accuracy')
# Fit the baseline model
clf_base.fit(X_train_clean, y_train_dropped)
GridSearchCV(cv=5, estimator=DummyClassifier(),
             param_grid={'strategy': ['most_frequent', 'stratified', 'prior',
                                      'uniform']},
             return_train_score=True, scoring='balanced_accuracy')
# Validate the CV Score
train_base, valid_base, best_param_base = extract_cv_results(clf_base)

print(f'Train score - Baseline model: {train_base}')
print(f'Valid score - Baseline model: {valid_base}')
print(f'Best Params - Baseline model: {best_param_base}')
Train score - Baseline model: 0.49956010208664825
Valid score - Baseline model: 0.5019536180618739
Best Params - Baseline model: {'strategy': 'stratified'}
Perform CV for Logistic Regression Model
# Perform GridSearchCV for Logistic Regression model
param_logit = {
    'penalty': ['l1', 'l2'],  # Regularization types
    'C': [0.001, 0.01, 0.1, 1, 10, 100],  # Inverse of regularization strength (wider range)
    'solver': ['liblinear'],  # 'liblinear' supports both 'l1' and 'l2' penalties
    'class_weight': [None, 'balanced']  # Add class weighting for imbalanced data
}
clf_lr = GridSearchCV(cv=5, 
                      estimator=LogisticRegression(max_iter=1000), 
                      param_grid=param_logit, 
                      return_train_score=True, 
                      scoring='balanced_accuracy')
# Fit the Logistic Regression model
clf_lr.fit(X_train_clean, y_train_dropped)
GridSearchCV(cv=5, estimator=LogisticRegression(max_iter=1000),
             param_grid={'C': [0.001, 0.01, 0.1, 1, 10, 100],
                         'class_weight': [None, 'balanced'],
                         'penalty': ['l1', 'l2'], 'solver': ['liblinear']},
             return_train_score=True, scoring='balanced_accuracy')
# Validate the CV Score
train_lr, valid_lr, best_param_lr = extract_cv_results(clf_lr)

print(f'Train score - Logistic Regression model: {train_lr}')
print(f'Valid score - Logistic Regression model: {valid_lr}')
print(f'Best Params - Logistic Regression model: {best_param_lr}')
Train score - Logistic Regression model: 0.7868440567679686
Valid score - Logistic Regression model: 0.7859840144225727
Best Params - Logistic Regression model: {'C': 1, 'class_weight': 'balanced', 'penalty': 'l1', 'solver': 'liblinear'}
Perform CV for Decision Tree Model
# Perform GridSearchCV for Decision Tree model
param_dt = {
    'max_depth': [5, 10, 15, 20, None],  # More depth options
    'min_samples_split': [2, 5, 10, 20], 
    'min_samples_leaf': [1, 2, 4, 8],  # Add min_samples_leaf parameter
    'criterion': ['gini', 'entropy'],  # Try different splitting criteria
    'class_weight': [None, 'balanced']  # Add class weighting for imbalanced data
}
clf_dt = GridSearchCV(cv=5, 
                      estimator=DecisionTreeClassifier(), 
                      param_grid=param_dt, 
                      return_train_score=True, 
                      scoring='balanced_accuracy')
# Fit the Decision Tree model
clf_dt.fit(X_train_clean, y_train_dropped)
GridSearchCV(cv=5, estimator=DecisionTreeClassifier(),
             param_grid={'class_weight': [None, 'balanced'],
                         'criterion': ['gini', 'entropy'],
                         'max_depth': [5, 10, 15, 20, None],
                         'min_samples_leaf': [1, 2, 4, 8],
                         'min_samples_split': [2, 5, 10, 20]},
             return_train_score=True, scoring='balanced_accuracy')
# Validate the CV Score
train_dt, valid_dt, best_param_dt = extract_cv_results(clf_dt)
print(f'Train score - Decision Tree model: {train_dt}')
print(f'Valid score - Decision Tree model: {valid_dt}')
print(f'Best Params - Decision Tree model: {best_param_dt}')
Train score - Decision Tree model: 0.8992543797595026
Valid score - Decision Tree model: 0.856304121932123
Best Params - Decision Tree model: {'class_weight': None, 'criterion': 'gini', 'max_depth': 15, 'min_samples_leaf': 1, 'min_samples_split': 10}
Perform CV for Random Forest Model
# Define parameter grid for Random Forest
param_rf = {
    'n_estimators': [100, 200],
    'max_depth': [10, None],
    'class_weight': [None, 'balanced']
}

# Perform GridSearchCV for Random Forest model
clf_rf = GridSearchCV(
    cv=5,
    estimator=RandomForestClassifier(),
    param_grid=param_rf,
    return_train_score=True,
    scoring='balanced_accuracy'  # Use balanced accuracy for scoring
)

# Fit the Random Forest model
clf_rf.fit(X_train_clean, y_train_dropped)

# Extract and print CV results
train_rf, valid_rf, best_param_rf = extract_cv_results(clf_rf)
print(f'Train score - Random Forest model: {train_rf}')
print(f'Valid score - Random Forest model: {valid_rf}')
print(f'Best Params - Random Forest model: {best_param_rf}')
Train score - Random Forest model: 1.0
Valid score - Random Forest model: 0.8593265141147377
Best Params - Random Forest model: {'class_weight': None, 'max_depth': None, 'n_estimators': 100}
Perform CV for XGBoost Model
# Define parameter grid for XGBoost
param_xgb = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1],
    'scale_pos_weight': [1, sum(y_train_dropped == 0) / sum(y_train_dropped == 1)]
}

# Perform GridSearchCV for XGBoost model
clf_xgb = GridSearchCV(
    cv=5,
    estimator=XGBClassifier(use_label_encoder=False, eval_metric='logloss'), 
    param_grid=param_xgb,
    return_train_score=True,
    scoring='balanced_accuracy'  # Use balanced accuracy for scoring
)

# Fit the XGBoost model
clf_xgb.fit(X_train_clean, y_train_dropped)

# Extract and print CV results
train_xgb, valid_xgb, best_param_xgb = extract_cv_results(clf_xgb)
print(f'Train score - XGBoost model: {train_xgb}')
print(f'Valid score - XGBoost model: {valid_xgb}')
print(f'Best Params - XGBoost model: {best_param_xgb}')
Train score - XGBoost model: 0.9145831986931091
Valid score - XGBoost model: 0.8793141174660957
Best Params - XGBoost model: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 200, 'scale_pos_weight': 3.5170740162939853}

4. Find the Best Threshold for Each Model on the Validation Data, Then Select the Best Model on the Test Data

4.1. Apply the Best Models to the Validation Data and Tune the Decision Threshold

import numpy as np
from sklearn.metrics import balanced_accuracy_score

# Define a function to find the best threshold that minimizes financial losses
def find_best_threshold_min_loss(y_true, y_proba, loss_per_FN, loss_per_FP, start=0.3, end=0.7, step=0.01):
    thresholds = []
    losses = []
    
    best_threshold = 0.5
    min_loss = float('inf')
    
    # Loop through thresholds from start to end in specified increments
    for threshold in np.arange(start, end + step, step):
        # Convert probabilities to binary predictions
        y_pred = (y_proba >= threshold).astype(int)
        
        # Calculate the confusion matrix components
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
        
        # Calculate financial loss due to false negatives and false positives
        loss_due_to_FN = fn * loss_per_FN
        loss_due_to_FP = fp * loss_per_FP
        
        # Calculate total financial impact
        total_financial_impact = loss_due_to_FN + loss_due_to_FP
        
        # Store threshold and loss
        thresholds.append(threshold)
        losses.append(total_financial_impact)
        
        # Update best threshold if current loss is lower
        if total_financial_impact < min_loss:
            min_loss = total_financial_impact
            best_threshold = threshold
    
    return best_threshold, min_loss, thresholds, losses
import pandas as pd

# Define the loss values
loss_per_FN = 35_000_000  # Rp 35,000,000 for false negatives
loss_per_FP = 10_000_000  # Rp 10,000,000 for false positives

# Collect best thresholds and minimum losses for each model
model_names = ['Baseline', 'Logistic Regression', 'Decision Tree', 'Random Forest', 'XGBoost']
best_thresholds = []
min_losses = []

# Baseline (DummyClassifier)
y_proba_base = clf_base.predict_proba(X_valid_clean)[:, 1]
best_threshold_base, min_loss_base, _, _ = find_best_threshold_min_loss(
    y_valid, y_proba_base, loss_per_FN, loss_per_FP, start=0.3, end=0.7
)
best_thresholds.append(best_threshold_base)
min_losses.append(min_loss_base)

# Logistic Regression
y_proba_lr = clf_lr.predict_proba(X_valid_clean)[:, 1]
best_threshold_lr, min_loss_lr, _, _ = find_best_threshold_min_loss(
    y_valid, y_proba_lr, loss_per_FN, loss_per_FP, start=0.3, end=0.7
)
best_thresholds.append(best_threshold_lr)
min_losses.append(min_loss_lr)

# Decision Tree
y_proba_dt = clf_dt.predict_proba(X_valid_clean)[:, 1]
best_threshold_dt, min_loss_dt, _, _ = find_best_threshold_min_loss(
    y_valid, y_proba_dt, loss_per_FN, loss_per_FP, start=0.3, end=0.7
)
best_thresholds.append(best_threshold_dt)
min_losses.append(min_loss_dt)

# Random Forest
y_proba_rf = clf_rf.predict_proba(X_valid_clean)[:, 1]
best_threshold_rf, min_loss_rf, _, _ = find_best_threshold_min_loss(
    y_valid, y_proba_rf, loss_per_FN, loss_per_FP, start=0.3, end=0.7
)
best_thresholds.append(best_threshold_rf)
min_losses.append(min_loss_rf)

# XGBoost
y_proba_xgb = clf_xgb.predict_proba(X_valid_clean)[:, 1]
best_threshold_xgb, min_loss_xgb, _, _ = find_best_threshold_min_loss(
    y_valid, y_proba_xgb, loss_per_FN, loss_per_FP, start=0.3, end=0.7
)
best_thresholds.append(best_threshold_xgb)
min_losses.append(min_loss_xgb)

# Create a summary DataFrame
summary_df = pd.DataFrame({
    'Model': model_names,
    'Best Threshold': best_thresholds,
    'Minimum Loss': min_losses
})

# Display the summary table
print('Summary of Best Thresholds and Minimum Loss for Each Model:')
print(summary_df)
Summary of Best Thresholds and Minimum Loss for Each Model:
                 Model  Best Threshold  Minimum Loss
0             Baseline            0.30   23805000000
1  Logistic Regression            0.55   10185000000
2        Decision Tree            0.63    7860000000
3        Random Forest            0.32    7925000000
4              XGBoost            0.62    7930000000
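The thresholds and losses lists returned by find_best_threshold_min_loss also make it easy to visualise the trade-off; a minimal sketch for the XGBoost model, reusing the validation probabilities computed above:

# Sketch: expected loss vs. decision threshold for XGBoost on the validation set
_, _, thr_list, loss_list = find_best_threshold_min_loss(y_valid, y_proba_xgb, loss_per_FN, loss_per_FP)
plt.plot(thr_list, loss_list)
plt.axvline(best_threshold_xgb, color='red', linestyle='--', label=f'best threshold = {best_threshold_xgb:.2f}')
plt.xlabel('Threshold')
plt.ylabel('Expected loss (Rp)')
plt.title('XGBoost: expected loss vs. threshold (validation set)')
plt.legend()
plt.show()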
from sklearn.metrics import classification_report

# Baseline (DummyClassifier)
y_proba_base = clf_base.predict_proba(X_test_clean)[:, 1]
y_pred_base = (y_proba_base >= best_threshold_base).astype(int)
report_base = classification_report(y_test, y_pred_base)
print("Classification Report for Baseline Model:")
print(report_base)

# Logistic Regression
y_proba_lr = clf_lr.predict_proba(X_test_clean)[:, 1]
y_pred_lr = (y_proba_lr >= best_threshold_lr).astype(int)
report_lr = classification_report(y_test, y_pred_lr)
print("\nClassification Report for Logistic Regression:")
print(report_lr)

# Decision Tree
y_proba_dt = clf_dt.predict_proba(X_test_clean)[:, 1]
y_pred_dt = (y_proba_dt >= best_threshold_dt).astype(int)
report_dt = classification_report(y_test, y_pred_dt)
print("\nClassification Report for Decision Tree:")
print(report_dt)

# Random Forest
y_proba_rf = clf_rf.predict_proba(X_test_clean)[:, 1]
y_pred_rf = (y_proba_rf >= best_threshold_rf).astype(int)
report_rf = classification_report(y_test, y_pred_rf)
print("\nClassification Report for Random Forest:")
print(report_rf)

# XGBoost
y_proba_xgb = clf_xgb.predict_proba(X_test_clean)[:, 1]
y_pred_xgb = (y_proba_xgb >= best_threshold_xgb).astype(int)
report_xgb = classification_report(y_test, y_pred_xgb)
print("\nClassification Report for XGBoost:")
print(report_xgb)
Classification Report for Baseline Model:
              precision    recall  f1-score   support

           0       0.79      0.77      0.78      2589
           1       0.20      0.22      0.21       669

    accuracy                           0.66      3258
   macro avg       0.50      0.50      0.50      3258
weighted avg       0.67      0.66      0.66      3258


Classification Report for Logistic Regression:
              precision    recall  f1-score   support

           0       0.93      0.82      0.87      2589
           1       0.52      0.75      0.61       669

    accuracy                           0.80      3258
   macro avg       0.72      0.78      0.74      3258
weighted avg       0.84      0.80      0.82      3258


Classification Report for Decision Tree:
              precision    recall  f1-score   support

           0       0.93      0.91      0.92      2589
           1       0.68      0.75      0.71       669

    accuracy                           0.88      3258
   macro avg       0.81      0.83      0.82      3258
weighted avg       0.88      0.88      0.88      3258


Classification Report for Random Forest:
              precision    recall  f1-score   support

           0       0.94      0.88      0.91      2589
           1       0.63      0.78      0.70       669

    accuracy                           0.86      3258
   macro avg       0.78      0.83      0.80      3258
weighted avg       0.88      0.86      0.87      3258


Classification Report for XGBoost:
              precision    recall  f1-score   support

           0       0.94      0.92      0.93      2589
           1       0.70      0.77      0.73       669

    accuracy                           0.89      3258
   macro avg       0.82      0.84      0.83      3258
weighted avg       0.89      0.89      0.89      3258

The classification reports show that all machine learning models outperform the baseline, especially in identifying defaulters (class 1). Among the models:

  • XGBoost achieves the highest accuracy (89%) and the best balance between precision (70%) and recall (77%) for defaulters.
  • Decision Tree and Random Forest also perform well, with high recall and good precision for class 1.
  • Logistic Regression improves recall but has lower precision compared to tree-based models.
  • The Baseline model performs poorly for defaulters.

Conclusion:
XGBoost is the best model overall, providing the highest accuracy and strong performance in both precision and recall for identifying defaulters. This makes it the most effective choice for minimizing financial risk in this context.
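The same cost assumptions can be used to compare the models directly on the test set; a short sketch using the predictions above:

# Sketch: total expected loss on the test set for each model
test_losses = {}
for name, y_pred in [('Baseline', y_pred_base), ('Logistic Regression', y_pred_lr),
                     ('Decision Tree', y_pred_dt), ('Random Forest', y_pred_rf),
                     ('XGBoost', y_pred_xgb)]:
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    test_losses[name] = fn * loss_per_FN + fp * loss_per_FP
print(pd.Series(test_losses).sort_values())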

5. Model Evaluation and Financial Impact

5.1 Model Evaluation

My machine learning workflow involved training several models using the training set to find the best hyperparameters, tuning the threshold with validation data to minimize financial loss impact, and finally evaluating performance with test data as out-of-sample data.

Among all models tested, XGBoost emerged as the best performer with the following advantages:

  1. Highest Overall Accuracy (89%): XGBoost correctly classified 89% of all loan applications.
  2. Best Balanced Performance: It achieved the best balance between precision (70%) and recall (77%) for identifying defaulters.
  3. Effective Hyperparameters: The best configuration included:
    • 200 estimators (trees)
    • Maximum depth of 5
    • Learning rate of 0.1
    • Scale positive weight of 3.52 to handle class imbalance

The optimal probability threshold of 0.62 was determined to minimize financial losses, striking a balance between false positives and false negatives.
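The per-class figures quoted above can also be reproduced with the binary_classification_metrics helper defined in Section 3.1, for example:

# Summarise the final XGBoost predictions on the test set
binary_classification_metrics(y_test, y_pred_xgb)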

from sklearn.metrics import confusion_matrix

# Confusion matrix for each model on the test set
cm_base = confusion_matrix(y_test, y_pred_base)
cm_lr = confusion_matrix(y_test, y_pred_lr)
cm_dt = confusion_matrix(y_test, y_pred_dt)
cm_rf = confusion_matrix(y_test, y_pred_rf)
cm_xgb = confusion_matrix(y_test, y_pred_xgb)

print("Confusion Matrix - Baseline:\n", cm_base)
print("\nConfusion Matrix - Logistic Regression:\n", cm_lr)
print("\nConfusion Matrix - Decision Tree:\n", cm_dt)
print("\nConfusion Matrix - Random Forest:\n", cm_rf)
print("\nConfusion Matrix - XGBoost:\n", cm_xgb)
Confusion Matrix - Baseline:
 [[1991  598]
 [ 519  150]]

Confusion Matrix - Logistic Regression:
 [[2123  466]
 [ 170  499]]

Confusion Matrix - Decision Tree:
 [[2353  236]
 [ 167  502]]

Confusion Matrix - Random Forest:
 [[2280  309]
 [ 147  522]]

Confusion Matrix - XGBoost:
 [[2370  219]
 [ 155  514]]
# y_test table
y_test.value_counts()
loan_status
0    2589
1     669
Name: count, dtype: int64

5.2 Financial Impact

XGBoost Performance on Test Data:

The confusion matrix for XGBoost is:

Confusion Matrix - XGBoost:
[[2370 219]
[ 155 514]]

  • True Negatives (TN): 2370 (Correctly identified good applicants)
  • False Positives (FP): 219 (Good applicants incorrectly flagged as risky)
  • False Negatives (FN): 155 (Defaulters incorrectly approved)
  • True Positives (TP): 514 (Correctly identified defaulters)

Financial Analysis:

Given our financial assumptions:

  • False Negatives cost: Rp 35,000,000 per applicant (approving bad loans)

  • False Positives cost: Rp 10,000,000 per applicant (rejecting good loans)

XGBoost’s financial impact:

  • Loss due to False Negatives: Rp 5,425,000,000 (155 × Rp 35M)

  • Loss due to False Positives: Rp 2,190,000,000 (219 × Rp 10M)

  • Total Financial Impact: Rp 7,615,000,000
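These figures can be reproduced directly from the confusion matrix; a short check:

# Verify the financial impact from the XGBoost confusion matrix
tn, fp, fn, tp = cm_xgb.ravel()
loss_fn = fn * loss_per_FN   # 155 x Rp 35,000,000
loss_fp = fp * loss_per_FP   # 219 x Rp 10,000,000
print(f'Loss due to FN: Rp {loss_fn:,}')
print(f'Loss due to FP: Rp {loss_fp:,}')
print(f'Total impact  : Rp {loss_fn + loss_fp:,}')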

Comparison with Baseline (both on the test set):

  • Baseline model total loss: Rp 24,145,000,000 (519 FN × Rp 35M + 598 FP × Rp 10M, from its confusion matrix above)

  • XGBoost total loss: Rp 7,615,000,000

  • Cost savings: Rp 16,530,000,000 (roughly a 68% reduction)

The XGBoost model significantly reduces financial losses compared to the baseline approach. Its higher recall on defaulters results in fewer costly false negatives, while its precision keeps false positives at an acceptable level.

Business Implications:

  1. Risk Reduction: The model effectively identifies 77% of potential defaulters before loans are approved.

  2. Revenue Preservation: While being cautious, the model still approves 92% of good applications, preserving most revenue opportunities.

  3. Cost-Effective Solution: For every 1,000 applications processed, the model saves approximately Rp 5 billion compared to the baseline.

In conclusion, implementing the XGBoost model with the optimized threshold of 0.62 provides a robust solution for credit risk assessment that significantly reduces financial losses while maintaining business viability.