Data Science

Understanding Credit Risk Using Machine Learning in Python

Last Updated: 13th June, 2023

Mahima Phalkey

Data Science Consultant at almaBetter

ML helps analyze large volumes of data and predict credit risk accurately using advanced algorithms and statistical models.

Credit risk modeling using machine learning (ML) involves leveraging ML algorithms and techniques to assess and predict the creditworthiness of borrowers and the likelihood of default. ML models can capture complex patterns and relationships in large datasets, making them useful for credit risk assessment. For background on the business problem statement, see the article How Machine Learning is Revolutionizing Customer Credit Risk Management. This article focuses on how to implement credit risk prediction with machine learning in Python.

When assessing customer credit risk, it is crucial to consider several features that can affect the creditworthiness of borrowers.

Basic Features
1. Loan amount: The loan amount is a critical factor in determining credit risk. Generally, the larger the loan amount, the higher the risk.
2. Loan term: The length of the loan term is another vital consideration. Longer loan terms may be riskier because the borrower is more likely to default over the extended repayment period.
3. Interest rate: The interest rate is a critical factor that impacts the credit risk of the loan. Higher interest rates may make it more challenging for the borrower to repay the loan, and they may be more likely to default.
4. Payment history: The borrower's payment history is crucial in determining credit risk. If the loan applicant has a history of late or missed payments, it suggests that they may have difficulty repaying the loan.
Customer default history
1. The number of past defaults: The number of times a customer has defaulted on a loan or credit product is an essential indicator of credit risk.
2. Time since the last default: The time since the customer's previous default can also be an essential indicator of credit risk.
Customer demography
1. Age: Age is a significant factor in assessing credit risk, as younger customers may have less experience managing finances and a limited credit history. (Age is not always a suitable feature: some fintech companies avoid it because, in some countries, regulations restrict the use of age in credit risk modelling or require that models do not discriminate against certain groups of people.)
2. Income: Income is crucial in determining creditworthiness, as higher-income customers are typically seen as less risky borrowers.
3. Employment status: The employment status of a customer can also impact credit risk. Customers with stable, full-time employment are typically viewed as less risky borrowers than those with part-time or self-employment income.
City level features
1. Economic conditions: Economic conditions in a particular city can significantly impact credit risk. Factors such as unemployment rates, inflation rates, and GDP growth can affect the ability of borrowers to repay their loans.
2. Demographics: City-level demographic factors such as age, education level, and income can impact credit risk. For example, younger borrowers or those with lower levels of education may be considered at higher risk due to their lack of financial experience.

Features Selection:

We need to check the features involved in the dataset first to review what they contain.

LIMIT_BAL: The amount of credit given (in NT dollars), which includes both the individual consumer credit and their family (supplementary) credit.
SEX: Gender, 1 for male and 2 for female.
EDUCATION: 1 for graduate school, 2 for university, 3 for high school, and 4 for others.
MARRIAGE: Marital status, 1 for married, 2 for single, 3 for divorce, and 0 for others.
AGE: Age.
PAY_0 to PAY_6: History of past payments, where
a. -2 = no consumption
b. -1 = paid in full
c. 0 = use of revolving credit
d. 1 = payment delay for one month
e. 2 = payment delay for two months, and so on.
BILL_AMT1-BILL_AMT6: Amount of the bill statement, where
a. BILL_AMT1 = amount of bill statement in September 2005
b. BILL_AMT2 = amount of bill statement in August 2005, and so on.
PAY_AMT1-PAY_AMT6: Amount of previous payment, where
a. PAY_AMT1 = amount paid in September 2005
b. PAY_AMT2 = amount paid in August 2005, and so on.

We can see that all features in our dataset are relevant for prediction.
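While exploring the data, it can help to translate the repayment status codes into readable labels. The following is a minimal sketch, assuming the codes are -2, -1, 0, and positive integers for months of delay, as in the original dataset description. The helper name `describe_pay_status` is hypothetical and not part of any library.

```python
def describe_pay_status(code: int) -> str:
    """Translate a repayment status code into a readable label."""
    labels = {
        -2: "no consumption",
        -1: "paid in full",
        0: "use of revolving credit",
    }
    if code in labels:
        return labels[code]
    if code >= 1:
        unit = "month" if code == 1 else "months"
        return f"payment delayed by {code} {unit}"
    raise ValueError(f"unexpected status code: {code}")


# Print a label for a few example codes
for code in (-2, -1, 0, 1, 3):
    print(code, "->", describe_pay_status(code))
```

With a pandas DataFrame, the same function could be applied to a status column through `Series.map`.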

Segregating Independent and Dependent Variable

Here, default.payment.next.month is our dependent variable, which we are going to predict.

X = df.drop(['default.payment.next.month'], axis=1)
y = df['default.payment.next.month']

Scaling our data

Scaling data is essential in many machine learning algorithms, particularly those that use gradient descent optimization methods because it can improve the model's performance and make it more efficient. A few methods which can be used for scaling the data are:

1. StandardScaler: This method scales each feature so that it has a mean of zero and a standard deviation of one: z = (x - mean) / standard deviation.

2. MinMaxScaler: This method rescales each feature so that it ranges between 0 and 1: x_scaled = (x - min) / (max - min).

Scaling the independent features is important so that our model is not biased toward features with larger numeric ranges. To bring all features onto a comparable scale, we use StandardScaler here.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)
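To see what the two scalers compute, the arithmetic can be reproduced by hand. This is a minimal sketch in plain Python, not the scikit-learn implementation. The credit limit values are made up for illustration, and the function names `standard_scale` and `min_max_scale` are hypothetical.

```python
from statistics import mean, pstdev


def standard_scale(values):
    """Z-score scaling: subtract the mean, divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation, as StandardScaler uses
    return [(v - mu) / sigma for v in values]


def min_max_scale(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


credit_limits = [20000, 50000, 90000, 120000, 500000]  # made-up values
z_scores = standard_scale(credit_limits)
unit_range = min_max_scale(credit_limits)

print([round(z, 2) for z in z_scores])
print([round(u, 2) for u in unit_range])
```

Both functions assume the feature is not constant. A constant feature would give a zero denominator and needs separate handling.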

Splitting into Train and Test Set

A dataset is divided into a train set and a test set. Depending on the use case, common splits include 60/40, 70/30, 75/25, or 80/20, and occasionally 50/50. In general, the training set should be larger than the test set. Here we use an 80/20 split.


from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
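One caveat: the scaler above was fitted on the full dataset before the split, so the test rows influence the scaling statistics. A common alternative is to split first and fit the scaler on the training rows only. The sketch below illustrates this with synthetic data standing in for the real features. The array names and the `stratify` argument are assumptions added here, not part of the original workflow.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix and target (assumption)
rng = np.random.default_rng(42)
X_raw = rng.normal(loc=50_000, scale=20_000, size=(1000, 3))
y_raw = rng.choice([0, 1], size=1000, p=[0.8, 0.2])

# Split first; stratify keeps the class ratio similar in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.20, random_state=42, stratify=y_raw
)

# Fit the scaler on the training rows only, then reuse it on the test rows
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

The test set is transformed with the training mean and standard deviation, so its scaled columns will be close to, but not exactly, zero mean and unit variance.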

Checking if our Dependent Variable is Imbalanced

In a classification problem, if the dependent variable is imbalanced, meaning that one class has significantly more observations than the other class(es), the model might favour the majority class and produce inaccurate predictions for the minority class.

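import matplotlib.pyplot as plt
import seaborn as sns

# Plot how many customers fall into each target class
plt.figure(figsize=(6, 6))
sns.countplot(x=df['default.payment.next.month'], palette='Reds_r')
plt.xticks([0, 1], labels=["Not Defaulted", "Defaulted"])
plt.title("Target Distribution")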
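Alongside the plot, the imbalance can be quantified with a simple count. This is a minimal sketch using made-up labels rather than the real target column. The 78/22 split is illustrative only and is not a result from the dataset.

```python
from collections import Counter

# Made-up labels standing in for df['default.payment.next.month']
labels = [0] * 78 + [1] * 22

counts = Counter(labels)
total = sum(counts.values())
for cls, n in sorted(counts.items()):
    print(f"class {cls}: {n} rows ({n / total:.1%})")

# Ratio of majority to minority class; values well above 1 signal imbalance
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"imbalance ratio: {imbalance_ratio:.2f}")
```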

Our dependent variable is imbalanced, which we need to fix. Because there are far more non-defaulted than defaulted customers, we can oversample the defaulted category. A well-known and influential data sampling algorithm in machine learning and data mining is SMOTE (Synthetic Minority Oversampling Technique). Rather than oversampling with replacement, SMOTE creates "synthetic" examples for the minority class. By default, the imbalanced-learn package uses the five nearest neighbors in the minority class to generate the synthetic examples.

from imblearn.over_sampling import SMOTE
from collections import Counter

# summarize the class distribution
print("Before oversampling: ", Counter(y_train))

# oversampling strategy (named `smote` so the SMOTE class is not overwritten)
smote = SMOTE()

# fitting and applying SMOTE to the training set only
X_train, y_train = smote.fit_resample(X_train, y_train)

# summarize the class distribution
print("After oversampling: ", Counter(y_train))

After oversampling, we got the following results:

[Screenshot: class counts before and after oversampling]
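The core idea behind SMOTE, placing a new point somewhere on the line segment between a minority sample and one of its nearest minority neighbors, can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the imbalanced-learn implementation. The function names and the sample points are hypothetical.

```python
import math
import random


def nearest_neighbors(point, others, k):
    """Return the k points in `others` closest to `point` (Euclidean distance)."""
    return sorted(others, key=lambda o: math.dist(point, o))[:k]


def synthesize(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating toward a neighbor."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbor = rng.choice(nearest_neighbors(base, others, k))
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


# Made-up minority-class samples with two features each
minority_points = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2),
                   (1.2, 2.5), (1.8, 2.6), (2.2, 1.9)]
new_points = synthesize(minority_points, n_new=4)
print(new_points)
```

Because each synthetic point lies between two existing minority samples, it stays inside the region the minority class already occupies instead of duplicating a row exactly.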

Choosing ML Model

After our dataset is ready, we must choose which model to use to predict loan default. As our dependent variable is discrete, we will use a classification algorithm. Our model will indicate whether or not the customer will default. Several models can be used for customer credit risk, depending on the nature of the data and the goals of the analysis. Here are some of the most common models used in customer credit risk assessment:

Logistic regression: Logistic regression is a statistical model that estimates the probability of a binary outcome, such as whether a borrower will stop making payments on a loan. It is a popular model for credit risk assessment because it is relatively simple to implement and interpret.
Decision tree: A decision tree is a model that represents decisions and their potential outcomes using a tree-like structure. It can be used to predict credit risk by considering a set of input variables, such as credit score, income, and loan amount, and then branching to different outcomes based on the values of those variables.
Random forest: A random forest is an ensemble technique that combines many decision trees to increase prediction accuracy. It can predict credit risk by considering multiple input variables and generating a more robust prediction based on the aggregate output of many individual trees.
Neural network: A neural network is a complex machine learning model that can make predictions based on many input variables. It can be used to predict credit risk by training the network on historical data and then using the network to make predictions about new borrowers.
Gradient boosting: Gradient boosting is an ensemble technique that builds a predictive model by iteratively adding weak models, each one correcting the errors of the models before it. It can predict credit risk by considering multiple input variables and refining the prediction with each added model.

Let's take Logistic Regression, for example, to fit our model:

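from sklearn.linear_model import LogisticRegression

# Fit the model on the (oversampled) training data
logit = LogisticRegression()
logit.fit(X_train, y_train)

# Predicted class labels and predicted probability of default (class 1)
pred_logit = logit.predict(X_test)
pred_proba = logit.predict_proba(X_test)[:, 1]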
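To make the probability output less of a black box, the following sketch shows how a fitted logistic regression turns one customer's features into a probability of default: a weighted sum passed through the sigmoid function. The weights, bias, and feature values are hypothetical and are not fitted values from this dataset.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def default_probability(features, weights, bias):
    """Logistic regression score: sigmoid of the weighted feature sum."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical coefficients and one scaled customer record (not fitted values)
weights = [0.8, -0.5, 0.3]
bias = -1.0
customer = [1.2, 0.4, -0.7]

p = default_probability(customer, weights, bias)
# For a binary problem, the predicted label is equivalent to a 0.5 cut-off
predicted_class = int(p >= 0.5)
print(f"probability of default: {p:.3f}, predicted class: {predicted_class}")
```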

Metric to be used to check model performance

Recall is a key metric in customer credit risk assessment because it measures the ability of a model to correctly identify borrowers who are likely to default on their loans. In other words, recall is the proportion of actual positives (i.e., borrowers who default) that the model correctly identifies as such.

In the context of credit risk assessment, recall is an important metric because it is often more critical to identify borrowers who are likely to default than to correctly identify borrowers who will not default. This is because false negatives (i.e., borrowers who default but are not identified as such by the model) can result in significant financial losses for lenders and financial institutions.

For example, if a model has a high recall, it means that it is correctly identifying a high proportion of borrowers who are likely to default. This can allow lenders and financial institutions to take proactive steps to mitigate the risk of default, such as offering lower loan amounts or higher interest rates to high-risk borrowers or even declining to offer a loan altogether.
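The definition of recall can be checked by hand on a small example. This is a minimal sketch with made-up labels. The helper `confusion_counts` is hypothetical and is not a scikit-learn function. Precision is computed as well to show that the two metrics can differ.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false negatives, and false positives."""
    tp = fn = fp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive and predicted == positive:
            tp += 1
        elif actual == positive:
            fn += 1
        elif predicted == positive:
            fp += 1
    return tp, fn, fp


# Made-up outcomes for ten borrowers: 1 = defaulted, 0 = did not default
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tp, fn, fp = confusion_counts(actual, predicted)
recall = tp / (tp + fn)       # share of actual defaulters that were caught
precision = tp / (tp + fp)    # share of flagged borrowers who actually defaulted
print(f"TP={tp}, FN={fn}, FP={fp}, recall={recall:.2f}, precision={precision:.2f}")
```

Here the model catches three of the four defaulters, so the one missed defaulter is the false negative that recall penalizes.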

from sklearn.metrics import (
    classification_report, accuracy_score, recall_score,
    confusion_matrix, roc_auc_score,
)
# plot_confusion_matrix and plot_precision_recall_curve are not imported:
# they were removed in scikit-learn 1.2 and are not used here

print("The recall of the logit model is:", recall_score(y_test, pred_logit))

print(classification_report(y_test, pred_logit))

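Since recall is the priority, one option is to use the predicted probabilities (such as `pred_proba` above) with a cut-off lower than the default 0.5. The sketch below uses made-up labels and probabilities to show the trade-off: lowering the threshold raises recall but flags more borrowers overall. The function name `recall_at_threshold` is hypothetical.

```python
def recall_at_threshold(y_true, probabilities, threshold):
    """Recall when every probability >= threshold is flagged as a default."""
    flagged = [p >= threshold for p in probabilities]
    tp = sum(1 for t, f in zip(y_true, flagged) if t == 1 and f)
    actual_positives = sum(y_true)
    return tp / actual_positives


# Made-up labels and predicted default probabilities for eight borrowers
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
probs = [0.9, 0.6, 0.45, 0.3, 0.55, 0.4, 0.2, 0.1]

for threshold in (0.5, 0.4, 0.25):
    r = recall_at_threshold(y_true, probs, threshold)
    n_flagged = sum(p >= threshold for p in probs)
    print(f"threshold={threshold}: recall={r:.2f}, borrowers flagged={n_flagged}")
```

Whether a lower threshold is worthwhile depends on the relative cost of a missed default and a wrongly declined customer, which is a business decision and not a modeling one.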

Conclusion

Credit risk modeling in the fintech industry draws on several features to assess a borrower's creditworthiness. The dataset used here includes the credit limit, gender, education, marital status, age, history of past payments, bill statement amounts, and amounts of previous payments. We separated the independent and dependent variables, scaled the data using StandardScaler, and split it into train and test sets. We then balanced the training set with SMOTE, fitted a logistic regression model, and evaluated it using recall.

Summary

We covered the basic domain requirements for credit risk modeling.
Based on our business need, we chose Logistic Regression as the model.
We evaluated the model's performance using recall.

Interview Questions

1. What is the significance of the customer's payment history in determining credit risk?

Answer: The borrower's payment history is a key factor in determining credit risk. If the loan applicant has a history of late or missed payments, it suggests that they may have difficulty repaying the loan.

2. What does the number of past defaults indicate in assessing credit risk?

Answer: The number of past defaults is an indicator of the borrower's creditworthiness. The higher the number of past defaults, the higher the credit risk.

3. How do economic conditions in a city impact credit risk?

Answer: Economic conditions in a particular city can significantly impact credit risk. Factors such as unemployment rates, inflation rates, and GDP growth can affect the ability of borrowers to repay their loans.

4. Why is scaling data important in machine learning algorithms?

Answer: Scaling data is important in machine learning algorithms, particularly those that use gradient descent optimization methods, because it can improve the performance of the model and make it more efficient.

5. Why is it important to check if the dependent variable is imbalanced in a classification problem?

Answer: If the dependent variable is imbalanced, meaning that one class has significantly more observations than the other class(es), the model may be biased towards the majority class and may not be able to predict the minority class accurately. Hence, it is important to check if the dependent variable is imbalanced in a classification problem.

Quiz

1. What is the purpose of scaling the data in machine learning algorithms?

a. To make the data more difficult to work with
b. To make the model biased towards higher range of values
c. To improve the performance and efficiency of the model
d. To increase the population representation of the data

Answer: c. To improve the performance and efficiency of the model

2. Which feature is a critical factor in determining credit risk, where higher values may make it more challenging for the borrower to repay the loan and they may be more likely to default?

a. Loan term
b. Payment history
c. Interest rate
d. Loan amount

Answer: c. Interest rate