Implementation of Credit Risk Prediction Using Machine Learning
Data Science Consultant at almaBetter
You can read the article How Machine Learning is Revolutionizing Customer Credit Risk Management to get a clear understanding of the business problem statement and how we will implement an ML model for credit risk.
LIMIT_BAL: The amount of the given credit (NT dollars), which includes both the individual consumer's credit and their family (supplementary) credit.
SEX: Gender, 1 for male and 2 for female.
EDUCATION: 1 for graduate school, 2 for university, 3 for high school, and 4 for others.
MARRIAGE: Marital status, 1 for married, 2 for single, 3 for divorced, and 0 for others.
PAY_0 — PAY_6: History of past payments, where PAY_0 = repayment status in September 2005, PAY_2 = repayment status in August 2005, and so on.
BILL_AMT1 — BILL_AMT6: Amount of the bill statement, where
a. BILL_AMT1 = amount of the bill statement in September 2005
b. BILL_AMT2 = amount of the bill statement in August 2005, and so on.
PAY_AMT1 — PAY_AMT6: Amount of the previous payment, where
a. PAY_AMT1 = amount paid in September 2005
b. PAY_AMT2 = amount paid in August 2005, and so on.
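To make the schema above concrete, here is a small sketch that builds a tiny toy frame with the same column names; the values are made up for illustration and are not rows from the real dataset:

```python
import pandas as pd

# Toy rows mimicking the credit card default schema (values are made up)
df_toy = pd.DataFrame({
    "LIMIT_BAL": [20000, 120000],
    "SEX": [2, 1],                        # 1 = male, 2 = female
    "EDUCATION": [2, 1],                  # 1 = graduate school, 2 = university
    "MARRIAGE": [1, 2],                   # 1 = married, 2 = single
    "PAY_0": [2, -1],                     # repayment status, September 2005
    "BILL_AMT1": [3913, 2682],            # bill statement, September 2005
    "PAY_AMT1": [0, 1000],                # amount paid, September 2005
    "default.payment.next.month": [1, 0]  # target: 1 = defaulted, 0 = not
})

print(df_toy.dtypes)
print(df_toy.shape)  # (2, 8)
```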
All the features in our dataset are relevant for prediction. Here, default.payment.next.month is the dependent variable we are going to predict.
X = df.drop(['default.payment.next.month'], axis=1)
y = df['default.payment.next.month']
Scaling data is essential in many machine learning algorithms, particularly those that use gradient descent optimization, because it can improve the model's performance and make training more efficient. A few methods commonly used for scaling data are min-max normalization, standardization (z-score scaling), and robust scaling.
Scaling the independent features is important so that our model is not biased toward features with larger value ranges. To bring all features into the same range, we can use StandardScaler:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X = scaler.fit_transform(X)
The dataset (population) is divided into a train set and a test set. Depending on the use case, there are many ways to split the data, for example 70/30, 60/40, 75/25, 80/20, or even 50/50. In general, the share of training data should be larger than the share of test data. We have used an 80/20 split.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
In a classification problem, if the dependent variable is imbalanced, meaning that one class has significantly more observations than the other class(es), the model might favor the majority class and produce inaccurate predictions for the minority class.
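Before plotting, a quick numeric check of the class balance can be done with value_counts. The sketch below uses a toy target series with a made-up 78/22 split standing in for the real column:

```python
import pandas as pd

# Toy target with a majority of non-defaulters (illustrative counts, not the real data)
y_toy = pd.Series([0] * 78 + [1] * 22, name="default.payment.next.month")

# Class proportions: anything far from 50/50 signals imbalance
proportions = y_toy.value_counts(normalize=True)
print(proportions)  # class 0 -> 0.78, class 1 -> 0.22
```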
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(6, 6))
sns.countplot(x=df['default.payment.next.month'], palette='Reds_r')
plt.xticks([0, 1], labels=["Not Defaulted", "Defaulted"])
plt.title("Target Distribution")
plt.show()
Our dependent variable is imbalanced, which we need to fix: there is a large gap between the not-defaulted and defaulted counts, so we can over-sample the defaulted category. A well-known and influential data sampling algorithm in machine learning and data mining is SMOTE (Synthetic Minority Oversampling Technique). Rather than oversampling with replacement, SMOTE creates "synthetic" examples for the minority class. The imbalanced-learn package sets the number of k nearest neighbors used to generate synthetic examples to five by default.
from collections import Counter
from imblearn.over_sampling import SMOTE

# summarize the class distribution
print("Before oversampling:", Counter(y_train))
# define the oversampling strategy
smote = SMOTE()
# fit and apply SMOTE to the training set
X_train, y_train = smote.fit_resample(X_train, y_train)
# summarize the class distribution
print("After oversampling:", Counter(y_train))
After over-sampling, both classes in the training set contain an equal number of observations.
After our dataset is ready, we must choose a model to predict loan default. Since the dependent variable is discrete, classification algorithms are the natural choice: the model will indicate whether the customer will be a defaulter or not. Several models can be used for customer credit risk assessment, depending on the nature of the data and the goals of the analysis; some of the most common are logistic regression, decision trees, random forests, gradient boosting, support vector machines, and neural networks.
Let's take Logistic Regression, for example, to fit our model:
from sklearn.linear_model import LogisticRegression

logit = LogisticRegression()
logit.fit(X_train, y_train)
pred_logit = logit.predict(X_test)
pred_proba = logit.predict_proba(X_test)[:, 1]
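Since predict_proba returns class-1 probabilities, we can also lower the default 0.5 decision threshold to flag more potential defaulters. The sketch below uses synthetic data, and the 0.3 threshold is an illustrative assumption rather than a tuned value:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Synthetic stand-in for the prepared credit data
X_demo, y_demo = make_classification(n_samples=500, random_state=42)
clf = LogisticRegression().fit(X_demo, y_demo)

proba = clf.predict_proba(X_demo)[:, 1]  # predicted probability of default
pred_default = clf.predict(X_demo)       # implicit 0.5 threshold
pred_low = (proba >= 0.3).astype(int)    # lower threshold: flag more defaulters

# Lowering the threshold can only keep or raise recall on the positive class
print(recall_score(y_demo, pred_default), recall_score(y_demo, pred_low))
```

The trade-off is that a lower threshold also raises the false-positive rate, so the threshold should be chosen against the business cost of each error type.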
Recall is a key metric in customer credit risk assessment because it measures the ability of a model to correctly identify borrowers who are likely to default on their loans. In other words, recall measures the proportion of true positives (i.e., borrowers who default) correctly identified as such by the model.
In the context of credit risk assessment, recall is an important metric because it is often more critical to identify borrowers who are likely to default than to correctly identify borrowers who will not default. This is because false negatives (i.e., borrowers who default but are not identified as such by the model) can result in significant financial losses for lenders and financial institutions.
For example, if a model has a high recall, it means that it is correctly identifying a high proportion of borrowers who are likely to default. This can allow lenders and financial institutions to take proactive steps to mitigate the risk of default, such as offering lower loan amounts or higher interest rates to high-risk borrowers or even declining to offer a loan altogether.
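Concretely, recall = TP / (TP + FN). A small numeric sketch with made-up labels shows the hand computation agreeing with sklearn's recall_score:

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels: 10 borrowers, 4 of whom truly default (class 1)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]  # one false positive, one false negative

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fn)                        # 3 true positives, 1 false negative
print(tp / (tp + fn))                # 0.75
print(recall_score(y_true, y_pred))  # 0.75 -- same value
```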
from sklearn.metrics import (classification_report, accuracy_score, recall_score,
                             confusion_matrix, roc_auc_score,
                             ConfusionMatrixDisplay, PrecisionRecallDisplay)

print("The recall of the logit model is:", recall_score(y_test, pred_logit))
print(classification_report(y_test, pred_logit))
Credit risk modeling for the fintech industry involves several features for assessing a borrower's creditworthiness. The dataset used for modeling includes features such as limit balance, gender, education, marital status, age, history of past payments, bill statement amounts, and the amount of the last payment. The independent and dependent variables are separated, the data is scaled using StandardScaler, and then split into train and test sets.
1. What is the significance of the customer's payment history in determining credit risk?
Answer: The borrower's payment history is a key factor in determining credit risk. If the loan applicant has a history of late or missed payments, it suggests that they may have difficulty repaying the loan.
2. What does the number of past defaults indicate in assessing credit risk?
Answer: The number of past defaults indicates the borrower's creditworthiness. The greater the number of past defaults, the higher the credit risk.
3. How do economic conditions in a city impact credit risk?
Answer: Economic conditions in a particular city can significantly impact credit risk. Factors such as unemployment rates, inflation rates, and GDP growth can affect the ability of borrowers to repay their loans.
4. Why is scaling data important in machine learning algorithms?
Answer: Scaling data is important in machine learning algorithms, particularly those that use gradient descent optimization methods, because it can improve the performance of the model and make it more efficient.
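As a quick check of what standardization does, the sketch below scales a toy feature with a large range (made-up credit-limit-like values) to zero mean and unit variance:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature with a much larger range than, say, a 0/1 flag
X_toy = np.array([[20000.0], [50000.0], [120000.0], [500000.0]])

X_scaled = StandardScaler().fit_transform(X_toy)
print(X_scaled.mean())  # approximately 0.0
print(X_scaled.std())   # approximately 1.0
```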
5. Why is it important to check if the dependent variable is imbalanced in a classification problem?
Answer: If the dependent variable is imbalanced, meaning that one class has significantly more observations than the other class(es), the model may be biased towards the majority class and may not be able to predict the minority class accurately. Hence, it is important to check if the dependent variable is imbalanced in a classification problem.