Mahima Phalkey
Data Science Consultant at almaBetter
ML helps analyze vast amounts of data and predict credit risk accurately using advanced algorithms and statistical models.
Credit risk modeling using machine learning (ML) involves applying ML algorithms and techniques to assess the creditworthiness of borrowers and predict the likelihood of default. ML models can capture complex patterns and relationships in large datasets, which makes them useful for credit risk assessment. For a clear understanding of the business problem statement, you can read the article How Machine Learning is Revolutionizing Customer Credit Risk Management. This article shows how to implement credit risk modeling with machine learning in Python.
When assessing customer credit risk, it is crucial to consider the features that affect a borrower's creditworthiness.
We first need to review the features in the dataset to see what they contain.
We can see that all features in our dataset are relevant for prediction.
Here, default.payment.next.month is our dependent variable, the one we are going to predict.
X = df.drop(['default.payment.next.month'], axis=1)
y = df['default.payment.next.month']
Scaling data is essential in many machine learning algorithms, particularly those that use gradient descent optimization methods because it can improve the model's performance and make it more efficient. A few methods which can be used for scaling the data are:
1. StandardScaler: This method scales the data so that the data has a mean of zero and a standard deviation of one.
2. MinMaxScaler: This method scales the data such that it ranges between 0 and 1.
Scaling the independent features is therefore very important so that our model is not biased toward features with larger ranges of values. To bring all features onto a comparable scale, we can use StandardScaler.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)
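To make the two scaling methods concrete, here is a minimal sketch of the arithmetic behind them, written in plain Python rather than with scikit-learn. The helper names `standardize` and `min_max` and the credit limit values are assumptions made for illustration; they are not part of the dataset or of any library.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]


def min_max(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical credit limits for five customers (illustrative values only)
limits = [20000.0, 50000.0, 90000.0, 120000.0, 220000.0]

z_scores = standardize(limits)
ranged = min_max(limits)

print([round(z, 2) for z in z_scores])
print([round(r, 2) for r in ranged])
```

Standardized values are centered on zero and can be negative, while min-max values always stay between 0 and 1. Both put a feature such as the credit limit on a scale comparable to much smaller features such as age.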
Datasets (populations) are divided into a train set and a test set. Depending on the use case, there are many ways to divide the data, for example 70/30, 60/40, 75/25, 80/20, or even 50/50. In general, the training portion should be the larger of the two. Here we use an 80/20 split.
from sklearn.model_selection import train_test_split
# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
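The idea behind an 80/20 split can be sketched without any library: shuffle the row indices with a fixed seed, then set aside the first 20% as the test set. The function `split_indices` below is a hypothetical helper written for this illustration and is not how scikit-learn implements `train_test_split` internally.

```python
import random


def split_indices(n_samples, test_size=0.20, seed=42):
    """Shuffle row indices and split them into train and test index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed gives a reproducible shuffle
    n_test = round(n_samples * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(n_samples=10, test_size=0.20)
print(len(train_idx), len(test_idx))  # 8 2
```

Every row ends up in exactly one of the two sets. A common variation on the workflow in this article is to split first and fit the scaler on the training rows only, so that statistics from the test set do not influence preprocessing.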
In a classification problem, if the dependent variable is imbalanced, meaning that one class has significantly more observations than the other class(es), the model might favour the majority class and produce inaccurate predictions for the minority class.
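import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(6, 6))
# Pass the column by keyword: recent seaborn versions no longer accept it positionally
sns.countplot(x=df['default.payment.next.month'], palette='Reds_r')
plt.xticks([0, 1], labels=["Not Defaulted", "Defaulted"])
plt.title("Target Distribution")
plt.show()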
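The effect of imbalance can also be shown with a few lines of arithmetic. The sketch below uses made-up labels (78 non-defaulters and 22 defaulters, an assumption for illustration and not the distribution of the actual dataset) and a trivial "model" that always predicts the majority class.

```python
from collections import Counter

# Hypothetical labels: 0 = not defaulted, 1 = defaulted
labels = [0] * 78 + [1] * 22

counts = Counter(labels)
majority_class = counts.most_common(1)[0][0]

# A "model" that ignores its input and always predicts the majority class
predictions = [majority_class] * len(labels)

accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)
true_positives = sum(p == 1 and t == 1 for p, t in zip(predictions, labels))
recall = true_positives / counts[1]

print(counts)
print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")  # accuracy=0.78, recall=0.00
```

The accuracy looks acceptable even though not a single defaulter is identified, which is why imbalance has to be checked before trusting a classifier's headline score.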
Our dependent variable is imbalanced, which we need to fix. We can over-sample the defaulted category, as there is a large difference between the number of not-defaulted and defaulted customers. A well-known and influential data sampling algorithm in machine learning and data mining is SMOTE (Synthetic Minority Oversampling Technique). Rather than oversampling with replacement, SMOTE creates "synthetic" examples for the minority class. By default, the imbalanced-learn package uses the five nearest neighbors in the minority class to generate the synthetic examples.
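The core step of SMOTE, creating a synthetic example between a minority-class point and one of its neighbors, can be sketched in plain Python. This is a simplified illustration, not the imbalanced-learn implementation: it skips the nearest-neighbor search and assumes the neighbor has already been chosen. The function `synthetic_sample` and the two feature rows are hypothetical.

```python
import random


def synthetic_sample(point, neighbor, rng):
    """Return a random point on the line segment between point and neighbor."""
    gap = rng.random()  # random fraction in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbor)]


rng = random.Random(0)

# Two hypothetical minority-class rows (already scaled features)
defaulter_a = [0.2, 1.5]
defaulter_b = [0.6, 0.5]

new_row = synthetic_sample(defaulter_a, defaulter_b, rng)
print(new_row)
```

Because the new row lies between two real defaulters, it adds variety to the minority class instead of duplicating an existing observation.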
from imblearn.over_sampling import SMOTE
from collections import Counter
# summarize the class distribution
print("Before oversampling: ", Counter(y_train))
# oversampling strategy (lowercase name so the SMOTE class is not overwritten)
smote = SMOTE()
# fitting and applying smote to training set
X_train, y_train = smote.fit_resample(X_train, y_train)
# summarize the class distribution
print("After oversampling: ", Counter(y_train))
After over-sampling we got the results as:
Once our dataset is ready, we must choose which model to use to predict loan default. Because our dependent variable is discrete, we need a classification algorithm. Our model will indicate whether the customer will be a defaulter or not. Several models can be used for customer credit risk, depending on the nature of the data and the goals of the analysis. Here are some of the most common models used in customer credit risk assessment:
Let's take Logistic Regression, for example, to fit our model:
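from sklearn.linear_model import LogisticRegression
logit = LogisticRegression()
logit.fit(X_train, y_train)
# Predicted class labels (0 = not defaulted, 1 = defaulted)
pred_logit = logit.predict(X_test)
# Predicted probability of the positive class (default)
pred_proba = logit.predict_proba(X_test)[:, 1]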
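To show how a probability such as `pred_proba` relates to a class label such as `pred_logit`, here is a minimal sketch of what a fitted binary logistic regression computes for one borrower. The weights, bias, and feature row are invented for illustration; real coefficients come from training on the data.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_default(features, weights, bias, threshold=0.5):
    """Return (probability of default, predicted class) for one borrower."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    proba = sigmoid(score)
    return proba, int(proba >= threshold)


# Hypothetical coefficients and one scaled feature row, for illustration only
weights = [0.8, -0.4]
bias = -0.5
borrower = [1.2, 0.3]

proba, label = predict_default(borrower, weights, bias)
print(round(proba, 3), label)
```

The default class prediction corresponds to a 0.5 threshold on the probability. Lowering the threshold flags more borrowers as likely defaulters, which raises recall at the cost of more false alarms.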
Recall is a key metric in customer credit risk assessment because it measures the ability of a model to correctly identify borrowers who are likely to default on their loans. In other words, recall measures the proportion of true positives (i.e., borrowers who default) correctly identified as such by the model.
In the context of credit risk assessment, recall is an important metric because it is often more critical to identify borrowers who are likely to default than to correctly identify borrowers who will not default. This is because false negatives (i.e., borrowers who default but are not identified as such by the model) can result in significant financial losses for lenders and financial institutions.
For example, if a model has a high recall, it means that it is correctly identifying a high proportion of borrowers who are likely to default. This can allow lenders and financial institutions to take proactive steps to mitigate the risk of default, such as offering lower loan amounts or higher interest rates to high-risk borrowers or even declining to offer a loan altogether.
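As a worked example of the definition, recall can be computed directly from the counts of true positives and false negatives. The labels and predictions below are made up for ten hypothetical borrowers, and `confusion_counts` is a helper written for this illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN with 1 (default) as the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn


# Hypothetical labels and predictions for ten borrowers
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
recall = tp / (tp + fn)      # share of actual defaulters that were caught
precision = tp / (tp + fp)   # share of flagged borrowers that actually defaulted
print(f"recall={recall:.2f}, precision={precision:.2f}")  # recall=0.75, precision=0.75
```

Here one of the four actual defaulters is missed, so recall is 0.75. That missed borrower is the false negative that translates into a financial loss for the lender.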
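from sklearn.metrics import (classification_report, accuracy_score,
                             recall_score, confusion_matrix, roc_auc_score)
# Note: plot_confusion_matrix and plot_precision_recall_curve were removed in
# scikit-learn 1.2 (replaced by ConfusionMatrixDisplay and PrecisionRecallDisplay).
# Recall on the test set, followed by the full per-class report
print("The recall of the logit model is:", recall_score(y_test, pred_logit))
print(classification_report(y_test, pred_logit))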
Conclusion
Credit risk modeling for the fintech industry uses several features to assess a borrower's creditworthiness. The dataset used for modeling includes features such as limit balance, gender, education, marital status, age, history of past payments, bill statement amount, and amount of last payment. The independent and dependent variables are separated, the data is scaled using StandardScaler, and the data is then split into train and test sets.
1. What is the significance of the customer's payment history in determining credit risk?
Answer: The borrower's payment history is a key factor in determining credit risk. If the loan applicant has a history of late or missed payments, it suggests that they may have difficulty repaying the loan.
2. What does the number of past defaults indicate in assessing credit risk?
Answer: The number of past defaults indicates the borrower's creditworthiness. The more the number of past defaults, the higher the credit risk.
3. How do economic conditions in a city impact credit risk?
Answer: Economic conditions in a particular city can significantly impact credit risk. Factors such as unemployment rates, inflation rates, and GDP growth can affect the ability of borrowers to repay their loans.
4. Why is scaling data important in machine learning algorithms?
Answer: Scaling data is important in machine learning algorithms, particularly those that use gradient descent optimization methods, because it can improve the performance of the model and make it more efficient.
5. Why is it important to check if the dependent variable is imbalanced in a classification problem?
Answer: If the dependent variable is imbalanced, meaning that one class has significantly more observations than the other class(es), the model may be biased towards the majority class and may not be able to predict the minority class accurately. Hence, it is important to check if the dependent variable is imbalanced in a classification problem.
1. What is the purpose of scaling the data in machine learning algorithms?
Answer: c. To improve the performance and efficiency of the model
2. Which feature is a critical factor in determining credit risk, where higher values may make it more challenging for the borrower to repay the loan and they may be more likely to default?
Answer: c. Interest rate