Junior Data Scientist at ShelfPerks at almaBetter
Doesn’t ‘CatBoost’ sound like we are boosting a cat, from the words ‘Cat’ + ‘Boost’? Let’s dive in to find out what it actually is.
CatBoost is a machine learning algorithm based on gradient boosting. It was developed by researchers and engineers at Yandex (a Russian tech company) in 2017 to serve a range of purposes such as recommendation systems, personal assistants, self-driving cars, weather prediction, and many other tasks. It often compares favourably with existing boosting algorithms such as XGBoost and LightGBM in several respects, including training time, accuracy, computational cost, hyperparameter tuning, and ease of implementation.
In CatBoost, the term ‘Cat’ refers to Category and ‘Boost’ refers to Boosting. This doesn’t mean that CatBoost is restricted to categorical features: it also supports numerical and text features, but it has an especially effective technique for handling categorical data.
The main difference between CatBoost and other boosting algorithms is that CatBoost uses symmetric trees and builds them level by level.
A symmetric tree is a tree in which all nodes on the same level use the same split, as shown in the picture above. This allows the path to a leaf to be encoded as an index, which reduces prediction time, and that is extremely important in low-latency environments.
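To make the "path encoded as an index" idea concrete, here is a minimal sketch of prediction with a single symmetric tree. It is an illustration only, not CatBoost's internal implementation: the names leaf_index, predict, splits and leaf_values are made up for this example, and the convention that going right sets the bit is an assumption.

```python
from typing import List, Tuple

# Each level of a symmetric tree is one (feature_index, threshold) split
# shared by every node on that level. All names here are illustrative.
Split = Tuple[int, float]


def leaf_index(row: List[float], splits: List[Split]) -> int:
    """Encode the root-to-leaf path as an integer, one bit per tree level."""
    index = 0
    for level, (feature, threshold) in enumerate(splits):
        if row[feature] > threshold:
            index |= 1 << level  # set this level's bit when the row goes right
    return index


def predict(row: List[float], splits: List[Split], leaf_values: List[float]) -> float:
    """Read the prediction straight from the leaf table using the index."""
    return leaf_values[leaf_index(row, splits)]


# A depth-2 tree: level 0 tests feature 0 > 0.5, level 1 tests feature 1 > 10.0
splits = [(0, 0.5), (1, 10.0)]
leaf_values = [1.0, 2.0, 3.0, 4.0]  # 2**depth leaves

print(predict([0.9, 3.0], splits, leaf_values))
```

Because every level asks the same question, the prediction needs only one comparison per level plus a table lookup, which is where the low-latency benefit comes from.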
How does CatBoost work?
Let’s look at a dataset that has 10 data points ordered in time.
To calculate the residual for each data point, we use a model that has been trained only on the data points that come before it (for example, to calculate the residual of x5, we train a model on the data points x1, x2, x3, and x4). Training a separate model for every data point becomes computationally expensive when the dataset is large. So instead of training a different model for each data point, CatBoost trains only about log(number_of_datapoints) models: if a model has been trained on n data points, that model is used to calculate the residuals for the next n data points.
In the dataset above, we calculate the residuals of x5, x6, x7, and x8 using the model that has been trained on the data points x1, x2, x3, and x4. This process is known as ordered boosting.
CatBoost creates several random permutations of the given dataset and applies ordered boosting to each of them. By default, CatBoost creates four random permutations. This randomness further helps to prevent overfitting of the model, and it can be controlled by tuning parameters.
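The doubling rule described above can be sketched in a few lines. This only lists which model would score which points. It does not train anything, and it is not CatBoost's actual code. The function name ordered_boosting_schedule is hypothetical, and starting from a model trained on a single point is an assumption made for illustration.

```python
from typing import List, Tuple


def ordered_boosting_schedule(n_points: int) -> List[Tuple[int, List[int]]]:
    """Return (training_prefix_size, points_scored) pairs, 1-based like x1..x10.

    A model trained on the first n points scores the next n points, so the
    prefix size doubles each round and only about log2(n_points) models are needed.
    """
    schedule = []
    prefix = 1  # assumption: the first model is trained on x1 alone
    while prefix < n_points:
        scored = list(range(prefix + 1, min(2 * prefix, n_points) + 1))
        schedule.append((prefix, scored))
        prefix *= 2
    return schedule


for prefix, scored in ordered_boosting_schedule(10):
    names = ", ".join(f"x{i}" for i in scored)
    print(f"model trained on x1..x{prefix} scores {names}")
```

For the 10-point dataset this gives four models instead of one per data point, and the third of them is exactly the x1..x4 model that scores x5 to x8.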
What are the features of CatBoost?
Great quality without parameter tuning : CatBoost’s default parameters often give good results on their own, so there is less need to tune hyperparameters, which in turn reduces the time spent on tuning.
Categorical features support : CatBoost works with non-numeric features directly, which reduces the time spent on pre-processing the data and can improve the training result too.
Fast and scalable GPU version : Training the model on a GPU gives a speedup compared to training it on a CPU. CatBoost efficiently supports multi-card configurations for large datasets.
Improved accuracy : CatBoost reduces overfitting when constructing models by using a novel gradient-boosting scheme.
Fast prediction : CatBoost can use distributed GPUs to learn faster, and it is reported to make predictions 13–16 times faster than other algorithms.
When to use CatBoost?
When a short training time on robust data is required.
When you are working on a dataset that has categorical features and you want to avoid converting these features into a numerical format.
When you need a model that is considerably faster than many other algorithms.
When not to use CatBoost?
How to implement the CatBoost algorithm?
i. Import the libraries/modules needed.
import pandas as pd
import numpy as np
from catboost import CatBoostRegressor
ii. Split your data into train and test.
from sklearn.model_selection import train_test_split
X_train, X_validation, y_train, y_validation = train_test_split(X, y, train_size=0.7, random_state=1234)
iii. Train your model using CatBoost.
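model = CatBoostRegressor(iterations=50, depth=3, learning_rate=0.1, loss_function='RMSE')
# categorical_features_indices: column indices of the categorical features in X (assumed to be defined beforehand)
model.fit(X_train, y_train, cat_features=categorical_features_indices, eval_set=(X_validation, y_validation), plot=True)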
iv. Finally, predict the model results and evaluate your model.
# predict on the held-out validation set created in step ii
y_pred = model.predict(X_validation)
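Step iv mentions evaluating the model, but no metric is shown. Since the regressor in step iii is trained with the RMSE loss, one reasonable choice is to report RMSE on the held-out data. The helper below is a minimal sketch written for this article (the name rmse is not a CatBoost function), and the numbers are toy values, not results from any real model.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error between true values and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy numbers for illustration only; in the walkthrough these would be
# the validation targets and the values returned by the predict call.
print(rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0]))
```

A lower value means the predictions are closer to the true targets, in the same units as the target variable.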
What are the applications?
i. Weather Forecasting : Accurate forecasting of meteorological elements (such as wind, temperature, humidity, and rainfall) is critical and is widely used in many fields. The data is continuously observed and recorded using data mining techniques, and the CatBoost algorithm is expected to help predict future meteorological elements and improve the accuracy of weather forecasting.
ii. Fraud Detection : A set of processes and analyses that allow businesses to identify and prevent unauthorized activities in various sectors, including finance, banking, insurance, and Medicare.
iii. Sales Forecasting : Sales forecasting has become vital in the retail industry, where it helps business owners to accurately predict the sales of thousands of products and make optimal decisions based on those predictions.
iv. Loan Default Prediction : A loan default occurs when the borrower fails to make timely payments or stops making payments to a bank or other financial company. A model that can predict likely defaulters is therefore useful to banks and financial institutions before they approve a loan for a particular customer.
and many more applications.
So this is a short overview of the CatBoost algorithm. Use it to train your model when you need results up to two times faster than LightGBM and up to twenty times faster than XGBoost.