Filter Methods For Feature Selection in Machine Learning

Last Updated: 12th October, 2023

Gurneet Kaur

Data Science Consultant at AlmaBetter

Filter methods for feature selection in machine learning are like detectives hunting for clues. They comb through the dataset, evaluating each feature’s relevance and importance, to find the most valuable hints leading to an accurate prediction. These methods filter out the irrelevant features, leaving only the most informative and relevant ones to be used in building the model.

This article covers the following:

What is feature selection in Machine Learning?
Importance of feature selection in Machine Learning
Steps in Filter Methods for feature selection
Pearson’s Correlation
Linear Discriminant Analysis
Analysis of Variance
Apply Chi-Square
Conclusion

What is feature selection in Machine Learning?

Feature selection picks a subset of the most relevant predictive features for building a machine learning model. For instance, suppose we train a model on all the features and get an accuracy of around 65%, which is not good enough for a predictive model. After selecting features, without making any logical changes to the model code, the accuracy jumps to 81%, which is quite impressive.

Feature selection in machine learning is the process of identifying the most valuable and relevant set of features in a dataset to improve the accuracy and performance of a machine learning model. Think of a dataset as a vast ocean of information and feature selection as the process of finding the pearls hidden within. It’s important because not all features are equally valuable or informative, and using irrelevant features can lead to poor model performance and longer training times. Different methods can be used for feature selection, such as filter, wrapper, and embedded methods.

And like a treasure hunt, feature selection can be an iterative process, requiring multiple attempts and adjustments to find the optimal set of features. So, it’s not just a matter of finding the right features but also of selecting the right number of them.

Importance of feature selection in Machine Learning

Feature selection is the process of choosing which pieces of information (or “features”) are important for a computer to make a prediction or a decision. For example, imagine you’re trying to buy a new house and looking at houses in different neighborhoods. Each house has many features, like the number of bedrooms, the size of the backyard, and the distance to the nearest grocery store. However, you might not care about all those features when deciding which house to buy. For example, you might not care how many bedrooms a house has if you don’t have kids. The process of feature selection is like figuring out which features are important for you to consider.

By selecting the right set of features, you can improve the accuracy of your model, speed up the training process, and make it more interpretable.

Let’s consider another example. Suppose you’re building a model to predict whether or not a patient has a specific disease. The model has many features, including the patient’s age, blood pressure, cholesterol level, and family history. After applying feature selection, the model only includes the most informative features, such as the patient’s age and cholesterol level. As a result, the model’s accuracy improves and training is faster. This highlights the importance of feature selection, as it can lead to a more accurate and efficient model.

Steps in Filter Methods for feature selection

Evaluate the relevance of each feature using statistical measures such as Pearson’s Correlation, Linear Discriminant Analysis, Analysis of Variance, and Chi-Square methods.

Rank the features based on their relevance scores.

Select a threshold for relevance or a certain number of top-ranking features.

Remove all features that fall below the threshold or are not in the top-ranking group.

Use the remaining features to train the model.

Repeat the process with different thresholds or numbers of top-ranking features to find the optimal set of features.
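The steps above can be sketched in a few lines with scikit-learn. This is only an illustration under stated assumptions: the data is synthetic (generated with make_regression), the F-test from f_regression stands in for the relevance score, and the candidate values of k are arbitrary choices, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data (an assumption for illustration): 10 features, 3 of them informative.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)

# Score every feature against the target, then rank the features by score.
scores, p_values = f_regression(X, y)
ranking = np.argsort(scores)[::-1]
print("features ranked by F-score:", ranking.tolist())

# Keep the top k features, train on them, and repeat for several values of k.
# The selector sits inside the pipeline so it is refit on each training fold.
results = {}
for k in (1, 3, 5, 10):
    model = make_pipeline(SelectKBest(score_func=f_regression, k=k),
                          LinearRegression())
    results[k] = cross_val_score(model, X, y, cv=5, scoring="r2").mean()

for k, r2 in results.items():
    print(f"top {k:>2} features: mean cross-validated R^2 = {r2:.3f}")
```

Comparing the cross-validated scores across the different values of k is one simple way to carry out the final "repeat with different thresholds" step.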

Pearson’s Correlation

Pearson’s Correlation is a statistical measure that evaluates the linear relationship between two variables. It ranges from -1 to 1, where -1 indicates a perfect negative linear relationship, 0 indicates no linear relationship, and 1 indicates a perfect positive linear relationship. It can be a valuable tool in filter methods for feature selection in machine learning because it evaluates how strongly a feature is linearly related to the target variable.

For example, let’s say you’re building a model to predict the price of a car based on various features such as the car’s make, model, year, and mileage. You can use Pearson’s Correlation to evaluate the relationship between each numeric feature, such as year and mileage, and the car’s price; categorical features such as make and model have to be encoded as numbers first. A feature whose correlation with price is close to 1 or -1 is likely to be essential to the model, while a feature whose correlation is close to 0 may not be as informative.
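To make the car example concrete, here is a minimal sketch. The seven cars are invented purely for illustration, and pearson_r is a hypothetical helper that spells out the definition (covariance divided by the product of the standard deviations); pandas’ own corr() computes the same quantity.

```python
import numpy as np
import pandas as pd

# Hypothetical used-car data, invented purely for illustration.
cars = pd.DataFrame({
    "year":    [2012, 2014, 2015, 2017, 2018, 2020, 2021],
    "mileage": [140000, 120000, 95000, 70000, 60000, 30000, 15000],
    "price":   [4500, 6000, 7200, 9800, 11000, 15500, 18000],
})

def pearson_r(x, y):
    """Pearson's r: covariance of x and y over the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Correlation of each numeric feature with the target.
for col in ("year", "mileage"):
    print(col, round(pearson_r(cars[col], cars["price"]), 3))

# pandas computes the same coefficient for every pair of columns at once.
print(cars.corr(method="pearson")["price"])
```

In this made-up data, newer cars cost more and high-mileage cars cost less, so year comes out strongly positive and mileage strongly negative. Both would be kept by a filter that ranks on the absolute value of the correlation.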

Linear Discriminant Analysis

Linear Discriminant Analysis (LDA) is a supervised dimensionality reduction technique that can be used for feature selection in machine learning. It works by projecting the original feature space onto a lower-dimensional space that maximises the separation between different classes. The goal of LDA is to find the linear combination of features that best distinguishes the different classes in the dataset.

For example, let’s say you’re building a model to classify images of animals into different species. The model has features such as the animal’s shape, colour, and texture. LDA can identify the most relevant features for classifying the animals by finding a linear combination of features that maximises the separation between the different species. In this case, LDA may find that the shape and texture of the animal are the most relevant features for classification, while the colour is less informative.

LDA can be a powerful tool in feature selection because it considers the class labels, making it well-suited for classification problems. However, it assumes that the data is normally distributed and that the classes have equal covariance matrices, which may not always be the case. Additionally, it is sensitive to outliers, so it should be used in conjunction with other feature selection methods.
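As a rough sketch of how this looks in practice, the snippet below fits scikit-learn’s LinearDiscriminantAnalysis on the iris dataset that ships with the library. The dataset is an assumption chosen for convenience, not one used in this article. Note that LDA builds new axes out of all the features rather than dropping any; the weights in scalings_ only hint at which original features contribute most, and they depend on the scale of each feature.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Iris: 150 flowers, 4 numeric features, 3 species (bundled with scikit-learn).
iris = load_iris()
X, y = iris.data, iris.target

# With 3 classes, LDA can produce at most 2 discriminant axes.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

print("projected shape:", X_lda.shape)
print("share of between-class variance per axis:", lda.explained_variance_ratio_)

# Weight of each original feature in the discriminant axes.
for name, weights in zip(iris.feature_names, lda.scalings_):
    print(f"{name:<20} {weights[:2]}")
```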

Analysis of Variance

Analysis of Variance (ANOVA) is a statistical method that can be used to choose features in machine learning. It determines whether the means of two or more groups differ significantly for a particular feature. ANOVA can therefore help find features that have a significant impact on the target variable.

For example, let’s say you’re building a model to predict the yield of a crop based on various features such as temperature, rainfall, and fertilizer type. You can use ANOVA to evaluate whether there is a significant difference in the yield between different temperature levels, rainfall levels, and fertilizer types. ANOVA may find that temperature and fertilizer type significantly affect the yield, while rainfall does not.

This suggests that temperature and fertilizer type are essential features of the model. ANOVA is a powerful tool in feature selection because it can be used to identify features that have a significant effect on the target variable, and it can handle multiple groups. However, it assumes that the variances of the groups are equal and that the data are normally distributed, and these assumptions should be checked. Additionally, it should be used in conjunction with other feature selection methods.
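A one-way ANOVA for the crop example might look like the sketch below, using scipy.stats.f_oneway. The yields are randomly generated under an explicit assumption: fertilizer type is built to shift the mean yield, while the rainfall bands are built to share a single mean. The numbers are illustrative only.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)

# Hypothetical yields (tonnes per hectare), 30 plots per group.
# Fertilizer type is constructed to matter: each type has its own mean yield.
yield_by_fertilizer = [rng.normal(loc=mean, scale=0.5, size=30)
                       for mean in (4.0, 4.5, 5.5)]

# Rainfall band is constructed NOT to matter: all bands share the same mean.
yield_by_rainfall = [rng.normal(loc=4.7, scale=0.5, size=30)
                     for _ in range(3)]

# One-way ANOVA: do the group means differ more than chance would explain?
f_fert, p_fert = f_oneway(*yield_by_fertilizer)
f_rain, p_rain = f_oneway(*yield_by_rainfall)

print(f"fertilizer type: F = {f_fert:.2f}, p = {p_fert:.4f}")
print(f"rainfall band:   F = {f_rain:.2f}, p = {p_rain:.4f}")
```

A small p-value, for instance below 0.05, suggests the feature separates the target into groups with different means and is worth keeping. For classification targets, scikit-learn exposes the same test as f_classif.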

Apply Chi-Square

The Chi-Square test is a statistical method for feature selection in machine learning, especially for categorical variables. It is used to evaluate the relationship between two categorical variables and measure their degree of association. The Chi-Square test can be a valuable tool for identifying features strongly related to the target variable.

For example, consider creating a model to forecast whether a buyer will purchase a product based on age group, gender, and education level. You can use the Chi-Square test to evaluate the relationship between each feature and the target variable (purchase). You may find that gender and education level are strongly associated with the purchase, indicating that they are essential features of the model. In contrast, age group may not have a significant association and can be less informative.

When selecting features, the Chi-Square test can be effective because it is well-suited for categorical variables and can measure the strength of association between them and the target variable. However, it assumes that the sample size is large enough; otherwise, the results may not be reliable. Additionally, it should be used in conjunction with other feature selection methods.
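Here is a small sketch of the test using pandas.crosstab and scipy.stats.chi2_contingency. The 200 buyers are invented so that education level is associated with purchasing and age group is not; the column names and category labels are assumptions for illustration.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical survey of 200 buyers, invented for illustration.
# School-educated: 40 purchased, 60 did not. College-educated: 70 vs 30.
# Both age groups end up with the same 55 / 45 split.
buyers = pd.DataFrame({
    "education": ["school"] * 100 + ["college"] * 100,
    "age_group": (["under 40"] * 50 + ["40 and over"] * 50) * 2,
    "purchased": (["yes"] * 20 + ["no"] * 30) * 2
               + (["yes"] * 35 + ["no"] * 15) * 2,
})

results = {}
for feature in ("education", "age_group"):
    # Contingency table: counts of each feature value against the target.
    table = pd.crosstab(buyers[feature], buyers["purchased"])
    chi2, p_value, dof, expected = chi2_contingency(table)
    results[feature] = (chi2, p_value)
    print(f"{feature:<10} chi2 = {chi2:.2f}, p = {p_value:.4f}")
```

The feature with the larger statistic and smaller p-value is the one more strongly associated with the target. For many features at once, scikit-learn’s chi2 scorer can be plugged into SelectKBest, provided the features are non-negative counts or encoded categories.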

Let us see a practical implementation of feature selection using Pearson’s Correlation.

We are using a sample dataset.

About the data:

Consider a data set that concerns the hardening of cement. In particular, the researchers were interested in learning how the composition of the cement affects the heat emanated during hardening. Therefore, they measured and recorded the following data on 13 batches of cement. The variables are:

Response y: Heat emanated in calories during hardening of cement on a per gram basis

Predictor x1: % of tricalcium aluminate

Predictor x2: % of tricalcium silicate

Predictor x3: % of tetracalcium alumino ferrite

Predictor x4: % of dicalcium silicate

Step 1: Import the required libraries and load the data
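The original loading code is not reproduced in the text, so the sketch below is a stand-in. It types in the 13 rows of the widely published Hald cement data, which match the description above. Treat the inline values as an assumption about what the article loaded; the column names x1 to x4 and y follow the variable list.

```python
import pandas as pd

# Hald cement data, 13 batches (values typed in as an assumption;
# the article's own loading code is not available).
cement = pd.DataFrame({
    "x1": [7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10],
    "x2": [26, 29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68],
    "x3": [6, 15, 8, 8, 6, 9, 17, 22, 18, 4, 23, 9, 8],
    "x4": [60, 52, 20, 47, 33, 22, 6, 44, 22, 26, 34, 12, 12],
    "y":  [78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7,
           72.5, 93.1, 115.9, 83.8, 113.3, 109.4],
})

print(cement.shape)    # number of rows and columns
print(cement.head())   # first few batches
```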

Now, let us visually see the correlation between the features using the pairplot:

We can find the value of correlation between the different features using the corr() function:
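A minimal sketch of that call, again on an inline copy of the Hald cement values (an assumption, since the article’s data file is not available):

```python
import pandas as pd

# Hald cement data, typed in as an assumption about the article's dataset.
cement = pd.DataFrame({
    "x1": [7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10],
    "x2": [26, 29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68],
    "x3": [6, 15, 8, 8, 6, 9, 17, 22, 18, 4, 23, 9, 8],
    "x4": [60, 52, 20, 47, 33, 22, 6, 44, 22, 26, 34, 12, 12],
    "y":  [78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7,
           72.5, 93.1, 115.9, 83.8, 113.3, 109.4],
})

# Pairwise Pearson correlation between every pair of columns.
corr = cement.corr(method="pearson")
print(corr.round(3))

# The last column shows how each predictor relates to the response y.
print(corr["y"].drop("y").sort_values(ascending=False).round(3))
```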

We can clearly see the different values of correlation between the variables.

Let us plot a heatmap to get a clear view:

Using pandas Pearson Correlation:

Selecting features with a threshold of 0.6: the code here eliminates every predictor whose correlation coefficient with the response variable is less than 0.6. Because the correlation is computed between the response variable and each predictor variable, this retains the independent variables that are highly positively correlated with the response variable.

The data set is now reduced to three columns.
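The selection step could be written as follows. This is a sketch, not the article’s original code: the data is the inline Hald cement copy (an assumption) and the names corr_with_y, selected and reduced are made up for illustration.

```python
import pandas as pd

# Hald cement data, typed in as an assumption about the article's dataset.
cement = pd.DataFrame({
    "x1": [7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10],
    "x2": [26, 29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68],
    "x3": [6, 15, 8, 8, 6, 9, 17, 22, 18, 4, 23, 9, 8],
    "x4": [60, 52, 20, 47, 33, 22, 6, 44, 22, 26, 34, 12, 12],
    "y":  [78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7,
           72.5, 93.1, 115.9, 83.8, 113.3, 109.4],
})

threshold = 0.6

# Correlation of each predictor with the response y.
corr_with_y = cement.corr()["y"].drop("y")

# Keep the predictors whose correlation with y exceeds the threshold.
selected = corr_with_y[corr_with_y > threshold].index.tolist()
reduced = cement[selected + ["y"]]

print("selected predictors:", selected)
print("reduced shape:", reduced.shape)
```

Because the comparison is on the raw coefficient, a strongly negatively correlated predictor such as x4 is dropped as well. Filtering on corr_with_y.abs() instead would keep it, which is a design choice worth making deliberately.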

Building a Regression Model:

First, we split the dataset into training and test sets: 80% of the data is used for training and 20% for testing.

Building the model with selected features:

Building the regression model with all features:

Let us see the difference in the metrics in a tabular format.
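A comparison along these lines can be sketched as below. The article’s original code and random seed are not available, so random_state=42, the helper name evaluate and the inline Hald cement values are all assumptions. With 13 batches the test set holds only 3 rows, so the numbers are noisy and may not reproduce the article’s table.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (explained_variance_score, mean_absolute_error,
                             mean_squared_error, r2_score)
from sklearn.model_selection import train_test_split

# Hald cement data, typed in as an assumption about the article's dataset.
cement = pd.DataFrame({
    "x1": [7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10],
    "x2": [26, 29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68],
    "x3": [6, 15, 8, 8, 6, 9, 17, 22, 18, 4, 23, 9, 8],
    "x4": [60, 52, 20, 47, 33, 22, 6, 44, 22, 26, 34, 12, 12],
    "y":  [78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7,
           72.5, 93.1, 115.9, 83.8, 113.3, 109.4],
})

def evaluate(feature_names, random_state=42):
    """Fit a linear regression on an 80/20 split and score it on the test rows."""
    X_train, X_test, y_train, y_test = train_test_split(
        cement[feature_names], cement["y"],
        test_size=0.2, random_state=random_state)
    model = LinearRegression().fit(X_train, y_train)
    pred = model.predict(X_test)
    return {
        "explained_variance": explained_variance_score(y_test, pred),
        "mae": mean_absolute_error(y_test, pred),
        "mse": mean_squared_error(y_test, pred),
        "r2": r2_score(y_test, pred),
    }

# Same seed for both models, so both are scored on the same test rows.
metrics = pd.DataFrame({
    "selected (x1, x2)": evaluate(["x1", "x2"]),
    "all features": evaluate(["x1", "x2", "x3", "x4"]),
})
print(metrics.round(3))
```

Adjusted R2 is left out of this sketch: with 3 test rows and 2 predictors, its denominator (n - p - 1) is zero, so it cannot be computed on the test set.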

From the output provided it appears that the model with selected features (using Pearson Correlation) performed better than the model with all features on the cement dataset. The model with selected features had higher explained variance, lower mean absolute error, mean square error, mean squared log error, and higher R2 and adjusted R2 values compared to the model with all features. This suggests that the model with selected features was able to better capture the underlying relationship between the predictors and the target variable.

Conclusion

In simple words, filter methods for feature selection in machine learning are a way to pick the most valuable information from a large data set that can help your model make better predictions. It’s like looking for the needle in a haystack, where the needle is the essential feature that is relevant to your problem. By using filter methods, you can eliminate irrelevant features and often improve the accuracy of your model.

For example, imagine you are trying to predict a song’s popularity. Features such as lyrics, beats, and artist would be relevant, while the color of the album cover would not be. By using filter methods to select only the most critical features, you can make better predictions about the song’s popularity.
