Overview
Data cleaning is the process of preparing a dataset for machine learning algorithms. It includes assessing data quality, handling missing values, handling outliers, transforming data, merging and deduplicating data, and handling categorical variables. This foundational step ensures the data is ready for machine learning algorithms: it reduces the risk of errors and improves the accuracy of the resulting models.
Data quality assessment:
Data quality assessment is the first step of data cleaning: examining the dataset to understand its shape and to check for missing values, duplicates, outliers, and class imbalance. This initial audit reveals which of the later cleaning steps are actually needed.
Example
Let's consider the iris dataset:
#importing libraries
import pandas as pd
from sklearn.datasets import load_iris
#loading the dataset
iris_data = load_iris()
#storing the data into a pandas dataframe
iris_df = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
#adding the target column so class balance can be checked
iris_df['target'] = iris_data.target
#checking the shape of the dataset
iris_df.shape
#checking the number of missing values
iris_df.isnull().sum()
#checking for duplicates
iris_df.duplicated().sum()
#checking for outliers
iris_df.describe()
#checking for class imbalance
iris_df.groupby('target').size()
This example performs a data quality assessment by checking the shape of the dataset, the number of missing values, duplicates, outliers, and class imbalance.
Handling missing values:
Handling missing values in machine learning is an important preprocessing step that is essential for building accurate and reliable models. Missing values can occur for various reasons, such as data entry errors, sensor failures, or simply because certain data points were not collected.
Here are some common strategies for handling missing values in machine learning:
- Deletion: removing the rows (or columns) that contain missing values.
- Simple imputation: replacing missing values with the mean, median, or mode of the feature.
- Model-based imputation: predicting the missing values from the other features.
Ultimately, the choice of how to handle missing values depends on the particular context and the nature of the missing values. It is important to weigh the advantages and drawbacks of each approach and to select the one that is most suitable for the problem at hand.
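Before applying any strategy to the iris data, the first two approaches above can be compared side by side. This is a minimal sketch on a small hypothetical DataFrame, not part of the tutorial's dataset:

```python
import pandas as pd

# A small hypothetical DataFrame with one missing value
df = pd.DataFrame({"a": [1.0, None, 3.0], "b": [4.0, 5.0, 6.0]})

# Deletion: drop any row containing a missing value
dropped = df.dropna()

# Simple imputation: fill missing values with the column median
imputed = df.fillna(df.median())
```

Deletion shrinks the dataset (here from three rows to two), while imputation keeps every row but introduces an estimated value.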
Example
#checking for the shape of the dataset
iris_df.shape
#checking for number of missing values
iris_df.isnull().sum()
#replacing the missing values with the median of the feature
iris_df = iris_df.fillna(iris_df.median())
#verifying that there is no missing data
iris_df.isnull().sum()
This code handles missing values by first checking how many there are, then replacing them with the median of each feature, and finally verifying that no missing data remains.
Handling outliers:
Outliers are data points that are significantly different from the rest of the data. Handling outliers in machine learning is the process of identifying and treating outliers in the dataset. This can be done by either dropping the outliers or transforming them. Dropping the outliers means removing the data points that are considered outliers from the dataset. Transforming the outliers means changing the outlier values to make them more consistent with the rest of the data. But how do we check for outliers?
We can check for outliers using methods such as box plot visualization and the z-score:
Example
import seaborn as sns
import matplotlib.pyplot as plt
# Create box plots for each variable
sns.boxplot(x="variable", y="value", data=pd.melt(iris_df))
# Set plot title and labels
plt.title("Box plot of Iris Dataset")
plt.xlabel("Variable")
plt.ylabel("Value")
# Display the plot
plt.show()
This code creates a box plot for each variable (sepal length, sepal width, petal length, and petal width) in the iris dataset. Outliers can be identified as individual data points that fall outside the whiskers of the box plot. You can visually inspect the box plots to identify any outliers.
#removing the outliers using the z-score
import numpy as np
from scipy import stats
iris_df_z = iris_df[(np.abs(stats.zscore(iris_df)) < 3).all(axis=1)]
# verify that the outliers have been removed
iris_df_z.shape
This code handles outliers by computing the z-score of every value and then removing any row that contains a value with an absolute z-score greater than 3. This ensures that extreme outliers are removed from the dataset.
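Dropping rows is not the only option; as noted above, outliers can also be transformed. Here is a hedged sketch of one such transformation, capping values at the interquartile range (IQR) fences, using a small hypothetical Series rather than the iris data:

```python
import pandas as pd

# Hypothetical data with one extreme value
s = pd.Series([10, 12, 11, 13, 12, 95])

# Compute the IQR fences
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap (clip) outliers instead of removing them
capped = s.clip(lower, upper)
```

Capping keeps every row, which matters when the dataset is small and dropping rows would discard useful information.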
Data transformation:
Data transformation is the process of rescaling or re-expressing features so that they are on comparable scales, which many machine learning algorithms require. A common technique is standardization, which rescales each feature to have a mean of 0 and a standard deviation of 1.
Example
#importing the library
from sklearn.preprocessing import StandardScaler
#creating an instance of the StandardScaler
scaler = StandardScaler()
#transforming the data
iris_df_z_transformed = scaler.fit_transform(iris_df_z)
This code performs data transformation using the StandardScaler from the scikit-learn library, which rescales each feature to have a mean of 0 and a standard deviation of 1. The result (a NumPy array) is stored in the variable iris_df_z_transformed.
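The scaler's behavior can be verified on a small toy array; after fitting, each column should have a mean close to 0 and a standard deviation close to 1. This is just a quick sanity check, separate from the tutorial's iris data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: two features on very different scales
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
scaled = StandardScaler().fit_transform(X)

# Each column now has mean ~0 and standard deviation ~1
print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```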
Data merging and deduplication:
Data merging and deduplication in machine learning is the process of combining two or more datasets into one and removing any duplicate data points. This is done to ensure that the data used to build the machine learning model is accurate and complete. Data merging involves combining datasets in a way that preserves the integrity of the data, while deduplication involves identifying and removing any duplicate data points from the dataset.
Example
#importing libraries
import pandas as pd
#loading the datasets
iris_data1 = pd.read_csv('iris_data1.csv')
iris_data2 = pd.read_csv('iris_data2.csv')
#merging the datasets
iris_df = pd.concat([iris_data1, iris_data2], axis=0, ignore_index=True)
#checking the shape of the merged dataset
iris_df.shape
#removing the duplicates
iris_df = iris_df.drop_duplicates()
#verifying that the duplicates have been removed
iris_df.shape
This code performs data merging and deduplication on two datasets, 'iris_data1' and 'iris_data2'. The datasets are first merged into a single dataframe, 'iris_df', using the concat() function. Then, the duplicates are removed from the merged dataset using the drop_duplicates() method. Finally, the shape of the dataset is checked to verify that the duplicates have been successfully removed.
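By default, drop_duplicates() only removes rows that are identical in every column. A sketch with hypothetical data shows the subset and keep parameters, which control which columns are compared and which copy survives:

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 are exact duplicates
df = pd.DataFrame({
    "id": [1, 1, 2],
    "value": [10, 10, 30],
})

# Default: rows must match in every column to count as duplicates
full = df.drop_duplicates()

# subset: compare only the listed columns; keep="last" retains the last copy
by_id = df.drop_duplicates(subset=["id"], keep="last")
```

Comparing on a subset of columns is useful when records share a key but differ in fields such as timestamps.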
Handling categorical variables:
Handling categorical variables is the process of converting categorical data into numerical data. This makes the data more suitable for machine learning algorithms, since most of them work only with numerical input. Common techniques include one-hot encoding, label encoding, and binary encoding.
Example
#adding the target as a column of the dataframe
iris_df['target'] = iris_data.target
#checking the data type of the target
iris_df['target'].dtype
#converting the target to a categorical type
iris_df['target'] = iris_df['target'].astype('category')
#recoding the categories to integer codes
iris_df['target'] = iris_df['target'].cat.codes
#verifying the data type of the target
iris_df['target'].dtype
This code handles a categorical variable in a dataset. The target column is converted to a categorical data type and then recoded to the integer codes 0, 1, and 2 (one per iris species). This is useful for machine learning algorithms that require categorical data to be represented as numerical values.
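The code above is a form of label encoding. One-hot encoding, also mentioned earlier, creates one binary column per category instead; a minimal sketch with pandas get_dummies on a hypothetical color column:

```python
import pandas as pd

# Hypothetical categorical column
df = pd.DataFrame({"color": ["red", "green", "red"]})

# One-hot encoding: one binary column per category
encoded = pd.get_dummies(df, columns=["color"])
```

One-hot encoding avoids implying an order between categories, which label encoding's integer codes can accidentally suggest.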
Best practices and guidelines for data cleaning:
- Assess data quality first, so you know which cleaning steps are actually needed.
- Choose a missing-value strategy that fits the context and the reason the data is missing.
- Inspect outliers before dropping them; not every extreme value is an error.
- Remove duplicates after merging datasets.
- Encode categorical variables in a form your algorithm can consume.
Conclusion
Data cleaning is a critical step in the machine learning process. It includes assessing data quality, handling missing values, handling outliers, transforming data, merging and deduplicating data, and handling categorical variables. By following these best practices and guidelines, we can ensure that our dataset is clean and ready for machine learning algorithms.