Data cleaning is the process of preparing a dataset for machine learning algorithms. It includes assessing data quality, handling missing values, handling outliers, transforming data, merging and deduplicating data, and handling categorical variables. This process is required to make sure the data is ready for machine learning algorithms, because it reduces the risk of errors and improves the accuracy of the models.
Data quality assessment:
Data quality assessment in machine learning is the process of inspecting a dataset before it is used to build a model. Typical checks look at the shape of the dataset, the number of missing values, duplicate rows, possible outliers, and class imbalance. This is done to make sure the data used to build the machine learning models is accurate and complete.
Let's consider the iris dataset:
# importing libraries
import pandas as pd
from sklearn.datasets import load_iris

# loading the dataset
iris_data = load_iris()

# storing the data in a pandas DataFrame
iris_df = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)

# checking the shape of the dataset
print(iris_df.shape)

# checking the number of missing values per column
print(iris_df.isnull().sum())

# checking the number of duplicate rows
print(iris_df.duplicated().sum())

# summary statistics, a first look for possible outliers
print(iris_df.describe())

# checking for class imbalance (the labels live in iris_data.target, not in iris_df)
print(pd.Series(iris_data.target).value_counts())
This code performs a data quality assessment by checking the shape of the dataset, the number of missing values, the number of duplicate rows, summary statistics that can hint at outliers, and class imbalance in the target labels.
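The same checks can be bundled into one helper so they are easy to repeat on any dataset. The sketch below is illustrative only: the `quality_report` function, the `toy_df` frame, and its column names are hypothetical and not part of the iris example. The toy data deliberately contains one missing value and one duplicate row so that each check has something to report.

```python
import numpy as np
import pandas as pd


def quality_report(df: pd.DataFrame, label_col: str) -> dict:
    """Collect a few basic data quality indicators (hypothetical helper)."""
    return {
        "shape": df.shape,
        "missing_per_column": df.isnull().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "class_counts": df[label_col].value_counts().to_dict(),
    }


# Small made-up dataset: one NaN in "length", and row 3 repeats row 0
toy_df = pd.DataFrame({
    "length": [5.1, 4.9, np.nan, 5.1, 6.3],
    "width": [3.5, 3.0, 3.2, 3.5, 2.9],
    "label": ["a", "a", "b", "a", "b"],
})

report = quality_report(toy_df, label_col="label")
print(report)
```

A report like this only surfaces problems. Deciding what to do about them is the subject of the following sections.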
Handling missing values:
Handling missing values in machine learning is an important preprocessing step that is essential for building accurate and reliable models. Missing values can occur for various reasons, such as data entry errors, sensor failures, or simply because certain data points were not collected.
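Common strategies for handling missing values in machine learning include dropping the affected rows or columns and imputing the gaps with a statistic such as the mean, median, or mode of the feature.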
Ultimately, the choice of how to handle missing values depends on the particular context and the nature of the missing values. It is important to weigh the advantages and drawbacks of each approach and to select the one that is most suitable for the problem at hand.
# checking the shape of the dataset
print(iris_df.shape)

# checking the number of missing values
print(iris_df.isnull().sum())

# replacing the missing values with the median of each feature
iris_df = iris_df.fillna(iris_df.median())

# verifying that there is no missing data
print(iris_df.isnull().sum())
This code handles missing values by first counting them, then replacing them with the median of each feature, and finally verifying that no missing data remains.
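The iris data loaded from scikit-learn contains no missing values, so the effect of imputation is easier to see on a small frame that does have gaps. The following is a minimal sketch with made-up numbers (the `toy_df` frame is hypothetical) comparing two options: dropping incomplete rows and filling gaps with the column median.

```python
import numpy as np
import pandas as pd

# Made-up measurements with one gap in each column
toy_df = pd.DataFrame({
    "sepal_length": [5.0, np.nan, 6.0, 7.0],
    "sepal_width": [3.0, 3.5, np.nan, 2.5],
})

# Option 1: drop every row that contains a missing value
dropped = toy_df.dropna()

# Option 2: fill each gap with the median of its own column
imputed = toy_df.fillna(toy_df.median())

print(dropped.shape)   # only the two complete rows remain
print(imputed)
```

Dropping is simple but here it discards half of the rows, while median imputation keeps every row at the cost of inventing plausible values.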
Handling outliers:
Outliers are data points that differ significantly from the rest of the data. Handling outliers in machine learning is the process of identifying and treating them. This can be done by either dropping the outliers or transforming them. Dropping means removing the outlying data points from the dataset. Transforming means changing the outlier values to make them more consistent with the rest of the data. But how do we check for outliers?
We can check for outliers with methods such as box plots and z-scores:
import seaborn as sns
import matplotlib.pyplot as plt

# reshaping to long format so that each feature gets its own box
sns.boxplot(x="variable", y="value", data=pd.melt(iris_df))
plt.title("Box plot of Iris Dataset")
plt.xlabel("Variable")
plt.ylabel("Value")
plt.show()
This code creates a box plot for each variable (sepal length, sepal width, petal length, and petal width) in the iris dataset. Outliers can be identified as individual data points that fall outside the whiskers of the box plot. You can visually inspect the box plots to identify any outliers.
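The points drawn beyond the whiskers can also be found numerically. By default, box plot whiskers reach at most 1.5 times the interquartile range (IQR) beyond the quartiles, so the same rule can be applied directly. This is a minimal sketch on a made-up series, not on the iris data, and the variable names are illustrative.

```python
import pandas as pd

# Made-up values with one obviously extreme point
values = pd.Series([10, 12, 11, 13, 12, 11, 40], name="value")

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles, the usual box plot convention
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
print(outliers.tolist())
```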
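# removing the outliers using z-score
import numpy as np
from scipy import stats

# keeping only the rows where every feature has an absolute z-score below 3
iris_df_z = iris_df[(np.abs(stats.zscore(iris_df)) < 3).all(axis=1)]

# verifying that the outliers have been removed
print(iris_df_z.shape)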
This code handles outliers by calculating the z-score of every value and keeping only the rows in which all features have an absolute z-score below 3. Rows with a value 3 or more standard deviations away from the feature mean are removed from the dataset.
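Dropping rows is one treatment. The other option mentioned earlier, transforming outliers, can be sketched by clipping extreme values to a chosen range. The example below assumes IQR-based limits, which is only one possible choice, and uses made-up numbers rather than the iris data.

```python
import pandas as pd

# Made-up values with one extreme point
values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 40.0])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Values outside the limits are pulled back to the nearest limit
clipped = values.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
print(clipped.tolist())
```

Clipping keeps the number of rows unchanged, which can matter when the dataset is small.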
Data transformation:
# importing the library
from sklearn.preprocessing import StandardScaler

# creating an instance of the StandardScaler
scaler = StandardScaler()

# transforming the data
iris_df_z_transformed = scaler.fit_transform(iris_df_z)
This code performs data transformation using the StandardScaler from the scikit-learn library. The StandardScaler scales each feature to have a mean of 0 and a standard deviation of 1. The result, a NumPy array, is stored in the variable iris_df_z_transformed.
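Standardization itself is a short formula: subtract the feature mean and divide by the feature standard deviation, z = (x - mean) / std. The sketch below reproduces that calculation with NumPy on a made-up array, using the population standard deviation (`ddof=0`), which is also what StandardScaler uses.

```python
import numpy as np

# Made-up data: two features on very different scales
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

mean = X.mean(axis=0)
std = X.std(axis=0)  # population standard deviation (ddof=0)

X_scaled = (X - mean) / std

print(X_scaled.mean(axis=0))  # close to 0 for each feature
print(X_scaled.std(axis=0))   # close to 1 for each feature
```

After scaling, both columns hold the same values even though the raw features differed by a factor of ten, which is the point of putting features on a common scale.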
Data merging and deduplication:
Data merging and deduplication in machine learning is the process of combining two or more datasets into one and removing any duplicate data points. This is done to ensure that the data used to build the machine learning model is accurate and complete. Data merging involves combining datasets in a way that preserves the integrity of the data, while deduplication involves identifying and removing any duplicate data points from the dataset.
# importing libraries
import pandas as pd

# loading the datasets
iris_data1 = pd.read_csv('iris_data1.csv')
iris_data2 = pd.read_csv('iris_data2.csv')

# merging the datasets
iris_df = pd.concat([iris_data1, iris_data2], axis=0, ignore_index=True)

# checking the shape of the merged dataset
print(iris_df.shape)

# removing the duplicates
iris_df = iris_df.drop_duplicates()

# verifying that the duplicates have been removed
print(iris_df.shape)
This code is used to perform data merging and deduplication on two datasets, 'iris_data1' and 'iris_data2'. The datasets are first merged into a single dataframe, 'iris_df', using the concat() method. Then, the duplicates are removed from the merged dataset using the drop_duplicates() method. Finally, the shape of the dataset is verified to ensure that the duplicates have been successfully removed.
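Since the two CSV files may not be at hand, the same steps can be tried on small in-memory frames. This is a minimal sketch with made-up rows. The frames `part1` and `part2` are hypothetical stand-ins for the two files and share one identical row.

```python
import pandas as pd

# Two made-up partial datasets with one row in common
part1 = pd.DataFrame({
    "sepal_length": [5.1, 4.9],
    "species": ["setosa", "setosa"],
})
part2 = pd.DataFrame({
    "sepal_length": [4.9, 6.3],
    "species": ["setosa", "virginica"],
})

# Stack the rows and renumber the index
merged = pd.concat([part1, part2], axis=0, ignore_index=True)

# Keep the first occurrence of each row, then renumber again
deduplicated = merged.drop_duplicates().reset_index(drop=True)

print(merged.shape)
print(deduplicated.shape)
```

Comparing the two shapes shows how many rows were removed as duplicates, here exactly one.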
Handling categorical variables:
Handling categorical variables is the process of converting categorical data into numerical data. This makes the data more suitable for machine learning algorithms, since most of them work with numerical input. Common methods include one-hot encoding, label encoding, and binary encoding.
# storing the target in a pandas Series (iris_data itself is not a DataFrame)
target = pd.Series(iris_data.target, name='target')

# checking the data type of the target
print(target.dtype)

# converting the target to categorical
target = target.astype('category')

# recoding the categories as integer codes (0, 1, 2)
target = target.cat.codes

# verifying the data type of the target
print(target.dtype)
This code handles a categorical variable. The target is converted from a numerical data type to a categorical data type and then recoded as the integer codes 0, 1, and 2, one for each iris species. This is useful for machine learning algorithms that require categorical data to be represented as numerical values.
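To make the difference between two of the encodings concrete, the sketch below applies label encoding and one-hot encoding to a short made-up series of species names. The `species` series is illustrative and not taken from the code above.

```python
import pandas as pd

# Made-up categorical column
species = pd.Series(
    ["setosa", "versicolor", "virginica", "setosa"], name="species"
)

# Label encoding: one integer code per category (categories sorted alphabetically)
label_encoded = species.astype("category").cat.codes

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(species, prefix="species")

print(label_encoded.tolist())
print(one_hot)
```

Label encoding is compact but implies an order between categories, while one-hot encoding avoids that at the cost of extra columns.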
Best practices and guidelines for data cleaning:
Data cleaning is a critical step in the machine learning process. It includes assessing data quality, handling missing values, handling outliers, transforming data, merging and deduplicating data, and handling categorical variables. By following these best practices and guidelines, we can ensure that our dataset is clean and ready for machine learning algorithms.
Answer: a. To identify and remove errors in data
Answer: c. Replace the missing values with the median
Answer: d. Remove the outliers using z-score
Answer: c. Add more data