data cleaning in data science

Module - 2 Data preparation and EDA

Lesson - 1 Data cleaning

**Overview**

Data cleaning is the process of preparing a dataset for machine learning algorithms. It includes assessing data quality, handling missing values, handling outliers, transforming data, merging and deduplicating data, and handling categorical variables. This step helps ensure that the data is ready for machine learning algorithms: it reduces the risk of errors and improves the accuracy of the models.

**Data quality assessment:**

Data quality assessment is the process of inspecting a dataset before any cleaning is applied, so that you know which problems need to be fixed. Typical checks include the shape of the dataset (number of rows and columns), the number of missing values in each column, the number of duplicate rows, summary statistics that can reveal outliers, and the balance of the target classes.

**Example**

Let's consider the Iris dataset:

```
#importing libraries
import pandas as pd
from sklearn.datasets import load_iris
#loading the dataset
iris_data = load_iris()
#storing the features in a pandas dataframe
iris_df = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
#checking the shape of the dataset
iris_df.shape
#checking the number of missing values per column
iris_df.isnull().sum()
#checking the number of duplicate rows
iris_df.duplicated().sum()
#checking summary statistics for signs of outliers
iris_df.describe()
#checking for data imbalance (the target is stored separately from the features)
pd.Series(iris_data.target, name='target').value_counts()
```

This code performs a data quality assessment by checking the shape of the dataset, the number of missing values, the number of duplicate rows, summary statistics that can reveal outliers, and the class balance of the target.

**Handling missing values:**

Handling missing values in machine learning is an important preprocessing step that is essential for building accurate and reliable models. Missing values can occur for various reasons, such as data entry errors, sensor failures, or simply because certain data points were not collected.

Here are some common strategies for handling missing values in machine learning:

**Deletion:** The simplest approach to handling missing values is to remove the rows or columns that contain them. This can be done using the following techniques:

- **Listwise deletion:** This involves removing any row that contains a missing value. This approach is easy to implement but can result in significant data loss.
- **Pairwise deletion:** This involves excluding a row only from the calculations that use the variable it is missing, so the row still contributes wherever its values are present. This approach retains more data, but different calculations end up based on different subsets of rows, which can introduce bias in the analysis.

**Imputation:** Another approach to handling missing values is to impute, or estimate, the missing values. Here are some commonly used imputation techniques:

- **Mean/median imputation:** This involves replacing the missing values with the mean or median of the non-missing values for that variable. This approach is simple to implement, but the mean is sensitive to skewed data and outliers (the median is more robust), and both understate the variability of the variable.
- **Mode imputation:** This involves replacing the missing values with the mode (most frequent value) of the non-missing values for that variable. This approach is suitable for categorical variables.
- **Regression imputation:** This involves using a regression model to predict the missing values based on the values of other variables. This approach is more sophisticated than mean/median imputation and can produce more accurate estimates, but it requires more computational resources.
- **Multiple imputation:** This involves creating several imputed datasets based on different imputation models and then combining the results to obtain a single estimate. This approach can produce more accurate results than single imputation methods.

**Encoding missing values:** Sometimes, missing values themselves can be informative, and removing or imputing them can lead to biased results. In such cases, it may be preferable to encode missing values as a separate category or with a special value such as -999 or NaN, depending on the programming language or environment used. This approach allows the missing values to be retained in the dataset and can be useful in some scenarios.

Ultimately, the choice of how to handle missing values depends on the particular context and the nature of the missing values. It is important to weigh the advantages and drawbacks of each approach and to select the one that is most suitable for the problem at hand.

**Example**

```
#checking for the shape of the dataset
iris_df.shape
#checking for number of missing values
iris_df.isnull().sum()
#replacing the missing values with the median of the feature
iris_df = iris_df.fillna(iris_df.median())
#verifying that there is no missing data
iris_df.isnull().sum()
```

This code handles missing values by first checking the number of missing values, then replacing any missing values with the median of each feature, and finally verifying that no missing data remains. The Iris dataset loaded from scikit-learn contains no missing values, so here fillna() leaves the data unchanged.
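Because the Iris data has no gaps, the strategies listed above are easier to see on a small table that does. The following is a minimal sketch, assuming a hypothetical toy DataFrame (the column names `age` and `city` and their values are made up for illustration); it shows listwise deletion, a missing-value indicator, mean imputation and mode imputation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in a numeric and a categorical column
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0, np.nan],
    "city": ["Pune", "Delhi", None, "Delhi", "Delhi"],
})

# Listwise deletion: drop every row that has at least one missing value
listwise = df.dropna()

# Keep a record of where "age" was missing before it is imputed
imputed = df.copy()
imputed["age_missing"] = imputed["age"].isna()

# Mean imputation for the numeric column
imputed["age"] = imputed["age"].fillna(imputed["age"].mean())

# Mode imputation for the categorical column
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])

print(listwise)
print(imputed)
```

Listwise deletion keeps only two of the five rows here, which illustrates the data loss mentioned above, while imputation keeps all five.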

**Handling outliers:**

Outliers are data points that are significantly different from the rest of the data. Handling outliers in machine learning is the process of identifying and treating outliers in the dataset. This can be done by either dropping the outliers or transforming them. Dropping the outliers means removing the data points that are considered outliers from the dataset. Transforming the outliers means changing the outlier values to make them more consistent with the rest of the data.

How do we check for outliers? We can use the following methods:

**Visual inspection:** One of the simplest and most effective ways to identify outliers is to visually inspect the data using plots such as box plots, histograms, and scatterplots. Outliers can often be recognized as data points that lie far away from the bulk of the data. Box plots are particularly useful for identifying outliers as they show the distribution of the data and highlight any extreme values.

**Statistical methods:** There are several statistical methods for identifying outliers, such as the Z-score and the interquartile range (IQR) method. The Z-score method involves calculating the mean and standard deviation of the data and flagging any data points that lie more than a certain number of standard deviations (commonly 3) from the mean. The IQR method involves calculating the difference between the 75th and 25th percentiles of the data (i.e., the IQR) and flagging any data points that lie more than a certain multiple of the IQR (commonly 1.5) below the 25th percentile or above the 75th percentile.

**Example**

```
import seaborn as sns
import matplotlib.pyplot as plt
# Create box plots for each variable
sns.boxplot(x="variable", y="value", data=pd.melt(iris_df))
# Set plot title and labels
plt.title("Box plot of Iris Dataset")
plt.xlabel("Variable")
plt.ylabel("Value")
# Display the plot
plt.show()
```

This code creates a box plot for each variable (sepal length, sepal width, petal length, and petal width) in the iris dataset. Outliers can be identified as individual data points that fall outside the whiskers of the box plot. You can visually inspect the box plots to identify any outliers.

```
#removing the outliers using z-score
import numpy as np
from scipy import stats
#keeping only the rows where every feature has an absolute z-score below 3
iris_df_z = iris_df[(np.abs(stats.zscore(iris_df)) < 3).all(axis=1)]
# verify that the outliers have been removed
iris_df_z.shape
```

This code handles outliers by calculating the z-score of every value in the dataset and then keeping only the rows in which every feature has an absolute z-score below 3. Any row with a value 3 or more standard deviations from its feature's mean is removed from the dataset.
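The IQR method described earlier can be sketched in a similar way. This is a minimal illustration, assuming a hypothetical series of measurements and the conventional multiplier of 1.5 (neither comes from the Iris example); it shows both dropping the flagged points and transforming them by capping at the fences.

```python
import pandas as pd

# Hypothetical measurements; 95 is an obvious extreme value
values = pd.Series([10, 12, 11, 13, 12, 14, 11, 95], name="measurement")

# Interquartile range: 75th percentile minus 25th percentile
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fences: 1.5 * IQR beyond the quartiles (1.5 is a convention, not a rule)
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag points outside the fences
is_outlier = (values < lower) | (values > upper)
print(values[is_outlier])

# Option 1: drop the flagged points
dropped = values[~is_outlier]

# Option 2: transform them by capping at the fences
capped = values.clip(lower=lower, upper=upper)
```

Capping keeps the row count unchanged, which can matter when the other columns of an outlying row are still useful.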

**Data transformation:**

- Data transformation in machine learning is the process of converting data into a form that is suitable for use in a machine learning algorithm.
- It can involve removing noise, removing duplicates, imputing missing values, encoding categorical variables, and scaling numeric variables.
- It is an important preprocessing step before the data is used to train a machine learning model. Scaling is one common transformation.

Data scaling or normalization is the process of transforming data to a standard scale or range to improve the performance and accuracy of machine learning models. Here are some reasons why data scaling is important:

- Many machine learning algorithms are sensitive to the scale of the input data. For example, distance-based algorithms such as K-Nearest Neighbors and clustering algorithms can be heavily influenced by differences in scale, leading to incorrect predictions or clustering.
- Data scaling can improve the convergence rate of some optimization algorithms, such as gradient descent, leading to faster training and better model performance.
- Regularization techniques such as L1 and L2 penalize the size of the model coefficients, and that size depends on the scale of each feature, so they work as intended only when the features are on comparable scales.

Here are some common methods for scaling data in Python:

- **Min-Max Scaling:** This method scales the data to a fixed range, usually between 0 and 1.
- **Standardization:** This method scales the data to have a mean of 0 and standard deviation of 1.
- **Robust Scaling:** This method scales the data to have a median of 0 and interquartile range (IQR) of 1.

**Example**

```
#importing the library
from sklearn.preprocessing import StandardScaler
#creating an instance of the StandardScaler
scaler = StandardScaler()
#transforming the data
iris_df_z_transformed = scaler.fit_transform(iris_df_z)
```

This code performs data transformation using the StandardScaler from the scikit-learn library, which scales each feature to have a mean of 0 and a standard deviation of 1. The result, stored in the variable iris_df_z_transformed, is a NumPy array rather than a DataFrame.
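The other two scaling methods listed above can be sketched the same way. This is a minimal example, assuming a hypothetical single feature with one very large value; it uses scikit-learn's `MinMaxScaler` and `RobustScaler` with their default settings.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Hypothetical single feature with one very large value
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Min-max scaling: maps the smallest value to 0 and the largest to 1
X_minmax = MinMaxScaler().fit_transform(X)

# Robust scaling: subtracts the median and divides by the IQR
X_robust = RobustScaler().fit_transform(X)

print(X_minmax.ravel())
print(X_robust.ravel())
```

With min-max scaling the large value squeezes the other four points close to 0, whereas robust scaling keeps them spread around 0 because the median and IQR are barely affected by a single extreme value.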

**Data merging and deduplication:**

Data merging and deduplication in machine learning is the process of combining two or more datasets into one and removing any duplicate data points. This is done to ensure that the data used to build the machine learning model is accurate and complete. Data merging involves combining datasets in a way that preserves the integrity of the data, while deduplication involves identifying and removing any duplicate data points from the dataset.

**Example**

```
#importing libraries
import pandas as pd
#loading the datasets
iris_data1 = pd.read_csv('iris_data1.csv')
iris_data2 = pd.read_csv('iris_data2.csv')
#merging the datasets
iris_df = pd.concat([iris_data1, iris_data2], axis=0, ignore_index=True)
#checking the shape of the merged dataset
iris_df.shape
#removing the duplicates
iris_df = iris_df.drop_duplicates()
#verifying that the duplicates have been removed
iris_df.shape
```

This code performs data merging and deduplication on two datasets, 'iris_data1' and 'iris_data2', which are assumed to be CSV files with the same columns. The datasets are first stacked into a single dataframe, 'iris_df', using the concat() function. Then, the duplicate rows are removed from the merged dataset using the drop_duplicates() method. Finally, the shape of the dataset is checked again: comparing it with the shape before deduplication shows how many duplicate rows were removed.
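Stacking rows with concat() suits datasets that share the same columns. When two datasets instead describe the same records with different columns, they are usually joined on a shared key. The following is a minimal sketch, assuming two hypothetical in-memory tables linked by a made-up `sample_id` column.

```python
import pandas as pd

# Hypothetical tables that share a "sample_id" key
measurements = pd.DataFrame({
    "sample_id": [1, 2, 2, 3],
    "petal_length": [1.4, 4.7, 4.7, 6.0],
})
labels = pd.DataFrame({
    "sample_id": [1, 2, 3],
    "species": ["setosa", "versicolor", "virginica"],
})

# Remove exact duplicate rows before joining
measurements = measurements.drop_duplicates()

# Join on the shared key; validate raises an error if the key
# is not unique on both sides, which would silently multiply rows
merged = measurements.merge(
    labels, on="sample_id", how="left", validate="one_to_one"
)
print(merged)
```

Deduplicating before the join, and asking merge() to validate the key, is one way to preserve the integrity of the data during merging.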

**Handling categorical variables:**

Handling categorical variables is the process of converting categorical data into numerical data. This makes the data more suitable for machine learning algorithms, since most of them work with numerical inputs. Common methods include one-hot encoding, label encoding, and binary encoding.

**Example**

```
#building a categorical target column from the species names
species = pd.Series(iris_data.target_names[iris_data.target], name='species')
#checking the data type of the target
species.dtype
#converting the target to categorical
species = species.astype('category')
#recoding the categories as integer codes (0, 1 and 2)
species_codes = species.cat.codes
#verifying the data type of the recoded target
species_codes.dtype
```

This code handles a categorical variable in a dataset. The species names are stored as text, converted to the pandas categorical data type, and then recoded as the integer codes 0, 1 and 2, one for each of the three species. This is useful for machine learning algorithms that require categorical data to be represented as numerical values.
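One-hot encoding and label encoding, mentioned above, can be sketched as follows. This is a minimal example, assuming a hypothetical `species` column; it uses `pandas.get_dummies` for one-hot encoding and scikit-learn's `LabelEncoder` for label encoding.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical column
df = pd.DataFrame({"species": ["setosa", "virginica", "versicolor", "setosa"]})

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(df["species"], prefix="species")

# Label encoding: each category becomes an integer,
# assigned in sorted order of the category names
encoder = LabelEncoder()
codes = encoder.fit_transform(df["species"])

print(one_hot)
print(codes, encoder.classes_)
```

Integer codes imply an order between categories, so one-hot encoding is often the safer choice for input features that have no natural order, while integer codes are a common fit for the target column.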

**Best practices and guidelines for data cleaning:**

- Check for any missing data and handle it appropriately.
- Check for outliers and handle them appropriately.
- Check for any data imbalance and handle it appropriately.
- Convert any categorical variables to numerical values.
- Normalize or standardize the features of the dataset.
- Split the dataset into training, validation and test datasets.
- Remove any redundant features from the dataset.
- Ensure the dataset is in the correct format for the machine learning algorithm.
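
The splitting step in the list above can be sketched with scikit-learn's `train_test_split`. This is a minimal illustration, assuming a 60/20/20 split and a fixed random seed, both chosen only for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load features and target as NumPy arrays
X, y = load_iris(return_X_y=True)

# Hold out 20% of the data as the test set, keeping class proportions
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Split the remaining 80% into training and validation sets
# (0.25 of the remainder is 20% of the full dataset)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```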

**Conclusion**

Data cleaning is a critical step in the machine learning process. It includes assessing data quality, handling missing values, handling outliers, transforming data, merging and deduplicating data, and handling categorical variables. By implementing these best practices and guidelines, we can ensure that our dataset is clean and ready for machine learning algorithms.

**Key takeaways**

- Ensure that your dataset has enough data points and that missing values have been handled.
- Check for outliers and eliminate them using z-score or other methods.
- Perform data transformation to normalize the data and bring it to a common scale.
- Merge and deduplicate data from different sources to create a comprehensive dataset.
- Handle categorical variables by converting them to numerical values.

**Quiz**

**What is the purpose of data cleaning?**

- a. To identify and remove errors in data
- b. To provide more accurate data
- c. To make the data easier to understand
- d. To reduce the size of the data

**Answer**: a. To identify and remove errors in data

**What is the best way to handle missing values in data?**

- a. Delete the rows with missing values
- b. Replace the missing values with the mean
- c. Replace the missing values with the median
- d. Replace the missing values with zero

**Answer**: c. Replace the missing values with the median

**What is the best way to handle outliers in data?**

- a. Delete the rows with outliers
- b. Replace the outliers with the mean
- c. Replace the outliers with the median
- d. Remove the outliers using z-score

**Answer**: d. Remove the outliers using z-score

**Which of the following is not a best practice for data cleaning?**

- a. Remove duplicates
- b. Transform data
- c. Add more data
- d. Replace missing values with the mean

**Answer**: c. Add more data

© 2022 AlmaBetter