Module - 2 Data preparation and EDA

Lesson - 3 Introduction to Exploratory Data Analysis (EDA)

**Overview**

Exploratory Data Analysis (EDA) is the practice of exploring a data set to understand its main characteristics: its patterns, relationships, trends, and outliers. EDA is often the first step in a machine learning project because it shows what the data actually contains and helps determine which algorithms and models are likely to be effective. It can also be used to compare different datasets and to find patterns that inform better models.

**Implementation**

Let's consider the Iris dataset. This dataset contains 150 observations of four numerical variables (sepal length, sepal width, petal length, and petal width), along with a species label for each observation.

Link: https://www.kaggle.com/datasets/saurabh00007/iriscsv

**Load the Data:**

The first step in EDA is to load the data into a data analysis tool, such as Python with Pandas. Make sure the data is in a format that can be analyzed, such as a CSV or Excel file.

```
#Load the Data
import pandas as pd
data = pd.read_csv('iris.csv')
```

Here, we use the pandas library to read in the Iris dataset from a CSV file and store it in a dataframe.
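Once the data is loaded, a quick look at its shape and first rows confirms that it was read as expected. The sketch below is self-contained, so it reads a few illustrative rows from an in-memory string instead of `iris.csv`; the column names used here are assumptions and may differ from those in the downloaded file.

```python
import io
import pandas as pd

# Illustrative rows standing in for iris.csv (column names are assumed)
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)          # (number of rows, number of columns)
print(sample.head())         # first rows of the dataframe
print(list(sample.columns))  # column names as read from the header
```

With the real file, the same three calls can be applied directly to `data`.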

**Check for Missing Values:**

Check whether the data contains any missing values, as missing data can lead to biased or inaccurate results. Missing values can be handled either by removing the affected rows or columns, or by imputing the missing values with a suitable strategy, such as the column mean or median.

```
#Check for Missing Values
data.isnull().sum()
```

Here, we use the isnull() method to flag missing values in the dataset and sum() to count them, which gives the number of missing values for each column.
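The two ways of handling missing values mentioned above can be sketched as follows. The small dataframe and its values are made up for illustration, and median imputation is only one possible strategy.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing entries
toy = pd.DataFrame({
    "sepal_length": [5.1, np.nan, 7.0, 6.3],
    "petal_length": [1.4, 1.4, np.nan, 6.0],
})

# Count missing values per column
missing_counts = toy.isnull().sum()

# Option 1: drop every row that contains a missing value
dropped = toy.dropna()

# Option 2: fill each missing value with the median of its column
imputed = toy.fillna(toy.median())

print(missing_counts)
print(dropped)
print(imputed)
```

Dropping rows is simple but discards data; imputation keeps every row at the cost of introducing estimated values.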

**Understand the Variables:**

Understanding the variables in the dataset is important for identifying potential issues and choosing appropriate analysis techniques. Variables can be numerical, categorical, or ordinal. Numerical variables are continuous or discrete quantities, categorical variables take one of a finite set of values with no inherent order, and ordinal variables are categorical variables whose values have a meaningful order.

```
#Understand the Variables
data.info()
```

Here, we use the info() method to get information about the data, such as the data type of each variable.
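As a rough sketch, pandas can also separate columns by type and represent an ordinal variable explicitly. The `size_class` column below is hypothetical and is not part of the Iris dataset; it is included only to show an ordered categorical.

```python
import pandas as pd

toy = pd.DataFrame({
    "petal_length": [1.4, 4.7, 6.0],                  # numerical
    "species": ["setosa", "versicolor", "virginica"],  # categorical
    "size_class": ["small", "medium", "large"],        # hypothetical ordinal column
})

# Declare the order of the ordinal categories explicitly
toy["size_class"] = pd.Categorical(
    toy["size_class"], categories=["small", "medium", "large"], ordered=True
)

# Split column names by data type
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(other_cols)
print(toy["size_class"].max())  # ordered categoricals support min/max and comparisons
```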

**Analyze the Distribution of the Variables:**

Analyze the distribution of the variables in the dataset to understand the shape of the data, detect outliers, and identify potential issues such as skewness or multimodality. Histograms, density plots, and box plots are useful tools for visualizing the distribution of variables.

```
#Analyze the Distribution of the Variables
import matplotlib.pyplot as plt
data.hist()
plt.show()
```

Here, we use the hist() method to plot histograms for each variable in the dataset, which gives us a visual representation of the distribution of each variable.
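Skewness can also be checked numerically alongside the histograms. The following is a minimal sketch on made-up values, using the pandas `skew()` method: a value near zero suggests a roughly symmetric distribution, and a positive value suggests a longer right tail.

```python
import pandas as pd

# Hypothetical columns: one symmetric, one with a long right tail
toy = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5],
    "right_skewed": [1, 1, 2, 2, 10],
})

# Sample skewness of each numeric column
skewness = toy.skew()
print(skewness)
```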

**Identify Correlations:**

Correlations between variables can help identify relationships and dependencies in the data. Correlations can be measured using Pearson's correlation coefficient for numerical variables, and contingency tables for categorical variables.

```
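#Identify Correlations
import matplotlib.pyplot as plt
import seaborn as sns
# Correlate numeric columns only; text columns such as the species label are skipped
corr = data.corr(numeric_only=True)
sns.heatmap(corr)
plt.show()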
```

Here, we compute Pearson's correlation coefficient for each pair of numerical variables with the corr() method and plot the result with seaborn's heatmap() function.
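To make the two measures above concrete, the sketch below computes Pearson's coefficient for one pair of numerical columns, checks it against the textbook formula, and builds a contingency table for two categorical columns with `pd.crosstab`. The data values, the `petal_size` column, and the 3.0 threshold are all illustrative assumptions.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "petal_length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Pearson's r as computed by pandas
pearson = toy["petal_length"].corr(toy["petal_width"])

# The same value from the definition: covariance over the product of spreads
x = toy["petal_length"] - toy["petal_length"].mean()
y = toy["petal_width"] - toy["petal_width"].mean()
manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

# Hypothetical categorical column derived with an arbitrary threshold
toy["petal_size"] = np.where(toy["petal_length"] > 3.0, "long", "short")

# Contingency table: counts for each combination of the two categories
table = pd.crosstab(toy["species"], toy["petal_size"])

print(pearson, manual)
print(table)
```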

**Visualize Relationships:**

Visualizing relationships between variables can help identify patterns and anomalies in the data. Scatterplots and heatmaps are useful tools for visualizing relationships between numerical variables, while bar charts and stacked bar charts can be used for categorical variables.

```
#Visualize Relationships
sns.pairplot(data)
```

Here, we use the pairplot() method to plot scatterplots for the variables in the dataset, which gives us a visual representation of the relationships between each pair of variables.
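A single relationship can also be drawn by hand, with one colour per category. This is a minimal sketch on made-up values with assumed column names; the `Agg` backend line is there only so the code runs without a display and can be dropped in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

toy = pd.DataFrame({
    "petal_length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

fig, ax = plt.subplots()

# One scatter call per species, so each group gets its own colour and legend entry
for name, group in toy.groupby("species"):
    ax.scatter(group["petal_length"], group["petal_width"], label=name)

ax.set_xlabel("petal_length")
ax.set_ylabel("petal_width")
ax.legend(title="species")
```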

**Identify Anomalies and Outliers:**

Identify anomalies and outliers in the data, as they can lead to biased or inaccurate results. Anomalies and outliers can be identified using statistical methods or by visual inspection of the data.

```
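#Identify Anomalies and Outliers
# Box plots need numeric data, so text columns such as the species label are skipped
for col in data.select_dtypes(include='number').columns:
    plt.boxplot(data[col])
    plt.title(col)
    plt.show()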
```
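As one example of a statistical method, the interquartile range (IQR) rule flags values that lie more than 1.5 times the IQR beyond the quartiles, which is the same convention box plots commonly use for their whiskers. The values below are made up for illustration.

```python
import pandas as pd

# Hypothetical measurements with one unusually large value
values = pd.Series([2.0, 2.2, 2.4, 2.5, 2.6, 2.8, 3.0, 9.0], name="sepal_width")

# First and third quartiles and the interquartile range
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Values outside these fences are treated as potential outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(outliers)
```

A flagged value is only a candidate: whether it is an error or a genuine extreme observation still has to be judged in context.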

**Summarize the Findings:**

Summarize the findings of the EDA in a report or presentation to communicate the key insights and recommendations to stakeholders.

```
#Summarize the Findings
print(data.describe())
```

Here, we use the describe() method to get a summary of the statistical properties of each variable in the dataset, such as the mean, standard deviation, and quartiles.
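When the data contains a categorical column, a per-group summary is often a useful addition to a report. The sketch below assumes a `species` column and uses made-up values; `groupby` with `agg` computes the chosen statistics for each group.

```python
import pandas as pd

toy = pd.DataFrame({
    "petal_length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Mean, minimum, and maximum petal length for each species
summary = toy.groupby("species")["petal_length"].agg(["mean", "min", "max"])
print(summary)
```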

**Conclusion**

Overall, EDA is a critical step in the machine learning pipeline because it helps to identify potential issues in the data and to choose suitable analysis techniques. By conducting EDA, machine learning practitioners can improve the accuracy and reliability of their models.

**Key takeaways**

- Explore the Data: Get a basic understanding of the data by exploring its structure, summary statistics, and visualizations.
- Clean the Data: Handle outliers, missing values, and duplicate data points that could skew the analysis.
- Transform the Data: Transform the data set into a form that is suitable for further analysis.
- Correlation Analysis: Explore the relationship between different variables and draw meaningful conclusions.
- Feature Selection: Identify important features that can be used in the model building process.
- Model Building: Use machine learning algorithms to build predictive models.

**Quiz**

**What is Exploratory Data Analysis (EDA)?**

- a. A method of finding patterns in data
- b. A process of identifying data anomalies
- c. A technique of understanding data
- d. A method of building predictive models

**Answer**: c. A technique of understanding data

**What is the main purpose of EDA?**

- a. To identify outliers
- b. To identify relationships between variables
- c. To predict future outcomes
- d. To identify trends in the data

**Answer**: b. To identify relationships between variables

**Which of the following is not a common approach used in EDA?**

- a. Visualization
- b. Predictive modeling
- c. Data cleansing
- d. Feature engineering

**Answer**: b. Predictive modeling

**What is the goal of EDA?**

- a. To identify correlations between variables
- b. To identify outliers
- c. To build a predictive model
- d. To identify trends in the data

**Answer**: a. To identify correlations between variables


© 2023 AlmaBetter