bytes

tutorials

data science

k nearest neighbors

Module - 6 Non-Linear Model

Lesson - 1 K-Nearest Neighbors(KNN)

**Overview**

K-Nearest Neighbors (KNN) is a supervised machine learning algorithm that is used for both classification and regression. The algorithm is based on the idea that the data points that are closest to a given data point are the most likely to be similar to it. KNN works by finding the k-nearest points in the training data set and then using the labels of those points to predict the label of the given data point. KNN is considered an instance-based learning algorithm, since it stores the training data and makes predictions based on the stored data points.

**Introduction to K-Nearest Neighbors (KNN)**

K-nearest neighbors (kNN) is a supervised machine learning algorithm that can be used to solve both classification and regression tasks. kNN as an algorithm seems to be inspired from real life. People tend to be effected by the people around them.

It works in a similar fashion to humans as our behaviour is guided by the friends we grew up with or from our surroundings. If you grow up with people who love sports, it is highly likely that you will end up loving sports. Our parents also shape our personality in some ways. There are of-course exceptions.

- The value of a data point is determined by the data points around it. If you have one very close friend and spend most of your time with him/her, you will end up sharing similar interests and enjoying same things. That is kNN with k=1.
- If you always hang out with a group of 5, each one in the group has an effect on your behavior and you will end up being the average of 5. That is kNN with k=5.

kNN classifier determines the class of a data point by majority voting principle. If k is set to 5, the classes of 5 closest points are checked. Prediction is done according to the majority class. Similarly, kNN regression takes the mean value of 5 closest points.

**KNN-Algorithm**

Load the data

Initialize K to your chosen number of neighbors’ and normalize the data.

For each example in the data

3.1. Calculate the distance between the query example and the current example from the data.

3.2. Add the distance and the index of the example to an ordered collection

Sort the ordered collection of distances and indices from smallest to largest (in ascending order) by the distances

Pick the first K entries from the sorted collection

Get the labels of the selected K entries

If regression, return the mean of the K labels

If classification, return the mode of the K labels

**Example:**

Suppose we have a dataset which can be plotted as follows −

Now, we need to classify new data point with black dot (at point 60,60) into blue or red class. We are assuming K = 3 i.e. it would find three nearest data points.

**How to select the best value of K**

**Example:**

Let’s say we have two groups of points —blue-circles and orange-triangles. We want to classify the Test Point = black circle with a question mark, as either a blue circle or an orange triangle.

For K = 1 we will look at the first nearest neighbor. Since we take majority vote and there is only 1 voter we assign its label to our black test point. We can see that the test point will be classified as a blue circle for k=1.

**If K is very Large:** The class label of the majority class of the training data set will be assigned to the rest data regardless of the class labels of the neighbours nearest to the test data.

**If K is very Small:** The class value of a noisy data/outlier in the training dataset which is the nearest neighbor to the test data will be assigned to the test data.

The best value of K is somewhere between these to extremes.

**Set K = square root of the number of training records.**

Test several K values on a variety of test data sets and choose the one that gives best performance.

**Distance metrics:**

**Euclidean Distance**

Euclidean Distance is a measure of the straight-line distance between two points in a Euclidean space. It is calculated by taking the square root of the sum of the squared differences between each coordinate of the two points. Mathematically, it is represented by the equation:

**Manhattan Distance**

Manhattan Distance, also known as Taxicab Distance, is a measure of the distance between two points in a rectangular grid. It is calculated by taking the sum of the absolute differences between the two points for each coordinate. Mathematically, it is represented by the equation:

**Feature scaling**

Feature scaling is a process used to normalize the range of independent variables or features of data. In machine learning, it is important to scale the features in the dataset prior to training a model, since the range of values of raw data varies widely. Scaling helps to reduce the time it takes to converge and improve the accuracy of the KNN model. Some popular feature scaling techniques include Min-Max scaling, Standardization, and Normalization.

**Normalization_vs_Standardization**

**Standardization:**This technique re-scales a feature to have a**mean**of 0 and a**standard deviation**of 1. The formula is given by

```
Xscaled = (X−Xmean)/S
```

where X is an original value, Xmean is the mean of all the values in the feature, and S is the standard deviation of all the values in the feature.

**2. Normalization:** This technique re-scales a feature or dataset to have a **range** between 0 and 1. The formula is given by

```
Xscaled = (X−Xmin)/(Xmax−Xmin)
```

where X is an original value, Xmin is the minimum value of X in the feature, and Xmax is the maximum value of X in the feature.

**Advantages and disadvantages of KNN**

**PROS**

- Easy to understand
- No assumptions about data
- Can be applied to both classification and regression
- Works easily on multi-class problems

**CONS**

- Memory Intensive / Computationally expensive
- Sensitive to scale of data
- Struggle when high number of independent variables

**Implementing KNN in Python**

For this example, we will use the classic Iris dataset which contains measurements for 150 flowers from three different species: Setosa, Versicolor, and Virginica. The dataset contains four features: sepal length, sepal width, petal length, and petal width.

Link: https://www.kaggle.com/datasets/uciml/iris

```
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
# Load the dataset
iris_dataset = pd.read_csv('iris.csv')
# Split the dataset into features and labels
X = iris_dataset.iloc[:, :-1].values
y = iris_dataset.iloc[:, -1].values
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Create the KNN classifier
knn = KNeighborsClassifier(n_neighbors=5)
# Train the model using the training sets
knn.fit(X_train, y_train)
# Predict the response for test dataset
y_pred = knn.predict(X_test)
# Evaluate the performance of the model
accuracy = knn.score(X_test, y_test)
print('Accuracy of KNN algorithm: {:.2f}'.format(accuracy))
```

This code implements the K-Nearest Neighbors (KNN) algorithm on the Iris dataset. First, the required libraries are imported. Then, the dataset is loaded and split into features (X) and labels (y). The dataset is then split into a training and test set. The KNN classifier is then initialized and the model is trained using the training set. Finally, the model is used to make predictions on the test set and the accuracy is evaluated.

**Real-world applications of KNN**

**Healthcare**: KNN can be used to predict diseases based on the patient's medical records, such as age, sex, blood pressure, cholesterol, etc.**Finance**: KNN can be used to detect fraudulent transactions and classify customers based on their spending habits.**Marketing**: KNN can be used to recommend products to customers based on their past purchases and preferences.**Security**: KNN can be used for facial recognition and intrusion detection.**Image Processing:**KNN can be used for object recognition and image classification.

**Conclusion**

After using the K Nearest Neighbors machine learning algorithm, the retail store was able to more accurately identify customers who were likely to purchase a particular product based on their past purchasing behavior. This allowed them to better target customers with the right products and offers, leading to increased sales and profits.

**Key takeaways**

- KNN is a supervised learning algorithm used for both classification and regression.
- KNN stores the entire training dataset which it uses to predict the output for any new data point.
- The output is a class membership (predicted target value).
- KNN is a non-parametric and lazy learning algorithm.
- KNN is a distance-based algorithm which uses the distance of a data point from the training data points to classify it.
- KNN performs better if the data is normalized to bring all the features to the same scale.
- KNN works best on small datasets and can be computationally expensive on large datasets.
- KNN is highly affected by outliers and noisy data.

**Quiz**

**1.Which of the following is a disadvantage of KNN?**

- High training time
- Poor generalization
- High complexity
- All of the above

**Answer**: D. All of the Above

**2.Which of the following is a key feature of KNN?**

- Non-parametric learning
- Parametric learning
- Both A and B
- None of the Above

**Answer**: A. Non-parametric learning

**3.Which of the following is an advantage of KNN?**

- Low bias
- High variance
- Low complexity
- All of the Above

**Answer**: D. All of the Above

**4.What type of learning algorithm is KNN?**

- Supervised learning
- Unsupervised learning
- Reinforcement learning
- Both A and B

**Answer**: A. Supervised learning

Made with

in Bengaluru, India - Join AlmaBetter
- Sign Up
- Become A Coach
- Coach Login

- Contact Us
- admissions@almabetter.com
- 08046008400

- Location
- 4th floor, 133/2, Janardhan Towers, Residency Road, Bengaluru, Karnataka, 560025

- Follow Us

© 2022 AlmaBetter