
KNN Algorithm in Python: Implementation with Examples


Harshini Bhat

Data Science Consultant at almaBetter


Published on 26 Sep, 2023

The K-nearest neighbors (KNN) algorithm stands as one of the foundational pillars of Machine Learning. Its simplicity, interpretability, and remarkable versatility have propelled it to popularity in the field. By drawing on the principle that "birds of a feather flock together," KNN offers a straightforward yet powerful approach to classification and regression tasks.

In the real-world landscape of data analysis, the k-nearest neighbors algorithm has found applications in diverse domains, from healthcare and finance to recommendation systems and image recognition. Its ability to make predictions based on the similarity between data points makes it a valuable tool for tackling complex problems.

In this article, we will go through the inner workings of the KNN algorithm and demonstrate its practical implementation in Python. Understanding the KNN classifier's mechanics and harnessing its capabilities in Python will equip you with a valuable skill set for a wide array of machine learning challenges.

Understanding the K-Nearest Neighbors Algorithm

K-Nearest Neighbors (KNN) is a straightforward and intuitive machine learning algorithm used for both classification and regression tasks. At its core, KNN relies on the principle that similar data points tend to be close to each other in the feature space. Let's delve into the fundamental concepts of KNN before implementing it in Python:

What is KNN?

KNN is a supervised learning algorithm that makes predictions based on the majority class or average value of its "K" nearest data points in the training dataset. It is a non-parametric algorithm, meaning it doesn't make any assumptions about the underlying data distribution.

How does it work?

1. Data Points and Features: KNN operates on a dataset consisting of data points, each with multiple features or attributes. These data points can be thought of as points in a multidimensional space, where each feature represents a dimension.

2. Choosing K: The first step is to choose the value of "K," which represents the number of nearest neighbors to consider when making predictions. Typically, "K" is an odd number to avoid ties when classifying data points.

3. Distance Calculation: KNN relies on distance metrics to measure the similarity or dissimilarity between data points. Common distance metrics include Euclidean distance, Manhattan distance, and cosine similarity.

Euclidean distance, the most commonly used metric, is calculated as follows for two points, p and q, with n features:

d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)

4. Finding the Neighbors: For a given data point (the one we want to make a prediction for), KNN calculates the distance between this point and all other points in the training dataset.

5. Selecting K Nearest Neighbors: KNN then selects the "K" data points with the smallest distances to the point being predicted. These "K" data points become the nearest neighbors.

  • Classification (for Classification Tasks):

In classification tasks, KNN counts the class labels of the "K" nearest neighbors.

The class label that occurs most frequently among these neighbors is assigned to the new data point.

Example: If, out of K = 7 neighbors, 5 belong to Class A and 2 belong to Class B, the prediction for the new data point is Class A.

  • Regression (for Regression Tasks):

In regression tasks, KNN calculates the average (or weighted average) of the target values of the "K" nearest neighbors. This average is assigned as the prediction for the new data point.

Example: If the target values of the "K" neighbors are [10, 12, 15, 8, 9], the prediction for the new data point is the average of these values, which is 10.8.
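As a quick check of these two rules, the illustrative snippet below hard-codes the labels and target values of five hypothetical nearest neighbors and computes the majority class and the average prediction (the classification labels are made up; the regression values repeat the example above):

import numpy as np
from collections import Counter

# Hypothetical class labels of the K = 5 nearest neighbors (classification)
neighbor_labels = ["A", "A", "B", "A", "B"]
majority_class = Counter(neighbor_labels).most_common(1)[0][0]
print(majority_class)  # -> "A", the predicted class

# Target values of the K = 5 nearest neighbors from the regression example above
neighbor_targets = np.array([10, 12, 15, 8, 9])
print(neighbor_targets.mean())  # -> 10.8, the predicted value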


The Concept of "K" (Number of Neighbors):

In KNN, "K" represents the number of nearby data points considered when making predictions.

Choosing "K" is crucial:

  • Smaller "K" (e.g., 1 or 3) leads to a sensitive model, capturing fine details but susceptible to noise and outliers.
  • Larger "K" (e.g., 10 or 20) results in a smoother, more stable model, but may overlook local variations.

The optimal "K" depends on the dataset and problem; it's often determined through techniques like cross-validation.

Selecting the right "K" balances sensitivity and robustness, impacting KNN's accuracy and adaptability.

Distance Metrics (e.g., Euclidean Distance) Used in KNN:

Purpose: KNN employs distance metrics to gauge the similarity or dissimilarity between data points.

Common Metrics: Some common distance metrics in KNN are:

  • Euclidean Distance: This is the most widely used metric in KNN. It measures the straight-line distance between two points in a multidimensional space. For two points p and q with two features, it is d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2).

  • Manhattan Distance: This metric calculates the distance by summing the absolute differences between the corresponding feature values. It is often used when movement can only occur along gridlines (e.g., in a city block grid).

  • Cosine Similarity: Rather than measuring distance, cosine similarity quantifies the cosine of the angle between two vectors. It's particularly useful when considering the orientation of data points in a high-dimensional space, such as in text analysis.

Euclidean distance: d(p, q) = sqrt((p1 - q1)^2 + ... + (pn - qn)^2)

Manhattan distance: d(p, q) = |p1 - q1| + ... + |pn - qn|

These distance metrics help KNN determine the proximity of data points, aiding in neighbor selection and, subsequently, prediction. The choice of distance metric can impact the algorithm's performance, making it an essential consideration when implementing KNN.
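To make these metrics concrete, here is a small sketch that computes all three measures for two arbitrary example vectors using SciPy's distance functions (the vectors themselves are just placeholders):

import numpy as np
from scipy.spatial import distance

p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 0.0, 3.0])

print(distance.euclidean(p, q))    # straight-line (Euclidean) distance
print(distance.cityblock(p, q))    # Manhattan / city-block distance
print(1 - distance.cosine(p, q))   # cosine similarity (1 minus cosine distance)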

Pros and Cons of the KNN Algorithm:

Pros of the KNN Algorithm:

  • Simplicity: KNN is easy to understand and implement, making it an excellent choice for beginners.
  • Versatility: It can be used for classification and regression tasks and adapts well to various types of data.
  • No training phase: KNN doesn't require a separate training phase, as it stores the entire dataset for predictions.

Cons of the KNN Algorithm:

  • Computational Complexity: As the dataset grows, the computational cost of KNN increases significantly because it involves calculating distances between data points.
  • Sensitivity to Outliers: KNN can be sensitive to outliers in the data, leading to suboptimal performance.
  • Curse of Dimensionality: In high-dimensional spaces, KNN may struggle to find meaningful neighbors due to the "curse of dimensionality."

Understanding these fundamental concepts and the pros and cons of KNN is crucial for effectively implementing and using the algorithm in real-world machine learning tasks.

Complete Implementation of the KNN Algorithm in Python

We'll use Python and the scikit-learn library for this implementation. Here is a basic KNN classifier built step by step in Python.

Step 1: Loading the Dataset

from sklearn import datasets
from sklearn.model_selection import train_test_split

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data  # Features
y = iris.target  # Target variable (classes)

# Split the dataset into a training set and a testing set (70% training, 30% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Step 2: Choosing the Value of K

# You can experiment with different values of K to find the optimal one through cross-validation.
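As one possible way to do this, the sketch below compares a few candidate values of K with 5-fold cross-validation on the training split from Step 1 (the candidate list and the use of scikit-learn's KNeighborsClassifier here are illustrative choices, not the only option):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Compare a few odd values of K using 5-fold cross-validation on the training data
for candidate_k in [1, 3, 5, 7, 9, 11]:
    knn = KNeighborsClassifier(n_neighbors=candidate_k)
    scores = cross_val_score(knn, X_train, y_train, cv=5)
    print(f"K = {candidate_k}: mean CV accuracy = {scores.mean():.3f}")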

Step 3: Calculating Distances

from sklearn.metrics import pairwise_distances

# For simplicity, we'll use Euclidean distance as the distance metric
distances = pairwise_distances(X_test, X_train, metric='euclidean')

Step 4: Finding the K-Nearest Neighbors

import numpy as np

def find_k_nearest_neighbors(distances, k):
    # Find the indices of the K-nearest neighbors for each test sample
    k_nearest_neighbors_indices = np.argsort(distances, axis=1)[:, :k]
    return k_nearest_neighbors_indices

k = 5  # Choose the value of K
k_nearest_neighbors_indices = find_k_nearest_neighbors(distances, k)

Step 5: Making Predictions

from scipy.stats import mode

def predict_majority_class(k_nearest_neighbors_indices, y_train):
    # Get the classes of the K-nearest neighbors
    k_nearest_classes = y_train[k_nearest_neighbors_indices]
    
    # Predict the class based on the majority class among the neighbors
    predictions, _ = mode(k_nearest_classes, axis=1)
    return predictions.ravel()

y_pred = predict_majority_class(k_nearest_neighbors_indices, y_train)

Step 6: Evaluating the Model's Performance

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
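For comparison, the same classifier can be built much more concisely with scikit-learn's built-in KNeighborsClassifier; here is a minimal sketch reusing the split from Step 1, whose results should closely match the from-scratch steps above:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn.fit(X_train, y_train)
y_pred_sklearn = knn.predict(X_test)
print("Accuracy (scikit-learn):", accuracy_score(y_test, y_pred_sklearn))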

Tuning and Optimizing the KNN Model

Utilizing optimization techniques in the K-Nearest Neighbors (KNN) algorithm is essential as it enhances model performance, mitigates overfitting, ensures robustness across datasets, handles complex data patterns, addresses imbalanced data, promotes generalization, improves model interpretability, and optimizes computational resources. These techniques help fine-tune KNN's hyperparameters, distance metrics, and data preprocessing steps, enabling it to adapt effectively to diverse data scenarios and deliver more accurate and reliable predictions.

Fine-tuning and optimizing the K-Nearest Neighbors (KNN) algorithm involves several key considerations and techniques to ensure it performs well in various scenarios. Here are methods to fine-tune and optimize KNN:

  • Cross-Validation: Use techniques like k-fold cross-validation to assess the model's performance across different values of K. Choose the K that provides the best balance between bias and variance. Common values to try are odd numbers to avoid ties.

  • Grid Search: Perform a grid search over a range of K values and other hyperparameters to find the combination that optimizes model performance (see the sketch after this list).

  • Handling Imbalanced Datasets: In real-world datasets, classes may be imbalanced, where some classes have significantly more samples than others. To address this:

  • Resampling: You can oversample the minority class or undersample the majority class to balance the dataset.

  • Use Weighted KNN: Some KNN implementations allow you to assign different weights to different neighbors, giving more importance to closer neighbors or those from the minority class.

  • Feature Scaling: Normalize or standardize the features before applying KNN. Feature scaling ensures that all features have the same influence on distance calculations. Common methods include Min-Max scaling and z-score normalization.

  • Outlier Handling: Outliers can significantly affect KNN's performance. Consider outlier detection and removal techniques to mitigate their impact on distance-based calculations.

  • Dimensionality Reduction: High-dimensional data can lead to the curse of dimensionality and adversely affect KNN's performance. Consider dimensionality reduction techniques such as Principal Component Analysis (PCA) or feature selection to reduce the number of features while preserving essential information.
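As one way of combining several of these ideas, the sketch below scales the features and then grid-searches over the number of neighbors, the neighbor weighting, and the distance metric. The grid values and the reuse of X_train and y_train from the earlier steps are illustrative assumptions:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Pipeline: scale the features, then apply KNN
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Example hyperparameter grid: number of neighbors, neighbor weighting, distance metric
param_grid = {
    "knn__n_neighbors": [3, 5, 7, 9, 11],
    "knn__weights": ["uniform", "distance"],
    "knn__metric": ["euclidean", "manhattan"],
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
grid_search.fit(X_train, y_train)
print("Best parameters:", grid_search.best_params_)
print("Best cross-validation accuracy:", grid_search.best_score_)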

Conclusion

In this article, we've explored the K-Nearest Neighbors (KNN) algorithm, its key concepts, and its implementation in Python. We've emphasized the significance of choosing the right value of K, experimenting with distance metrics, and addressing challenges like imbalanced datasets. Visualizations, such as decision boundary plots and confusion matrices, provide valuable insights into the model's behavior. Implementing KNN in Python empowers data scientists and machine learning practitioners to tackle a wide range of classification and regression tasks. We encourage readers to experiment with KNN in Python on their own datasets and harness the algorithm's power for their machine learning projects.
