Harshini Bhat
Data Science Consultant at almaBetter
8 mins
2369
The K-nearest neighbors (KNN) algorithm in stands as one of the foundational pillars of Machine Learning. Its simplicity, interpretability, and remarkable versatility have propelled it to popularity in the field. By drawing on the principle of "birds of a feather flock together," KNN offers a straightforward yet powerful approach to classification and regression tasks.
In the real-world landscape of data analysis,k-nearest neighbor algorithm python has found its application in diverse domains, from healthcare and finance to recommendation systems and image recognition. Its ability to make predictions based on the similarity between data points makes it a valuable tool in tackling complex problems.
In this article, we will go through the inner workings of the KNN algorithm in Machine Learning in Python and demonstrate its practical implementation. Understanding KNN classifier mechanics and harnessing its capabilities in Python will equip you with a valuable skillset for a wide array of machine learning challenges.
K-Nearest Neighbors (KNN) is a straightforward and intuitive machine learning algorithm used for both classification and regression tasks. At its core, KNN relies on the principle that similar data points tend to be close to each other in the feature space. Let's delve into the fundamental concepts of KNN with knn code in python:
KNN is a supervised learning algorithm that makes predictions based on the majority class or average value of its "K" nearest data points in the training dataset.It is a non-parametric algorithm, meaning it doesn't make any assumptions about the underlying data distribution.
1. Data Points and Features: KNN operates on a dataset consisting of data points, each with multiple features or attributes. These data points can be thought of as points in a multidimensional space, where each feature represents a dimension.
2. Choosing K: The first step is to choose the value of "K," which represents the number of nearest neighbors to consider when making predictions. Typically, "K" is an odd number to avoid ties when classifying data points.
3. Distance Calculation: KNN relies on distance metrics to measure the similarity or dissimilarity between data points.Common distance metrics include Euclidean distance, Manhattan distance, and cosine similarity.
Euclidean distance, the most commonly used metric, is calculated as follows for two points, p and q, with n features:
Eucledean Distance
4. Finding the Neighbors: For a given data point (the one we want to make a prediction for), KNN calculates the distance between this point and all other points in the training dataset.
5. Selecting K Nearest Neighbors: KNN then selects the "K" data points with the smallest distances to the point being predicted. These "K" data points become the nearest neighbors.
In classification tasks, KNN counts the class labels of the "K" nearest neighbours.
The class label that occurs most frequently among these neighbors is assigned to the new data point.
Example: If out of the "K" neighbors, 5 belong to Class A and 3 belong to Class B, the prediction for the new data point is Class A.
In regression tasks, KNN calculates the average (or weighted average) of the target values of the "K" nearest neighbors. This average is assigned as the prediction for the new data point.
Example: If the target values of the "K" neighbors are [10, 12, 15, 8, 9], the prediction for the new data point is the average of these values, which is 10.8.
KNN Algorithm
In KNN, "K" represents the number of nearby data points considered when making predictions.
Choosing "K" is crucial:
The optimal "K" depends on the dataset and problem; it's often determined through techniques like cross-validation.
Selecting the right "K" balances sensitivity and robustness, impacting KNN's accuracy and adaptability.
Purpose: KNN employs distance metrics to gauge the similarity or dissimilarity between data points.
Common Metrics: Some common distance metrics in KNN are:
Euclidean Distance: This is the most widely used metric in KNN. It measures the straight-line distance between two points in a multidimensional space. The formula for Euclidean distance between points p and q with 2 features as follows.
Manhattan Distance: This metric calculates the distance by summing the absolute differences between the corresponding feature values. It is often used when movement can only occur along gridlines (e.g., in a city block grid).
Cosine Similarity: Rather than measuring distance, cosine similarity quantifies the cosine of the angle between two vectors. It's particularly useful when considering the orientation of data points in a high-dimensional space, such as in text analysis.
Eucledian and Manhattan Distance
These distance metrics help KNN determine the proximity of data points, aiding in neighbor selection and, subsequently, prediction. The choice of distance metric can impact the algorithm's performance, making it an essential consideration when implementing KNN.
Understanding these fundamental concepts and the pros and cons of KNN is crucial for effectively implementing and using the algorithm in real-world machine learning tasks.
We'll use Python and the scikit-learn library for this implementation. Here is the implementation of KNN algorithm in python- a basic knn classifier python code
Step 1: Loading the Dataset
from sklearn import datasets
from sklearn.model_selection import train_test_split
# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data # Features
y = iris.target # Target variable (classes)
# Split the dataset into a training set and a testing set (70% training, 30% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Step 2: Choosing the Value of K
# You can experiment with different values of K to find the optimal one through cross-validation.
Step 3: Calculating Distances
from sklearn.metrics import pairwise_distances
# For simplicity, we'll use Euclidean distance as the distance metric
distances = pairwise_distances(X_test, X_train, metric='euclidean')
Step 4: Finding the K-Nearest Neighbors
import numpy as np
def find_k_nearest_neighbors(distances, k):
# Find the indices of the K-nearest neighbors for each test sample
k_nearest_neighbors_indices = np.argsort(distances, axis=1)[:, :k]
return k_nearest_neighbors_indices
k = 5 # Choose the value of K
k_nearest_neighbors_indices = find_k_nearest_neighbors(distances, k)
Step 5: Making Predictions
from scipy.stats import mode
def predict_majority_class(k_nearest_neighbors_indices, y_train):
# Get the classes of the K-nearest neighbors
k_nearest_classes = y_train[k_nearest_neighbors_indices]
# Predict the class based on the majority class among the neighbors
predictions, _ = mode(k_nearest_classes, axis=1)
return predictions.ravel()
y_pred = predict_majority_class(k_nearest_neighbors_indices, y_train)
Step 6: Evaluating the Model's Performance
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
Utilizing optimization techniques in the K-Nearest Neighbors (KNN) algorithm is essential as it enhances model performance, mitigates overfitting, ensures robustness across datasets, handles complex data patterns, addresses imbalanced data, promotes generalization, improves model interpretability, and optimizes computational resources. These techniques help fine-tune KNN's hyperparameters, distance metrics, and data preprocessing steps, enabling it to adapt effectively to diverse data scenarios and deliver more accurate and reliable predictions.
Fine-tuning and optimizing the K-Nearest Neighbors (KNN) algorithm involves several key considerations and techniques to ensure it performs well in various scenarios. Here are methods to fine-tune and optimize KNN:
Cross-Validation: Use techniques like k-fold cross-validation to assess the model's performance across different values of K. Choose the K that provides the best balance between bias and variance. Common values to try are odd numbers to avoid ties.
Grid Search: Perform a grid search with a range of K values and other hyperparameters to find the best combination that optimizes model performance.
Handling Imbalanced Datasets: In real-world datasets, classes may be imbalanced, where some classes have significantly more samples than others. To address this:
Resampling: You can oversample the minority class or undersample the majority class to balance the dataset.
Use Weighted KNN: Some KNN implementations allow you to assign different weights to different neighbors, giving more importance to closer neighbors or those from the minority class.
Normalize or standardize the features before applying KNN. Feature scaling ensures that all features have the same influence on distance calculations. Common methods include Min-Max scaling and z-score normalization.
Outliers can significantly affect KNN's performance. Consider outlier detection and removal techniques to mitigate their impact on distance-based calculations.
High-dimensional data can lead to the curse of dimensionality and adversely affect KNN's performance. Consider dimensionality reduction techniques such as Principal Component Analysis (PCA) or feature selection to reduce the number of features while preserving essential information.
In this article, we've explored the K-Nearest Neighbors (KNN) algorithm, its key concepts, and its implementation in Python. We've emphasised the significance of choosing the right value of K, experimenting with distance metrics, and addressing challenges like imbalanced datasets. Visualisations, such as decision boundary plots and confusion matrices, provide valuable insights into the model's behavior. Implementing KNN in Python empowers data scientists and machine learning practitioners to tackle a wide range of classification and regression tasks. We encourage readers to experiment with KNN with python on their datasets and harness the algorithm's power for their machine-learning projects.