K-means clustering is a popular unsupervised machine learning algorithm used to group similar data points together. It works by iteratively partitioning the data into K clusters, where K is a pre-defined number of clusters chosen before the algorithm runs.
Introduction to Clustering
Clustering algorithms seek to learn, from the properties of the data, an optimal division or discrete labeling of groups of points. Many clustering algorithms are available in Scikit-Learn and elsewhere, but perhaps the simplest to understand is an algorithm known as k-means clustering, which is implemented in sklearn.cluster.KMeans.
The _k_-means algorithm searches for a pre-determined number of clusters within an unlabeled multidimensional dataset. It accomplishes this using a simple conception of what the optimal clustering looks like:

- The cluster center is the arithmetic mean of all the points belonging to the cluster.
- Each point is closer to its own cluster center than to other cluster centers.

Those two assumptions are the basis of the _k_-means model.
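As a quick illustration of the model, here is a minimal sketch using scikit-learn's `KMeans` on synthetic `make_blobs` data (the synthetic dataset is chosen here purely for demonstration and is not part of this tutorial's data). It shows that, after fitting, each cluster center is the arithmetic mean of the points assigned to it:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data: 300 points around 4 well-separated centers
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Fit k-means with a pre-determined number of clusters, k = 4
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# After convergence, each cluster center matches the mean of its members
for k in range(4):
    members_mean = X[labels == k].mean(axis=0)
    print(kmeans.cluster_centers_[k], members_mean)
```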
Expectation–maximization (E–M) is a powerful algorithm that comes up in a variety of contexts within data science. _k_-means is a particularly simple and easy-to-understand application of the algorithm.
In short, the expectation–maximization approach here consists of the following procedure:

1. Guess some initial cluster centers.
2. Repeat until converged:
   - E-step: assign each point to its nearest cluster center.
   - M-step: set each cluster center to the mean of the points assigned to it.
Here the "E-step" or "Expectation step" is so-named because it involves updating our expectation of which cluster each point belongs to.
The "M-step" or "Maximization step" is so-named because it involves maximizing some fitness function that defines the location of the cluster centers — in this case, that maximization is accomplished by taking a simple mean of the data in each cluster.
The literature about this algorithm is vast, but can be summarized as follows: under typical circumstances, each repetition of the E-step and M-step will always result in a better estimate of the cluster characteristics.
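To make the two steps concrete, the E–M loop above can be sketched in plain NumPy. This is an illustrative from-scratch version (the function name `kmeans_em`, the random initialization, and the defaults are all choices made here; the sketch also assumes no cluster ever ends up empty):

```python
import numpy as np

def kmeans_em(X, k, n_iters=100, seed=0):
    """Plain k-means via expectation-maximization (Lloyd's algorithm)."""
    rng = np.random.default_rng(seed)
    # Initialize: pick k distinct data points as the starting centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # E-step: assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: move each center to the mean of its assigned points
        # (assumes every cluster keeps at least one member)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Tiny demo: two obvious clumps of four points each
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.],
              [10., 10.], [10., 11.], [11., 10.], [11., 11.]])
centers, labels = kmeans_em(X, k=2)
print(centers)
```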
Steps followed in K-means clustering
Here are the basic steps involved in K-means clustering:

1. Choose the number of clusters, K.
2. Initialize K cluster centroids (for example, by picking K random data points).
3. Assign each data point to its nearest centroid.
4. Recompute each centroid as the mean of the points assigned to it.
5. Repeat steps 3 and 4 until the centroids no longer change.
Caveats of expectation–maximization
There are a few issues to be aware of when using the expectation–maximization algorithm:

- The globally optimal result may not be achieved. Each run converges to a local optimum that depends on the initial guess, so in practice the algorithm is run from multiple starting points.
- The number of clusters must be selected beforehand. K-means cannot learn K from the data; it must be told how many clusters to expect.
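Because the globally optimal result is not guaranteed, implementations typically rerun the algorithm from several starting points. A small sketch with scikit-learn on synthetic `make_blobs` data (chosen here for illustration): a single random initialization can land in a poor local optimum, while `n_init` restarts with `k-means++` seeding keep the best (lowest-inertia) run:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# A single random initialization may converge to a poor local optimum...
single = KMeans(n_clusters=4, init="random", n_init=1, random_state=1).fit(X)

# ...so scikit-learn reruns the algorithm n_init times and keeps the result
# with the lowest inertia; init="k-means++" also spreads the starting centers.
restarted = KMeans(n_clusters=4, init="k-means++", n_init=10,
                   random_state=1).fit(X)

print(single.inertia_, restarted.inertia_)
```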
Selecting the number of clusters with silhouette analysis on KMeans clustering
Silhouette analysis can be used to study the separation distance between the resulting clusters. The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters and thus provides a way to assess parameters like number of clusters visually. This measure has a range of [-1, 1].
Silhouette coefficients (as these values are called) near +1 indicate that the sample is far away from the neighboring clusters. A value of 0 indicates that the sample is on or very close to the decision boundary between two neighboring clusters, and negative values indicate that those samples might have been assigned to the wrong cluster.
Calculation of Silhouette score
The Silhouette score is used to evaluate the quality of clusters created by algorithms such as K-means in terms of how well samples are grouped with other samples that are similar to them. It is calculated per sample. To compute it for each observation/data point, the following two distances need to be found:

- a: the mean distance between the observation and all other points in the same cluster (mean intra-cluster distance).
- b: the mean distance between the observation and all points in the nearest neighboring cluster (mean nearest-cluster distance).

The Silhouette Coefficient for a sample is then (b - a) / max(a, b).
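scikit-learn computes exactly these quantities via `silhouette_score` (the mean coefficient over all samples) and `silhouette_samples` (one coefficient per point). An illustrative sketch, again on synthetic `make_blobs` data: sweep candidate cluster counts and compare their mean silhouette, keeping the k with the highest value:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, silhouette_samples

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Compare candidate numbers of clusters by their mean silhouette coefficient
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)          # mean over all samples
    per_sample = silhouette_samples(X, labels)   # one coefficient per point
    print(f"k={k}: mean silhouette = {score:.3f}")
```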
Implementation of K-means
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Reading the data
df = pd.read_csv('data.csv')

# Viewing the data
df.head()

# Selecting the columns
X = df.iloc[:, [3, 4]].values

# Normalizing the data
sc = StandardScaler()
X = sc.fit_transform(X)

# Creating an instance of the KMeans class
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=0)

# Fitting the data
y_kmeans = kmeans.fit_predict(X)

# Visualizing the clusters
for cluster in range(5):
    plt.scatter(X[y_kmeans == cluster, 0], X[y_kmeans == cluster, 1],
                label=f'Cluster {cluster + 1}')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=200, c='black', label='Centroids')
plt.legend()
plt.show()
```
K-means clustering is a widely used unsupervised machine learning algorithm that groups similar data points together. It iteratively partitions the data into K clusters, where K is a pre-defined number of clusters. However, there are some caveats to be aware of: the number of clusters must be selected beforehand, and the algorithm may not achieve the globally optimal result. Silhouette analysis can be used to assess the separation distance between the resulting clusters and to determine the optimal number of clusters.
Answer: b. An unsupervised machine learning algorithm used to group similar data points together
Answer: c. It involves updating our expectation of which cluster each point belongs to
Answer: b. When the centroids no longer change
Answer: c. [-1, 1]
Answer: b. No. The number of clusters must be selected beforehand.