Course Outline

What is Unsupervised Learning?

K-means Clustering

Hierarchical Clustering in Machine Learning

Metrics for Unsupervised Learning

Hierarchical Clustering in Machine Learning

Last Updated: 19th October, 2024

Hierarchical clustering could be a strategy of clustering data into groups or clusters based on their similarity. It may be a type of unsupervised learning, which implies that it does not require labeled information to create expectations.

In hierarchical clustering, the information is at first represented as a set of individual points, and the calculation continues by iteratively consolidating the closest sets of focuses or clusters until all the focuses have a place to a single cluster or to a indicated number of clusters.

There are two fundamental sorts of hierarchical clustering: agglomerative and divisive. The yield of hierarchical clustering is usually represented as a dendrogram, which may be a tree-like diagram that shows the various leveled connections between the clusters. The dendrogram can be utilized to distinguish the ideal number of clusters based on the structure of the tree.

Why Hierarchical Clustering?

K-means follows an iterative process. It will keep on running until the centroids of newly formed clusters do not change or the maximum number of iterations are reached.

But there are certain challenges with K-means. It always makes spherical clusters. Also, we have to decide the number of clusters at the beginning of the algorithm. Ideally, we would not know how many clusters should we have, in the beginning of the algorithm and hence it a challenge with K-means.

Hierarchical clustering takes away the problem of having to pre-define the number of clusters. So, let’s see what hierarchical clustering is and how it improves on K-means.

Agglomerative Clustering

Agglomerative clustering may be a sort of hierarchical clustering in which each information point begins as its claim cluster, and clusters are iteratively blended until a halting model is met.

At each iteration, the two closest clusters are merged into a new, bigger cluster. The distance between clusters can be measured employing an assortment of measurements, such as Euclidean distance, cosine similarity, or correlation coefficient.

The method proceeds until all information focuses have a place to a single cluster, or until a ceasing basis, such as a greatest number of clusters or a least remove limit, is met.

The yield of agglomerative clustering can be visualized as a dendrogram, which could be a tree-like diagram that shows the various leveled connections between the clusters. The tallness of the branches within the dendrogram speaks to the separate between clusters at each emphasis.

Agglomerative clustering could be a well known strategy for clustering information since it is simple to execute and can handle huge datasets but can be computationally expensive in case of expansive datasets.

Agglomerative Clustering Steps

Agglomerative clustering follows a bottom-up approach, where each data point starts as its own cluster, and clusters are merged step by step.

Initialize clusters: Each data point starts as its own cluster.
Calculate proximity matrix: Compute the distance between each pair of clusters (single, complete, average, or Ward linkage).
Merge clusters: Find the two closest clusters based on the proximity matrix and merge them into a single cluster.
Update the proximity matrix: Recalculate the distance between the new cluster and all remaining clusters.
Repeat: Continue merging the closest clusters until all points belong to a single cluster or the desired number of clusters is reached.
Output: Visualize the cluster formation through a dendrogram or select a point to cut the dendrogram for a desired number of clusters.

Agglomerative Clustering Python Implementation

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering


# Sample data
X = np.array([[5, 3], [10, 15], [15, 12], [24, 10], [30, 30], [85, 70], [71, 80], [60, 78], [70, 55]])


# Perform Agglomerative Clustering using scikit-learn
agg_clustering = AgglomerativeClustering(n_clusters=3, metric='euclidean', linkage='ward')
agg_clustering.fit(X)


# Display the cluster labels for each point
print(f"Cluster labels: {agg_clustering.labels_}")


# Plotting dendrogram for visualizing the hierarchical clusters
linked = linkage(X, 'ward')
plt.figure(figsize=(10, 7))
dendrogram(linked, orientation='top', distance_sort='descending', show_leaf_counts=True)
plt.show()

Output:

Agglomerative Clustering Output

The AgglomerativeClustering from scikit-learn is used with the Ward linkage method.
metric='euclidean': This specifies the distance metric (Euclidean in this case).
linkage='ward': Ward's method minimizes the variance of the clusters being merged.
A dendrogram is created using scipy.cluster.hierarchy.dendrogram to visually represent the hierarchical structure.

Divisive Clustering

Divisive clustering is a type of hierarchical clustering in which all data points start in a single cluster and clusters are recursively divided until a stopping criterion is met.

At each iteration, the cluster with the highest variance or the greatest dissimilarity among its data points is split into two smaller clusters. The splitting can be performed using various algorithms such as k-means or hierarchical clustering.

The process continues until each cluster contains only one data point, or until a stopping criterion, such as a maximum number of clusters or a minimum variance threshold, is met.

The output of divisive clustering can also be visualized as a dendrogram, but in contrast to agglomerative clustering, the dendrogram is built from the bottom up, starting with individual data points and building up to the final clusters.

Divisive clustering can produce more meaningful clusters when the data has a clear structure, but it can be computationally expensive and sensitive to the choice of algorithm used for splitting the clusters.

Divisive Clustering Steps

Divisive clustering follows a top-down approach, starting with a single cluster that contains all data points and then recursively splits the clusters.

Initialize with all data in one cluster.
Calculate dissimilarity: Compute the variance or dissimilarity of points within the cluster.
Split the cluster: Identify the most dissimilar points and split the cluster into two sub-clusters. This can be done using methods such as K-means.
Update the clusters: Calculate the new dissimilarity or distance between data points in each sub-cluster.
Repeat: Continue recursively splitting the clusters until each data point forms its own cluster or a stopping criterion (like a fixed number of clusters) is met.

Divisive Clustering Python Implementation

There is no built-in divisive clustering algorithm in scikit-learn. Below is a sample example by simulating using recursive K-means clustering:

from sklearn.cluster import KMeans
import numpy as np


# Divisive Clustering Function
def divisive_clustering(X, n_clusters=2, depth=0):
    if len(X) <= n_clusters:
        return [X]  # Stop when we have fewer points than clusters


    # Apply KMeans clustering to divide the data into two clusters
    kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(X)
    labels = kmeans.labels_
   
    # Recursively apply divisive clustering to each cluster
    clusters = []
    for cluster_id in range(n_clusters):
        cluster_points = X[labels == cluster_id]
        if len(cluster_points) > 1:
            clusters += divisive_clustering(cluster_points, n_clusters)
        else:
            clusters.append(cluster_points)
   
    return clusters


# Sample data
X = np.array([[5, 3], [10, 15], [15, 12], [24, 10], [30, 30], [85, 70], [71, 80], [60, 78], [70, 55]])


# Apply Divisive Clustering
clusters = divisive_clustering(X, n_clusters=2)


# Display clusters
for i, cluster in enumerate(clusters):
    print(f"Cluster {i+1}: {cluster}")

Output:

Divisive Clustering Output

We simulate a top-down approach using KMeans to recursively split the dataset into smaller clusters.
The divisive_clustering function recursively splits the dataset until each point is in its own cluster (or other criteria like the number of clusters is met).
The function prints out the data points in each final cluster.

Key Differences Between Agglomerative and Divisive Clustering

Agglomerative clustering is more common and easier to implement. It starts with individual points and merges them.
Divisive clustering starts with a single cluster containing all data points and recursively splits it. It’s computationally more expensive because it requires repeated splitting.

Both approaches offer unique advantages and can be visualized using dendrograms to determine the structure and optimal number of clusters.

How Hierarchical Clustering Calculates Distance?

In hierarchical clustering, a proximity matrix is a square matrix that stores the distances between each pair of data points. It is used to determine the similarity or dissimilarity between data points or clusters and to merge or split them during the clustering process.

Here's an example of a proximity matrix for five data points:

Let’s take a sample of 5 students:

Student_ID	Marks
1	10
2	7
3	28
4	20
5	35

Creating a Proximity Matrix

First, we will create a proximity matrix which will tell us the distance between each of these points. Since we are calculating the distance of each point from each of the other points, we will get a square matrix of shape n X n (where n is the number of observations).

Let’s make the 5 x 5 proximity matrix for our example:

ID	1	2	3	4	5
1	0	3	18	10	25
2	3	0	21	13	28
3	18	21	0	8	7
4	10	13	8	0	15
5	25	28	7	15	0

The diagonal elements of this matrix will always be 0 as the distance of a point with itself is always 0. We will use the Euclidean distance formula to calculate the rest of the distances. So, let’s say we want to calculate the distance between point 1 and 2:

√(10−7)2=√9=3

Similarly, we can calculate all the distances and fill the proximity matrix.

Steps for Hierarchical Clustering

Calculate the proximity matrix: Calculate the distance or similarity measure between each pair of data points and store the values in a proximity matrix.
Initialize the clusters: At the beginning of the clustering process, each data point is treated as a separate cluster.

Determine the closest clusters: Identify the two closest clusters based on the values in the proximity matrix. This can be done by finding the minimum value in the matrix.

Merge the closest clusters: Merge the two closest clusters into a single cluster. Update the proximity matrix to reflect the new distances between the merged cluster and the remaining clusters.

Student_ID	Marks
(1,2)	10
3	28
4	20
5	35

ID	(1,2)	3	4	5
(1,2)	0	18	10	25
3	18	0	8	7
4	10	8	0	15
5	25	7	15	0

Repeat steps 3-4: Repeat steps 3-4 until all data points are in a single cluster or until a stopping criterion is met.
Visualize the dendrogram: The clustering hierarchy can be visualized as a dendrogram, which shows the order in which the clusters were merged.

Determine the number of clusters: Determine the number of clusters based on the dendrogram or by setting a threshold for the distance between clusters.

These steps apply to agglomerative clustering, which is the most common type of hierarchical clustering. Divisive clustering, on the other hand, works by recursively dividing the data points into smaller clusters until a stopping criterion is met.

Linkages Used in Hierarchical Clustering

Linkage refers to the criterion used to determine the distance between clusters in hierarchical clustering. Here are some commonly used linkage methods:

Single linkage: Also known as nearest-neighbor linkage, this method calculates the distance between the closest points of the two clusters being merged.
Complete linkage: Also known as farthest-neighbor linkage, this method calculates the distance between the farthest points of the two clusters being merged.
Average linkage: This method calculates the average distance between all pairs of points in the two clusters being merged.
Ward's linkage: This method minimizes the variance of the distances between all pairs of points within the newly formed cluster. This results in clusters that are relatively homogeneous and compact.
Centroid linkage: This method calculates the distance between the centroids of the two clusters being merged.

The choice of linkage method can have a significant impact on the resulting clusters. For example, single linkage tends to produce elongated clusters, while complete linkage produces more compact clusters. Ward's linkage tends to produce well-separated and evenly sized clusters. Therefore, it is important to choose the linkage method that best fits the characteristics of the data being clustered.

Conclusion

Hierarchical clustering is an unsupervised machine learning technique that groups data points into clusters based on their similarity. Unlike K-means, it does not require the number of clusters to be pre-defined, and it can handle large datasets. There are two main types of hierarchical clustering: agglomerative and divisive. Both methods use a proximity matrix to calculate the distances between data points or clusters.

Key Takeaways

Hierarchical clustering is an unsupervised learning method that clusters data points into groups or clusters based on their similarity.
It is performed by merging the closest pairs of points or clusters iteratively until all points belong to a single cluster or to a specified number of clusters.
Agglomerative clustering is a popular method that starts with each data point as its own cluster and iteratively merges the two closest clusters until all data points belong to a single cluster.
Divisive clustering is a method that starts with all data points in a single cluster and recursively divides the clusters until each cluster contains only one data point.
The output of both methods is a dendrogram, which is a tree-like diagram that shows the hierarchical relationships between the clusters.
The dendrogram can be used to visualize the clustering results and identify the optimal number of clusters based on the structure of the tree.
Hierarchical clustering can handle large datasets and does not require pre-defining the number of clusters, unlike K-means clustering.
A proximity matrix is used to determine the similarity or dissimilarity between data points or clusters and to merge or split them during the clustering process.

Quiz

What is hierarchical clustering?
1. A method of clustering data points into groups or clusters based on their similarity
2. A method of clustering data points based on labels
3. A supervised learning technique
4. A technique that requires pre-defined number of clusters

Answer: a. A method of clustering data points into groups or clusters based on their similarity

What are the two main types of hierarchical clustering?
1. K-means and agglomeration
2. Divisive and hierarchical
3. Agglomerative and divisive
4. Supervised and unsupervised

Answer: c. Agglomerative and divisive

What is the stopping criterion in hierarchical clustering?
1. When all data points belong to a single cluster
2. When a specified number of clusters is reached
3. A minimum distance threshold
4. All of the above

Answer: d. All of the above

What is the output of agglomerative clustering?
1. A tree-like diagram that shows the hierarchical relationships between the clusters
2. A square matrix that stores the distances between each pair of data points
3. Spherical clusters
4. A pre-defined number of clusters

Answer: a. A tree-like diagram that shows the hierarchical relationships between the clusters

What is the difference between agglomerative and divisive clustering?
1. Agglomerative clustering starts with all data points in a single cluster, whereas divisive clustering starts with each data point as its own cluster.
2. Agglomerative clustering merges the closest pairs of points or clusters, whereas divisive clustering recursively divides clusters.
3. Agglomerative clustering produces more meaningful clusters when the data has a clear structure, whereas divisive clustering is sensitive to the choice of distance metric.
4. Agglomerative clustering builds the dendrogram from the bottom up, starting with individual data points, whereas divisive clustering builds the dendrogram from the top down.

Answer: b. Agglomerative clustering merges the closest pairs of points or clusters, whereas divisive clustering recursively divides clusters.

Module 7: Unsupervised Learning