Module - 7 Unsupervised Learning

Metrics for Unsupervised Learning

Overview

Metrics for unsupervised learning are used to measure the quality of a model's performance on unsupervised learning tasks. These metrics are typically designed to assess the quality of the clusters the model produces and/or its ability to accurately identify outliers. Examples include homogeneity, completeness, V-measure, the silhouette coefficient, and the Davies-Bouldin index. In addition, the results of unsupervised learning can be evaluated for utility and usability: how well does the model identify meaningful patterns and relationships in the data?

Metrics

Metrics for unsupervised learning are used to evaluate the performance of clustering algorithms, dimensionality reduction techniques, and other unsupervised learning methods. Several common metrics are described below:

Silhouette Coefficient:

It takes values between -1 and 1, where a value close to 1 indicates that the data points within a cluster are tightly packed and the clusters are well separated from one another. A value close to -1 indicates that data points are misclassified or that the clusters overlap. For each point, it compares the mean distance to the other points in its own cluster with the mean distance to the points in the nearest neighbouring cluster.

Silhouette Coefficient (SC) = (b - a) / max(a, b)

Where:

a = mean distance to other points in the same cluster 

b = mean distance to other points in the next nearest cluster

Example: Suppose we have three clusters and pick one data point from each. The silhouette coefficient for each point is calculated as follows:

Point A:

a = mean distance to other points in the same cluster (cluster 1) = 0.2

b = mean distance to other points in the next nearest cluster (cluster 2) = 0.7

SC = (0.7 - 0.2)/max(0.2, 0.7) = 0.71

Point B:

a = mean distance to other points in the same cluster (cluster 2) = 0.5

b = mean distance to other points in the next nearest cluster (cluster 3) = 0.6

SC = (0.6 - 0.5)/max(0.5, 0.6) = 0.17

Point C:

a = mean distance to other points in the same cluster (cluster 3) = 0.4 

b = mean distance to other points in the next nearest cluster (cluster 1) = 0.9 

SC = (0.9 - 0.4)/max(0.4, 0.9) = 0.56
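The same computation can be done with scikit-learn. Below is a minimal sketch; the data points and cluster assignments are made up for illustration, not the ones in the example above:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

# Three small, well-separated groups of 2-D points (illustrative data).
X = np.array([[1.0, 1.2], [1.1, 0.9], [0.9, 1.0],   # around (1, 1)
              [5.0, 5.1], [5.2, 4.9], [4.8, 5.0],   # around (5, 5)
              [9.0, 1.0], [9.1, 1.2], [8.9, 0.8]])  # around (9, 1)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))    # mean coefficient over all points
print(silhouette_samples(X, labels))  # one (b - a)/max(a, b) value per point
```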

Calinski-Harabasz Index:

The Calinski-Harabasz index measures the ratio of between-cluster dispersion to within-cluster dispersion, each normalized by its degrees of freedom. It takes higher values for clusters that are dense and well separated.

Illustration: Consider a dataset consisting of two clusters, A and B. Cluster A contains 50 points and cluster B contains 30 points, with centroids at (3, 5) and (7, 13) respectively. The within-cluster dispersion is 0.2 for cluster A and 0.4 for cluster B, and the between-cluster dispersion is 0.6.

Formula: With k clusters and n points, the Calinski-Harabasz index is calculated as follows:

Calinski-Harabasz index = [Between-cluster dispersion / (k - 1)] / [Within-cluster dispersion / (n - k)]

Here k = 2 and n = 50 + 30 = 80, so:

Calinski-Harabasz index = (0.6 / 1) / ((0.2 + 0.4) / 78) = 0.6 / 0.0077 ≈ 78

Such a high index value indicates that the two clusters (A and B) are relatively dense and well separated.
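In practice the index is rarely computed by hand; scikit-learn provides it directly. A minimal sketch, with two random blobs standing in for clusters A and B (the exact values printed will differ from the hand-worked numbers above):

```python
import numpy as np
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(0)
A = rng.normal(loc=(3.0, 5.0), scale=0.2, size=(50, 2))   # 50 points near (3, 5)
B = rng.normal(loc=(7.0, 13.0), scale=0.4, size=(30, 2))  # 30 points near (7, 13)

X = np.vstack([A, B])
labels = np.array([0] * 50 + [1] * 30)

# Higher values indicate dense, well-separated clusters.
print(calinski_harabasz_score(X, labels))
```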

Davies-Bouldin Index:

The Davies-Bouldin index is the average, over all clusters, of each cluster's similarity to its most similar (worst-case) neighbouring cluster, where similarity combines within-cluster scatter and between-cluster separation. It takes lower values for clusters that are dense and well separated. The formula for the DBI is as follows:

DBI = (1/K) * Σ_i max_{j ≠ i} sim(c_i, c_j)

Where K is the number of clusters, c_i and c_j are two different clusters, and sim(c_i, c_j) = (s_i + s_j) / d_ij is the similarity function: the sum of the two clusters' within-cluster scatters s_i and s_j, divided by the distance d_ij between their centroids.

For example, if we have three clusters A, B, and C, we can calculate the DBI as follows:

DBI = (1/3) * (max(sim(A, B), sim(A, C)) + max(sim(A, B), sim(B, C)) + max(sim(A, C), sim(B, C)))

For example, if the within-cluster scatters for clusters A, B, and C are 10, 20, and 30 respectively, and the distance between the centroids of A and B is 8, A and C is 12, and B and C is 10, then:

sim(A, B) = (10 + 20) / 8 = 3.75
sim(A, C) = (10 + 30) / 12 ≈ 3.33
sim(B, C) = (20 + 30) / 10 = 5.00

DBI = (1/3) * (max(3.75, 3.33) + max(3.75, 5.00) + max(3.33, 5.00)) = (1/3) * (3.75 + 5.00 + 5.00) ≈ 4.58
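The hand computation above can be checked against scikit-learn's implementation. A minimal sketch on made-up data:

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

# Three compact, well-separated clusters (illustrative data).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],      # cluster 0
              [5.0, 5.0], [5.0, 6.0], [6.0, 5.0],      # cluster 1
              [0.0, 10.0], [1.0, 10.0], [0.0, 11.0]])  # cluster 2
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

# Lower values indicate dense, well-separated clusters.
print(davies_bouldin_score(X, labels))
```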

Adjusted Rand Index:

The adjusted Rand index measures the similarity between the true labels and the predicted labels, correcting for chance agreement. It takes values between -1 and 1; a value close to 1 indicates that the predicted labels match the true labels (up to a relabelling of the clusters), while a value near 0 indicates a labelling no better than chance.

The formula for the ARI is:

ARI = (Σij C(nij, 2) - [Σi C(ai, 2) * Σj C(bj, 2)] / C(n, 2)) / ((1/2) * [Σi C(ai, 2) + Σj C(bj, 2)] - [Σi C(ai, 2) * Σj C(bj, 2)] / C(n, 2))

where nij is the number of elements in the intersection of cluster i of the first partition and cluster j of the second, ai and bj are the total numbers of elements in cluster i and cluster j respectively, n is the total number of elements in the data set, and C(m, 2) = m(m - 1)/2 is the number of pairs that can be formed from m elements.

For example, suppose you have a data set of 100 elements and two partitions of it: one with 50 elements in cluster A and 50 in cluster B, and the other with 60 elements in cluster A and 40 in cluster B. Suppose further that all 50 elements of the first partition's cluster A fall in the second partition's cluster A (so 10 of the first partition's B elements also land in the second A). Then:

Σij C(nij, 2) = C(50, 2) + C(10, 2) + C(40, 2) = 1225 + 45 + 780 = 2050
Σi C(ai, 2) = C(50, 2) + C(50, 2) = 2450
Σj C(bj, 2) = C(60, 2) + C(40, 2) = 2550
C(n, 2) = C(100, 2) = 4950

Expected index = (2450 * 2550) / 4950 ≈ 1262.12

ARI = (2050 - 1262.12) / (0.5 * (2450 + 2550) - 1262.12) = 787.88 / 1237.88 ≈ 0.64
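The same number can be reproduced with scikit-learn. A minimal sketch, encoding the two partitions under the overlap assumption stated above:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# First partition: 50 elements in A (0), 50 in B (1).
part1 = np.array([0] * 50 + [1] * 50)
# Second partition: 60 elements in A (0), 40 in B (1); the first 50
# elements are in A in both partitions, so 10 of part1's B fall in A here.
part2 = np.array([0] * 60 + [1] * 40)

print(adjusted_rand_score(part1, part2))  # ≈ 0.64
```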

Mutual Information:

The mutual information measures the amount of information shared between the true labels and the predicted labels. It is non-negative, with 0 indicating independent labelings; it is not capped at 1, although the normalized variant (NMI) rescales it to the range 0 to 1, where a value close to 1 indicates that the predicted labels carry essentially the same information as the true labels.

The formula for Mutual Information is as follows:

MI(X,Y) = ∑x∈X∑y∈Y p(x,y) log2 (p(x,y) / p(x)p(y))

For example, to measure the mutual information between X = number of hours of sleep and Y = number of hours of studying, estimate the joint probabilities p(x, y) and the marginals p(x) and p(y) from the observed pairs and substitute them into the formula above. If X and Y are independent, every term is log2(1) = 0 and MI = 0; the more knowing X narrows down Y, the larger the mutual information becomes.
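A minimal sketch of the same idea with scikit-learn, using invented sleep/study values. Note that mutual_info_score uses natural logarithms rather than log2, and the normalized variant is the one bounded by 1:

```python
import numpy as np
from sklearn.metrics import mutual_info_score, normalized_mutual_info_score

sleep = np.array([6, 6, 7, 7, 8, 8, 8, 9])  # X: hours of sleep (discretized)
study = np.array([4, 4, 3, 3, 2, 2, 2, 1])  # Y: hours of studying (discretized)

# Each sleep value maps to exactly one study value, so the two
# variables share all their information.
print(mutual_info_score(sleep, study))             # MI in nats, >= 0
print(normalized_mutual_info_score(sleep, study))  # rescaled to [0, 1]; here 1.0
```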

Conclusion

It is important to note that different metrics may be appropriate for different types of unsupervised learning tasks, and the choice of metric should be made based on the specific requirements of the problem at hand.

Key takeaways

  1. Use clustering metrics such as the silhouette coefficient, Calinski-Harabasz index, and Davies-Bouldin index to assess the quality of clusters.
  2. Use the adjusted Rand index and mutual information to measure the agreement between two labelings, such as predicted clusters versus ground-truth labels.
  3. Use reconstruction error (e.g., root mean squared error) to evaluate dimensionality reduction techniques.
  4. Use perplexity and log-likelihood scores to measure the quality of a generative model.
  5. When ground-truth labels are available, pair-counting precision and recall can also be used to evaluate a clustering algorithm.
  6. Use silhouette scores to assess both the separation between clusters and the degree of overlap between them.
  7. Single, complete, and average linkage define how the distance between clusters is measured in hierarchical clustering; they are clustering criteria rather than evaluation metrics.

Quiz

  1. What is the most common unsupervised learning method?
    a. K-Means Clustering
    b. Hierarchical Clustering
    c. Reinforcement Learning
    d. Naive Bayes

Answer: a. K-Means Clustering

  2. Which of the following is not an evaluation metric for unsupervised learning?
    a. Precision
    b. Accuracy
    c. Silhouette Score
    d. F-Score

Answer: d. F-Score

  3. What is the goal of unsupervised learning?
    a. To identify patterns in data
    b. To predict labels
    c. To create new data
    d. To classify data

Answer: a. To identify patterns in data

  4. What is the most commonly used metric to measure the quality of clusters?
    a. Mutual Information
    b. Adjusted Rand Index
    c. Mean Squared Error
    d. Root Mean Squared Error

Answer: b. Adjusted Rand Index
