Module - 7 Unsupervised Learning

Metrics for Unsupervised Learning

Overview

Metrics for unsupervised learning are used to measure the quality of a model's performance on unsupervised learning tasks. These metrics are typically designed to assess the quality of the clusters the model produces and/or its ability to accurately identify outliers. Examples include homogeneity, completeness, V-measure, the silhouette coefficient, and the Davies-Bouldin index. In addition, the results of unsupervised learning can be evaluated for utility and usability: how well does the model identify meaningful patterns and relationships in the data?

Metrics

Metrics for unsupervised learning are used to evaluate the performance of clustering algorithms, dimensionality reduction techniques, and other unsupervised learning methods. Several common metrics are described below:

Silhouette Coefficient:

It takes values between -1 and 1, where a value close to 1 indicates that the data points within a cluster are tightly packed and the clusters are well separated from one another. A value close to -1 indicates that data points are misclassified or that the clusters overlap. For each point, it compares the mean distance to the other points in its own cluster with the mean distance to the points in the nearest neighbouring cluster.

Silhouette Coefficient (SC) = (b - a) / max(a, b)

Where:

a = mean distance to other points in the same cluster 

b = mean distance to other points in the next nearest cluster

Example: Suppose we have three clusters and pick one data point from each. The silhouette coefficient for each point is calculated as follows:

Point A:

a = mean distance to other points in the same cluster (cluster 1) = 0.2

b = mean distance to other points in the next nearest cluster (cluster 2) = 0.7

SC = (0.7 - 0.2)/max(0.2, 0.7) = 0.71

Point B:

a = mean distance to other points in the same cluster (cluster 2) = 0.5

b = mean distance to other points in the next nearest cluster (cluster 3) = 0.6

SC = (0.6 - 0.5)/max(0.5, 0.6) = 0.17

Point C:

a = mean distance to other points in the same cluster (cluster 3) = 0.4 

b = mean distance to other points in the next nearest cluster (cluster 1) = 0.9 

SC = (0.9 - 0.4)/max(0.4, 0.9) = 0.56
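The same computation can be done with scikit-learn. Below is a minimal sketch; the data points and cluster assignments are made up for illustration, not the ones in the example above:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

# Three small, well-separated groups of 2-D points (illustrative data).
X = np.array([[1.0, 1.2], [1.1, 0.9], [0.9, 1.0],   # around (1, 1)
              [5.0, 5.1], [5.2, 4.9], [4.8, 5.0],   # around (5, 5)
              [9.0, 1.0], [9.1, 1.2], [8.9, 0.8]])  # around (9, 1)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))    # mean coefficient over all points
print(silhouette_samples(X, labels))  # one (b - a)/max(a, b) value per point
```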

Calinski-Harabasz Index:

The Calinski-Harabasz index measures the ratio of between-cluster dispersion to within-cluster dispersion, each normalized by its degrees of freedom. It takes higher values for clusters that are dense and well separated.

Illustration: Consider a dataset consisting of two clusters, A and B. Cluster A contains 50 points and cluster B contains 30 points, with centroids at (3, 5) and (7, 13) respectively. The within-cluster dispersion is 0.2 for cluster A and 0.4 for cluster B, and the between-cluster dispersion is 0.6.

Formula: With k clusters and n points, the Calinski-Harabasz index is calculated as follows:

Calinski-Harabasz index = [Between-cluster dispersion / (k - 1)] / [Within-cluster dispersion / (n - k)]

Here k = 2 and n = 50 + 30 = 80, so:

Calinski-Harabasz index = (0.6 / 1) / ((0.2 + 0.4) / 78) = 0.6 / 0.0077 ≈ 78

Such a high index value indicates that the two clusters (A and B) are relatively dense and well separated.
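In practice the index is rarely computed by hand; scikit-learn provides it directly. A minimal sketch, with two random blobs standing in for clusters A and B (the exact values printed will differ from the hand-worked numbers above):

```python
import numpy as np
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(0)
A = rng.normal(loc=(3.0, 5.0), scale=0.2, size=(50, 2))   # 50 points near (3, 5)
B = rng.normal(loc=(7.0, 13.0), scale=0.4, size=(30, 2))  # 30 points near (7, 13)

X = np.vstack([A, B])
labels = np.array([0] * 50 + [1] * 30)

# Higher values indicate dense, well-separated clusters.
print(calinski_harabasz_score(X, labels))
```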

Davies-Bouldin Index:

The Davies-Bouldin index is the average, over all clusters, of each cluster's similarity to its most similar (worst-case) neighbouring cluster, where similarity combines within-cluster scatter and between-cluster separation. It takes lower values for clusters that are dense and well separated. The formula for the DBI is as follows:

DBI = (1/K) * Σ_i max_{j ≠ i} sim(c_i, c_j)

Where K is the number of clusters, c_i and c_j are two different clusters, and sim(c_i, c_j) = (s_i + s_j) / d_ij is the similarity function: the sum of the two clusters' within-cluster scatters s_i and s_j, divided by the distance d_ij between their centroids.

For example, if we have three clusters A, B, and C, we can calculate the DBI as follows:

DBI = (1/3) * (max(sim(A, B), sim(A, C)) + max(sim(A, B), sim(B, C)) + max(sim(A, C), sim(B, C)))

For example, if the within-cluster scatters for clusters A, B, and C are 10, 20, and 30 respectively, and the distance between the centroids of A and B is 8, A and C is 12, and B and C is 10, then:

sim(A, B) = (10 + 20) / 8 = 3.75
sim(A, C) = (10 + 30) / 12 ≈ 3.33
sim(B, C) = (20 + 30) / 10 = 5.00

DBI = (1/3) * (max(3.75, 3.33) + max(3.75, 5.00) + max(3.33, 5.00)) = (1/3) * (3.75 + 5.00 + 5.00) ≈ 4.58
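The hand computation above can be checked against scikit-learn's implementation. A minimal sketch on made-up data:

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

# Three compact, well-separated clusters (illustrative data).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],      # cluster 0
              [5.0, 5.0], [5.0, 6.0], [6.0, 5.0],      # cluster 1
              [0.0, 10.0], [1.0, 10.0], [0.0, 11.0]])  # cluster 2
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

# Lower values indicate dense, well-separated clusters.
print(davies_bouldin_score(X, labels))
```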

Adjusted Rand Index:

The adjusted Rand index measures the similarity between the true labels and the predicted labels, correcting for chance agreement. It takes values between -1 and 1; a value close to 1 indicates that the predicted labels match the true labels (up to a relabelling of the clusters), while a value near 0 indicates a labelling no better than chance.

The formula for the ARI is:

ARI = (Σij C(nij, 2) - [Σi C(ai, 2) * Σj C(bj, 2)] / C(n, 2)) / ((1/2) * [Σi C(ai, 2) + Σj C(bj, 2)] - [Σi C(ai, 2) * Σj C(bj, 2)] / C(n, 2))

where nij is the number of elements in the intersection of cluster i of the first partition and cluster j of the second, ai and bj are the total numbers of elements in cluster i and cluster j respectively, n is the total number of elements in the data set, and C(m, 2) = m(m - 1)/2 is the number of pairs that can be formed from m elements.

For example, suppose you have a data set of 100 elements and two partitions of it: one with 50 elements in cluster A and 50 in cluster B, and the other with 60 elements in cluster A and 40 in cluster B. Suppose further that all 50 elements of the first partition's cluster A fall in the second partition's cluster A (so 10 of the first partition's B elements also land in the second A). Then:

Σij C(nij, 2) = C(50, 2) + C(10, 2) + C(40, 2) = 1225 + 45 + 780 = 2050
Σi C(ai, 2) = C(50, 2) + C(50, 2) = 2450
Σj C(bj, 2) = C(60, 2) + C(40, 2) = 2550
C(n, 2) = C(100, 2) = 4950

Expected index = (2450 * 2550) / 4950 ≈ 1262.12

ARI = (2050 - 1262.12) / (0.5 * (2450 + 2550) - 1262.12) = 787.88 / 1237.88 ≈ 0.64
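The same number can be reproduced with scikit-learn. A minimal sketch, encoding the two partitions under the overlap assumption stated above:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# First partition: 50 elements in A (0), 50 in B (1).
part1 = np.array([0] * 50 + [1] * 50)
# Second partition: 60 elements in A (0), 40 in B (1); the first 50
# elements are in A in both partitions, so 10 of part1's B fall in A here.
part2 = np.array([0] * 60 + [1] * 40)

print(adjusted_rand_score(part1, part2))  # ≈ 0.64
```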

Mutual Information:

The mutual information measures the amount of information shared between the true labels and the predicted labels. It is non-negative, with 0 indicating independent labelings; it is not capped at 1, although the normalized variant (NMI) rescales it to the range 0 to 1, where a value close to 1 indicates that the predicted labels carry essentially the same information as the true labels.

The formula for Mutual Information is as follows:

MI(X,Y) = ∑x∈X∑y∈Y p(x,y) log2 (p(x,y) / p(x)p(y))

For example, to measure the mutual information between X = number of hours of sleep and Y = number of hours of studying, estimate the joint probabilities p(x, y) and the marginals p(x) and p(y) from the observed pairs and substitute them into the formula above. If X and Y are independent, every term is log2(1) = 0 and MI = 0; the more knowing X narrows down Y, the larger the mutual information becomes.
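A minimal sketch of the same idea with scikit-learn, using invented sleep/study values. Note that mutual_info_score uses natural logarithms rather than log2, and the normalized variant is the one bounded by 1:

```python
import numpy as np
from sklearn.metrics import mutual_info_score, normalized_mutual_info_score

sleep = np.array([6, 6, 7, 7, 8, 8, 8, 9])  # X: hours of sleep (discretized)
study = np.array([4, 4, 3, 3, 2, 2, 2, 1])  # Y: hours of studying (discretized)

# Each sleep value maps to exactly one study value, so the two
# variables share all their information.
print(mutual_info_score(sleep, study))             # MI in nats, >= 0
print(normalized_mutual_info_score(sleep, study))  # rescaled to [0, 1]; here 1.0
```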

Conclusion

It is important to note that different metrics may be appropriate for different types of unsupervised learning tasks, and the choice of metric should be made based on the specific requirements of the problem at hand.

Key takeaways

  1. Use clustering metrics such as the silhouette coefficient, Calinski-Harabasz index, and Davies-Bouldin index to assess the quality of clusters.
  2. Use the adjusted Rand index and mutual information to measure the agreement between two labelings, such as predicted clusters versus ground-truth labels.
  3. Use reconstruction error (e.g., root mean squared error) to evaluate dimensionality reduction techniques.
  4. Use perplexity and log-likelihood scores to measure the quality of a generative model.
  5. When ground-truth labels are available, pair-counting precision and recall can also be used to evaluate a clustering algorithm.
  6. Use silhouette scores to assess both the separation between clusters and the degree of overlap between them.
  7. Single, complete, and average linkage define how the distance between clusters is measured in hierarchical clustering; they are clustering criteria rather than evaluation metrics.

Quiz

  1. What is the most common unsupervised learning method?
    a. K-Means Clustering
    b. Hierarchical Clustering
    c. Reinforcement Learning
    d. Naive Bayes

Answer: a. K-Means Clustering

  2. Which of the following is not an evaluation metric for unsupervised learning?
    a. Precision
    b. Accuracy
    c. Silhouette Score
    d. F-Score

Answer: d. F-Score

  3. What is the goal of unsupervised learning?
    a. To identify patterns in data
    b. To predict labels
    c. To create new data
    d. To classify data

Answer: a. To identify patterns in data

  4. What is the most commonly used metric to measure the quality of clusters?
    a. Mutual Information
    b. Adjusted Rand Index
    c. Mean Squared Error
    d. Root Mean Squared Error

Answer: b. Adjusted Rand Index
