Clustering is an unsupervised machine learning method for partitioning dataset into a set of groups or clusters. A big issue is that clustering methods will return clusters even if the data does not contain any clusters. Therefore, it’s necessary i) to assess clustering tendency before the analysis and ii) to validate the quality of the result after clustering.
A variety of measures has been proposed in the literature for evaluating clustering results. The term clustering validation is used to design the procedure of evaluating the results of a clustering algorithm.
Generally, clustering validation statistics can be categorized into 4 classes (Theodoridis and Koutroubas, 2008; G. Brock et al., 2008, Charrad et al., 2014):
Relative clustering validation, which evaluates the clustering structure by varying different parameter values for the same algorithm (e.g.,: varying the number of clusters k). It’s generally used for determining the optimal number of clusters.
External clustering validation, which consists in comparing the results of a cluster analysis to an externally known result, such as externally provided class labels. Since we know the “true” cluster number in advance, this approach is mainly used for selecting the right clustering algorithm for a specific dataset.
Internal clustering validation, which use the internal information of the clustering process to evaluate the goodness of a clustering structure without reference to external information. It can be also used for estimating the number of clusters and the appropriate clustering algorithm without any external data.
The aim of this article is to:
In all the examples presented here, we’ll apply k-means, PAM and hierarchical clustering. Note that, the functions used in this article can be applied to evaluate the validity of any other clustering methods.
The following packages will be used:
Install factoextra package as follow:
if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/factoextra")
The remaining packages can be installed using the code below:
pkgs <- c("cluster", "fpc", "NbClust")
install.packages(pkgs)
Load packages:
library(factoextra)
library(cluster)
library(fpc)
library(NbClust)
The data set iris is used. We start by excluding the column “Species” and scaling the data using the function scale():
# Load the data
data(iris)
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
# Remove species column (5) and scale the data
iris.scaled <- scale(iris[, -5])
Iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
Many indices (more than 30) has been published in the literature for finding the right number of clusters in a dataset. The process has been covered in my previous article: Determining the optimal number of clusters.
In this section we’ll use the package NbClust which will compute, with a single function call, 30 indices for deciding the right number of clusters in the dataset:
# Compute the number of clusters
library(NbClust)
nb <- NbClust(iris.scaled, distance = "euclidean", min.nc = 2,
max.nc = 10, method = "complete", index ="all")
# Visualize the result
library(factoextra)
fviz_nbclust(nb) + theme_minimal()
## Among all indices:
## ===================
## * 2 proposed 0 as the best number of clusters
## * 1 proposed 1 as the best number of clusters
## * 2 proposed 2 as the best number of clusters
## * 18 proposed 3 as the best number of clusters
## * 3 proposed 10 as the best number of clusters
##
## Conclusion
## =========================
## * Accoridng to the majority rule, the best number of clusters is 3 .
We’ll use the function eclust() [in factoextra] which provides several advantages as described in the previous chapter: Visual Enhancement of Clustering Analysis.
eclust() stands for enhanced clustering. It simplifies the workflow of clustering analysis and, it can be used to compute hierarchical clustering and partititioning clustering in a single line function call.
K-means and PAM clustering are described in this section. We’ll split the data into 3 clusters as follow:
# K-means clustering
km.res <- eclust(iris.scaled, "kmeans", k = 3,
nstart = 25, graph = FALSE)
# k-means group number of each observation
km.res$cluster
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 3 3 3 2 3 3 3 3 3 3 3 3 2 3 3 3 3
## [71] 2 3 3 3 3 2 2 2 3 3 3 3 3 3 3 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2
## [106] 2 3 2 2 2 2 2 2 3 3 2 2 2 2 3 2 3 2 3 2 2 3 2 2 2 2 2 2 3 3 2 2 2 3 2
## [141] 2 2 3 2 2 2 3 2 2 3
# Visualize k-means clusters
fviz_cluster(km.res, geom = "point", frame.type = "norm")
# PAM clustering
pam.res <- eclust(iris.scaled, "pam", k = 3, graph = FALSE)
pam.res$cluster
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 3 3 3 2 3 3 3 3 3 3 3 3 2 3 3 3 3
## [71] 3 3 3 3 3 2 2 2 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2
## [106] 2 3 2 2 2 2 2 2 3 2 2 2 2 2 3 2 3 2 3 2 2 3 3 2 2 2 2 2 3 3 2 2 2 3 2
## [141] 2 2 3 2 2 2 3 2 2 3
# Visualize pam clusters
fviz_cluster(pam.res, geom = "point", frame.type = "norm")
Read more about partitioning methods: Partitioning clustering
# Enhanced hierarchical clustering
res.hc <- eclust(iris.scaled, "hclust", k = 3,
method = "complete", graph = FALSE)
head(res.hc$cluster, 15)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# Dendrogram
fviz_dend(res.hc, rect = TRUE, show_labels = FALSE)
Read more about hierarchical clustering: Hierarchical clustering
In this section, we describe the most widely used clustering validation indices. Recall that the goal of clustering algorithms is to split the dataset into clusters of objects, such that:
That is, we want the average distance within cluster to be as small as possible; and the average distance between clusters to be as large as possible.
Internal validation measures reflect often the compactness, the connectedness and separation of the cluster partitions.
Compactness measures evaluate how close are the objects within the same cluster. A lower within-cluster variation is an indicator of a good compactness (i.e., a good clustering). The different indices for evaluating the compactness of clusters are base on distance measures such as the cluster-wise within average/median distances between observations.
Generally most of the indices used for internal clustering validation combine compactness and separation measures as follow:
\[ Index = \frac{(\alpha \times Separation)}{(\beta \times Compactness)} \]
Where \(\alpha\) and \(\beta\) are weights.
In this section, we’ll describe the two commonly used indices for assessing the goodness of clustering: silhouette width and Dunn index.
Recall that, more than 30 indices has been published in literature. They can be easily computed using the function NbClust which has been described in my previous article: Determining the optimal number of clusters.
Silhouette analysis measures how well an observation is clustered and it estimates the average distance between clusters. The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters.
For each observation \(i\), the silhouette width \(s_i\) is calculated as follows:
For all other clusters \(C\), to which i does not belong, calculate the average dissimilarity \(d(i, C)\) of \(i\) to all observations of C. The smallest of these \(d(i,C)\) is defined as \(b_i= \min_C d(i,C)\). The value of \(b_i\) can be seen as the dissimilarity between \(i\) and its “neighbor” cluster, i.e., the nearest one to which it does not belong.
Silhouette width can be interpreted as follow:
Observations with a large \(S_i\) (almost 1) are very well clustered
A small \(S_i\) (around 0) means that the observation lies between two clusters
The silhouette coefficient of observations can be computed using the function silhouette() [in cluster package]:
silhouette(x, dist, ...)
The function silhouette() returns an object, of class silhouette containing:
The R code below computes silhouette analysis and draw the result using R base plot:
# Silhouette coefficient of observations
library("cluster")
sil <- silhouette(km.res$cluster, dist(iris.scaled))
head(sil[, 1:3], 10)
## cluster neighbor sil_width
## [1,] 1 3 0.7341949
## [2,] 1 3 0.5682739
## [3,] 1 3 0.6775472
## [4,] 1 3 0.6205016
## [5,] 1 3 0.7284741
## [6,] 1 3 0.6098848
## [7,] 1 3 0.6983835
## [8,] 1 3 0.7308169
## [9,] 1 3 0.4882100
## [10,] 1 3 0.6315409
# Silhouette plot
plot(sil, main ="Silhouette plot - K-means")
Use factoextra for elegant data visualization:
library(factoextra)
fviz_silhouette(sil)
The summary of the silhouette analysis can be computed using the function summary.silhouette() as follow:
# Summary of silhouette analysis
si.sum <- summary(sil)
# Average silhouette width of each cluster
si.sum$clus.avg.widths
## 1 2 3
## 0.6363162 0.3473922 0.3933772
# The total average (mean of all individual silhouette widths)
si.sum$avg.width
## [1] 0.4599482
# The size of each clusters
si.sum$clus.sizes
## cl
## 1 2 3
## 50 47 53
Note that, if the clustering analysis is done using the function eclust(), cluster silhouettes are computed automatically and stored in the object silinfo. The results can be easily visualized as shown in the next sections.
It’s possible to draw silhouette plot using the function fviz_silhouette() [in factoextra package], which will also print a summary of the silhouette analysis output. To avoid this, you can use the option print.summary = FALSE.
# Default plot
fviz_silhouette(km.res)
## cluster size ave.sil.width
## 1 1 50 0.64
## 2 2 47 0.35
## 3 3 53 0.39
# Change the theme and color
fviz_silhouette(km.res, print.summary = FALSE) +
scale_fill_brewer(palette = "Dark2") +
scale_color_brewer(palette = "Dark2") +
theme_minimal()+
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
Silhouette information can be extracted as follow:
# Silhouette information
silinfo <- km.res$silinfo
names(silinfo)
## [1] "widths" "clus.avg.widths" "avg.width"
# Silhouette widths of each observation
head(silinfo$widths[, 1:3], 10)
## cluster neighbor sil_width
## 1 1 3 0.7341949
## 41 1 3 0.7333345
## 8 1 3 0.7308169
## 18 1 3 0.7287522
## 5 1 3 0.7284741
## 40 1 3 0.7247047
## 38 1 3 0.7244191
## 12 1 3 0.7217939
## 28 1 3 0.7215103
## 29 1 3 0.7145192
# Average silhouette width of each cluster
silinfo$clus.avg.widths
## [1] 0.6363162 0.3473922 0.3933772
# The total average (mean of all individual silhouette widths)
silinfo$avg.width
## [1] 0.4599482
# The size of each clusters
km.res$size
## [1] 50 47 53
fviz_silhouette(pam.res)
## cluster size ave.sil.width
## 1 1 50 0.63
## 2 2 45 0.35
## 3 3 55 0.38
fviz_silhouette(res.hc)
## cluster size ave.sil.width
## 1 1 49 0.75
## 2 2 75 0.37
## 3 3 26 0.51
It can be seen that several samples have a negative silhouette coefficient in the hierarchical clustering. This means that they are not in the right cluster.
We can find the name of these samples and determine the clusters they are closer (neighbor cluster), as follow:
# Silhouette width of observation
sil <- res.hc$silinfo$widths[, 1:3]
# Objects with negative silhouette
neg_sil_index <- which(sil[, 'sil_width'] < 0)
sil[neg_sil_index, , drop = FALSE]
## cluster neighbor sil_width
## 51 2 3 -0.02848264
## 148 2 3 -0.03799687
## 129 2 3 -0.09622863
## 111 2 3 -0.14461589
## 109 2 3 -0.14991556
## 133 2 3 -0.18730218
## 42 2 1 -0.39515010
Dunn index is another internal clustering validation measure which can be computed as follow:
Use the minimum of this pairwise distance as the inter-cluster separation (min.separation)
Use the maximal intra-cluster distance (i.e maximum diameter) as the intra-cluster compactness
Calculate Dunn index (D) as follow:
\[ D = \frac{min.separation}{max.diameter} \]
If the data set contains compact and well-separated clusters, the diameter of the clusters is expected to be small and the distance between the clusters is expected to be large. Thus, Dunn index should be maximized.
The function cluster.stats() [in fpc package] and the function NbClust() [in NbClust package] can be used to compute Dunn index and many other indices.
The function cluster.stats() is described in the next section.
In this section, we’ll describe the R function cluster.stats() [in fpc package] for computing a number of distance based statistics which can be used either for cluster validation, comparison between clustering and decision about the number of clusters.
The simplified format is:
cluster.stats(d = NULL, clustering, al.clustering = NULL)
The function cluster.stats() returns a list containing many components useful for analyzing the intrinsic characteristics of a clustering:
All the above elements can be used to evaluate the internal quality of clustering.
In the following sections, we’ll compute the clustering quality statistics for k-means, pam and hierarchical clustering. Look at the within.cluster.ss (within clusters sum of squares), the average.within (average distance within clusters) and clus.avg.silwidths (vector of cluster average silhouette widths).
library(fpc)
# Compute pairwise-distance matrices
dd <- dist(iris.scaled, method ="euclidean")
# Statistics for k-means clustering
km_stats <- cluster.stats(dd, km.res$cluster)
# (k-means) within clusters sum of squares
km_stats$within.cluster.ss
## [1] 138.8884
# (k-means) cluster average silhouette widths
km_stats$clus.avg.silwidths
## 1 2 3
## 0.6363162 0.3473922 0.3933772
# Display all statistics
km_stats
## $n
## [1] 150
##
## $cluster.number
## [1] 3
##
## $cluster.size
## [1] 50 47 53
##
## $min.cluster.size
## [1] 47
##
## $noisen
## [1] 0
##
## $diameter
## [1] 5.034198 3.343671 2.922371
##
## $average.distance
## [1] 1.175155 1.307716 1.197061
##
## $median.distance
## [1] 0.9884177 1.2383531 1.1559887
##
## $separation
## [1] 1.5533592 0.1333894 0.1333894
##
## $average.toother
## [1] 3.647912 3.081212 2.674298
##
## $separation.matrix
## [,1] [,2] [,3]
## [1,] 0.000000 2.4150235 1.5533592
## [2,] 2.415024 0.0000000 0.1333894
## [3,] 1.553359 0.1333894 0.0000000
##
## $ave.between.matrix
## [,1] [,2] [,3]
## [1,] 0.000000 4.129179 3.221129
## [2,] 4.129179 0.000000 2.092563
## [3,] 3.221129 2.092563 0.000000
##
## $average.between
## [1] 3.130708
##
## $average.within
## [1] 1.222246
##
## $n.between
## [1] 7491
##
## $n.within
## [1] 3684
##
## $max.diameter
## [1] 5.034198
##
## $min.separation
## [1] 0.1333894
##
## $within.cluster.ss
## [1] 138.8884
##
## $clus.avg.silwidths
## 1 2 3
## 0.6363162 0.3473922 0.3933772
##
## $avg.silwidth
## [1] 0.4599482
##
## $g2
## NULL
##
## $g3
## NULL
##
## $pearsongamma
## [1] 0.679696
##
## $dunn
## [1] 0.02649665
##
## $dunn2
## [1] 1.600166
##
## $entropy
## [1] 1.097412
##
## $wb.ratio
## [1] 0.3904057
##
## $ch
## [1] 241.9044
##
## $cwidegap
## [1] 1.3892251 0.9432249 0.7824508
##
## $widestgap
## [1] 1.389225
##
## $sindex
## [1] 0.3524812
##
## $corrected.rand
## NULL
##
## $vi
## NULL
Read the documentation of cluster.stats() for details about all the available indices.
The same statistics can be computed for pam clustering and hierarchical clustering.
# Statistics for pam clustering
pam_stats <- cluster.stats(dd, pam.res$cluster)
# (pam) within clusters sum of squares
pam_stats$within.cluster.ss
## [1] 140.2856
# (pam) cluster average silhouette widths
pam_stats$clus.avg.silwidths
## 1 2 3
## 0.6346397 0.3496332 0.3823817
# Statistics for hierarchical clustering
hc_stats <- cluster.stats(dd, res.hc$cluster)
# (HCLUST) within clusters sum of squares
hc_stats$within.cluster.ss
## [1] 152.7107
# (HCLUST) cluster average silhouette widths
hc_stats$clus.avg.silwidths
## 1 2 3
## 0.6688130 0.3154184 0.4488197
The aim is to compare the identified clusters (by k-means, pam or hierarchical clustering) to a reference.
To compare two cluster solutions, use the cluster.stats() function as follow:
res.stat <- cluster.stats(d, solution1$cluster, solution2$cluster)
Among the values returned by the function cluster.stats(), there are two indexes to assess the similarity of two clustering, namely the corrected Rand index and Meila’s VI.
We know that the iris data contains exactly 3 groups of species.
Does the K-means clustering matches with the true structure of the data?
We can use the function cluster.stats() to answer to this question.
A cross-tabulation can be computed as follow:
table(iris$Species, km.res$cluster)
##
## 1 2 3
## setosa 50 0 0
## versicolor 0 11 39
## virginica 0 36 14
It can be seen that:
It’s possible to quantify the agreement between Species and k-means clusters using either the corrected Rand index and Meila’s VI provided as follow:
library("fpc")
# Compute cluster stats
species <- as.numeric(iris$Species)
clust_stats <- cluster.stats(d = dist(iris.scaled),
species, km.res$cluster)
# Corrected Rand index
clust_stats$corrected.rand
## [1] 0.6201352
# VI
clust_stats$vi
## [1] 0.7477749
The corrected Rand index provides a measure for assessing the similarity between two partitions, adjusted for chance. Its range is -1 (no agreement) to 1 (perfect agreement). Agreement between the specie types and the cluster solution is 0.62 using Rand index and 0.748 using Meila’s VI
The same analysis can be computed for both pam and hierarchical clustering:
# Agreement between species and pam clusters
table(iris$Species, pam.res$cluster)
##
## 1 2 3
## setosa 50 0 0
## versicolor 0 9 41
## virginica 0 36 14
cluster.stats(d = dist(iris.scaled),
species, pam.res$cluster)$vi
## [1] 0.7129034
# Agreement between species and HC clusters
table(iris$Species, res.hc$cluster)
##
## 1 2 3
## setosa 49 1 0
## versicolor 0 50 0
## virginica 0 24 26
cluster.stats(d = dist(iris.scaled),
species, res.hc$cluster)$vi
## [1] 0.6097098
External clustering validation, can be used to select suitable clustering algorithm for a given dataset.
This analysis has been performed using R software (ver. 3.2.1)