A dissimilarity measure is a numerical measure of how different two data objects are.
A metric (dissimilarity measure) on a set \(X\) is a function
\[d: X \times X \rightarrow[0, \infty)\] where \([0,\infty)\) is the set of non-negative real numbers, and for all \(x,y,z\in X\) the following conditions are satisfied:

- \(d(x,y)\ge 0\), and \(d(x,y)=0\) if and only if \(x=y\) (identity of indiscernibles);
- \(d(x,y)=d(y,x)\) (symmetry);
- \(d(x,z)\le d(x,y)+d(y,z)\) (triangle inequality).

(Theodoridis and Koutroumbas 2008)
Definition:
The Minkowski distance of order \(p\) between two points \(X\) and \(Y\) is defined as \[ d(X,Y)=\bigg(\sum_{i=1}^{n}|x_i-y_i|^p\bigg)^{\frac{1}{p}} \] where
\[X=\left(x_{1}, x_{2}, \ldots, x_{n}\right) \text { and } Y=\left(y_{1}, y_{2}, \ldots, y_{n}\right) \in \mathbb{R}^{n}.\] For \(p\ge 1\), the Minkowski distance is a metric, but for \(p<1\) the triangle inequality is violated. For example, with \(p<1\) the distance between \((0,0)\) and \((1,1)\) is \(2^{1/p}>2\), while the distance from \((0,1)\) to each of those points is \(1\); since \(1+1<2^{1/p}\), the triangle inequality fails, and \(d\) is not a metric for \(p<1\).
Special Cases:

- \(p=1\): Manhattan (city-block) distance, \(d(X,Y)=\sum_{i=1}^{n}|x_i-y_i|\).
- \(p=2\): Euclidean distance, \(d(X,Y)=\sqrt{\sum_{i=1}^{n}(x_i-y_i)^2}\).
- \(p\rightarrow\infty\): Chebyshev distance, \(d(X,Y)=\max_{i}|x_i-y_i|\).
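To make the special cases and the \(p<1\) counterexample concrete, here is a minimal R sketch using base R's dist(), which supports the Minkowski distance through its p argument:

# points from the counterexample: a = (0,0), b = (1,1), c = (0,1)
pts <- rbind(a = c(0, 0), b = c(1, 1), c = c(0, 1))
dist(pts, method = "minkowski", p = 1)   # identical to method = "manhattan"
dist(pts, method = "minkowski", p = 2)   # identical to method = "euclidean"
# for p = 0.5: d(a,b) = (1 + 1)^(1/0.5) = 4, but d(a,c) + d(c,b) = 1 + 1 = 2 < 4,
# so the triangle inequality fails
dist(pts, method = "minkowski", p = 0.5)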
A similarity measure is a numerical measure of how alike two data objects are; it often falls in the range \([0,1]\).
A similarity measure \(s\) has the following properties:

- \(s(x,y)=1\) only if \(x=y\);
- \(s(x,y)=s(y,x)\) for all \(x,y\) (symmetry).

(Parilina 2020)
Popular methods for measuring similarity include:
(Cock et al. 2009)
The results obtained by these methods often fall in the range \([-1,1]\), so they must be rescaled to \([0,1]\).
The dissimilarity measure \(d\) corresponding to a similarity measure \(s\) is:
\[ d(x,y)=1-s(x,y) \]
The similarity measure \(s\) corresponding to a dissimilarity measure \(d\) is:
\[ s(x,y)=\frac{1}{1+d(x,y)} \]
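As a small illustration (a sketch with simulated data), a correlation can serve as a similarity, be rescaled from \([-1,1]\) to \([0,1]\), and then be converted with the formulas above:

set.seed(1)
x <- rnorm(20); y <- rnorm(20)
s_raw <- cor(x, y)      # correlation similarity, in [-1, 1]
s <- (s_raw + 1) / 2    # rescaled to [0, 1]
d <- 1 - s              # dissimilarity corresponding to the similarity s
s2 <- 1 / (1 + d)       # similarity corresponding to a dissimilarity d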
To read the “txt” file, I use the R function read.table().
read.table('../datasets/file01.txt', header = TRUE,
           na.strings = 'NA', row.names = 1, dec = '.') -> file01.data
# convert factor columns to numeric (via character)
lapply(file01.data, function(x) as.numeric(as.character(x))) -> file01.data[]
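A quick sanity check (a sketch) that the conversion produced numeric columns, and a count of the remaining missing values:

str(file01.data)             # all columns should now be numeric
colSums(is.na(file01.data))  # missing values per column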
The variables in this dataset are described as follows:
The table below shows the first five rows of the data:
 | Node | Inclination | Axis |
---|---|---|---|
1935RF | 130.916 | 4.659 | 2.2562 |
1941FD | 132.200 | 4.700 | 2.1300 |
1955QT | 130.070 | 4.790 | 2.1893 |
1940YL | 338.333 | 16.773 | 2.7465 |
1953NH | 339.625 | 16.067 | 2.7335 |
Hierarchical clustering does not require the number of clusters to be specified in advance. It can be subdivided into two types:
Agglomerative clustering
# Compute hierarchical clustering
res.hc <- file01.data %>%
  na.omit() %>%                   # omit NA values
  scale() %>%                     # scale the data
  dist(method = "euclidean") %>%  # compute the dissimilarity matrix
  hclust(method = "ward.D2")      # compute hierarchical clustering
# Visualize using factoextra
# Cut into 4 groups and color by group
fviz_dend(res.hc, k = 4,            # cut into four groups
          cex = 0.5,                # label size
          k_colors = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),
          color_labels_by_k = TRUE, # color labels by group
          rect = TRUE,              # add rectangles around groups
          rect_fill = TRUE,
          repel = TRUE)
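If the group membership itself is needed rather than the dendrogram, the tree can be cut with cutree() (a sketch using the res.hc object above):

grp <- cutree(res.hc, k = 4)  # cluster assignment for each observation
table(grp)                    # cluster sizes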
Divisive clustering
res.diana <- diana(file01.data, stand = TRUE)
fviz_dend(res.diana, cex = 0.5,
k = 4, # Cut in four groups
palette = "jco" # Color palette
)
To determine the optimal number of clusters, I use the NbClust() function from the NbClust package.
manhattan distance
# library("NbClust")
res.nbclust <- file01.data %>%
scale() %>%
NbClust(distance = "manhattan",
min.nc = 2, max.nc = 10,
method = "kmeans", index ="all")
## Among all indices:
## ===================
## * 2 proposed 0 as the best number of clusters
## * 7 proposed 2 as the best number of clusters
## * 6 proposed 3 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
## * 2 proposed 9 as the best number of clusters
## * 7 proposed 10 as the best number of clusters
##
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is 2 .
euclidean distance
library("NbClust")
res.nbclust <- file01.data %>%
scale() %>%
NbClust(distance = "euclidean",
min.nc = 2, max.nc = 10,
method = "kmeans", index ="all")
## Among all indices:
## ===================
## * 2 proposed 0 as the best number of clusters
## * 6 proposed 2 as the best number of clusters
## * 6 proposed 3 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
## * 2 proposed 9 as the best number of clusters
## * 8 proposed 10 as the best number of clusters
##
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is 10 .
From the results above, the optimal numbers of clusters under the euclidean and manhattan distances are 10 and 2, respectively.
Since the kmeans() function in R uses the euclidean metric, I set \(\text{centers}=10\).
kmeans(file01.data, centers = 10,nstart = 25) %>%
fviz_cluster(., data = file01.data,
ellipse.type = "convex",
palette = "jco",
ggtheme = theme_minimal())+
labs(subtitle = "K-means algorithm ")
The k-medoids algorithm is a clustering approach related to k-means for partitioning a data set into k groups or clusters. Unlike kmeans(), the pam() function in R allows the metric to be specified (“euclidean” or “manhattan”), so I use it to try both distance measures.
# Compute PAM
file01.data %>%
pam(., 10,metric = "euclidean") %>%
# Visualize
fviz_cluster()+
labs(subtitle = "PAM algorithm with euclidean metric k=10 ")
# Compute PAM
file01.data %>%
pam(., 2,metric = "manhattan") %>%
# Visualize
fviz_cluster()+
labs(subtitle = "PAM algorithm with manhattan metric k=2 ")
To read the “txt” file, I use the R function read.table().
read.table('../datasets/file02.txt', header = TRUE,
           dec = '.', na.strings = 'NA', row.names = 1) -> file02.data
# convert factor columns to numeric (via character)
lapply(file02.data, function(x) as.numeric(as.character(x))) -> file02.data[]
The variables in this dataset are described as follows:
The table below shows the first five rows of the data:
 | Water (%) | Protein (%) | Fat (%) | Lactose (%) |
---|---|---|---|---|
Horse | 90.1 | 2.6 | 1.0 | 6.9 |
Donkey | 90.3 | 1.7 | 1.4 | 6.2 |
Mule | 90.0 | 2.0 | 1.8 | 5.5 |
Camel | 87.7 | 3.5 | 3.4 | 4.8 |
Llama | 86.5 | 3.9 | 3.2 | 5.6 |
Agglomerative clustering
# Compute hierarchical clustering
res.hc <- file02.data %>%
na.omit() %>% # omit NA value
scale() %>% # Scale the data
dist(method = "euclidean") %>% # Compute dissimilarity matrix
hclust(method = "ward.D2") # Compute hierachical clustering
# Visualize using factoextra
# Cut in 4 groups and color by groups
fviz_dend(res.hc, k=4,# Cut in four groups
cex = 0.5,# label size
k_colors = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),
color_labels_by_k = TRUE, # color labels by groups
rect = TRUE # Add rectangle around groups
)
Divisive clustering
res.diana <- diana(file02.data, stand = TRUE)
fviz_dend(res.diana, cex = 0.5,
k = 4, # Cut in four groups
palette = "jco" # Color palette
)
To determine the optimal number of clusters, I use the NbClust() function from the NbClust package.
manhattan distance
library("NbClust")
res.nbclust <- file02.data %>%
scale() %>%
NbClust(distance = "manhattan",
min.nc = 2, max.nc = 10,
method = "kmeans", index ="all")
## Among all indices:
## ===================
## * 2 proposed 0 as the best number of clusters
## * 7 proposed 2 as the best number of clusters
## * 6 proposed 3 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 3 proposed 7 as the best number of clusters
## * 3 proposed 9 as the best number of clusters
## * 4 proposed 10 as the best number of clusters
##
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is 2 .
euclidean distance
library("NbClust")
res.nbclust <- file02.data %>%
scale() %>%
NbClust(distance = "euclidean",
min.nc = 2, max.nc = 10,
method = "kmeans", index ="all")
## Among all indices:
## ===================
## * 2 proposed 0 as the best number of clusters
## * 7 proposed 2 as the best number of clusters
## * 6 proposed 3 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 3 proposed 7 as the best number of clusters
## * 3 proposed 9 as the best number of clusters
## * 4 proposed 10 as the best number of clusters
##
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is 2 .
From the results above, the optimal number of clusters is 2 under both the euclidean and manhattan distances.
kmeans(file02.data, centers = 2,nstart = 25) %>%
fviz_cluster(., data = file02.data,
ellipse.type = "convex",
palette = "jco",
ggtheme = theme_minimal())+
labs(subtitle = "K-means algorithm ")
# Compute PAM
file02.data %>%
pam(., 2,metric = "euclidean") %>%
# Visualize
fviz_cluster()+
labs(subtitle = "PAM algorithm with euclidean metric k=2 ")
# Compute PAM
file02.data %>%
pam(., 2,metric = "manhattan") %>%
# Visualize
fviz_cluster()+
labs(subtitle = "PAM algorithm with manhattan metric k=2 ")
To read the “txt” file, I use the R function read.table().
read.table('../datasets/file07.txt', header = TRUE,
           dec = '.', na.strings = 'NA') -> file07.data
# build a unique row label for each observation: Country-Year
row.names(file07.data) <- paste(file07.data$Country, file07.data$Year, sep = '-')
# drop the Country and Year columns
file07.data[-c(1, 2)] -> file07.data
# convert factor columns to numeric (via character)
lapply(file07.data, function(x) as.numeric(as.character(x))) -> file07.data[]
The variables in this dataset are described as follows:
The table below shows the first five rows of the data:
 | M0 | M25 | M50 | M75 | F0 | F25 | F50 | F75 |
---|---|---|---|---|---|---|---|---|
Algeria-1965 | 63 | 51 | 30 | 13 | 67 | 54 | 34 | 15 |
Cameroon-1964 | 34 | 29 | 13 | 5 | 38 | 32 | 17 | 6 |
Madagascar-1966 | 38 | 30 | 17 | 7 | 38 | 34 | 20 | 7 |
Mauritius-1966 | 59 | 42 | 20 | 6 | 64 | 46 | 25 | 8 |
Reunion-1963 | 56 | 38 | 18 | 7 | 62 | 46 | 25 | 10 |
Agglomerative clustering
# Compute hierarchical clustering
res.hc <- file07.data %>%
na.omit() %>% # omit NA value
scale() %>% # Scale the data
dist(method = "euclidean") %>% # Compute dissimilarity matrix
hclust(method = "ward.D2") # Compute hierachical clustering
# Visualize using factoextra
# Cut in eight groups and color by groups
fviz_dend(res.hc, k=8,# Cut in eight groups
cex = 0.5,# label size
color_labels_by_k = TRUE, # color labels by groups
rect = TRUE # Add rectangle around groups
)
Divisive clustering
res.diana <- diana(file07.data, stand = TRUE)
fviz_dend(res.diana, cex = 0.5,
k = 8, # Cut in eight groups
palette = "jco" # Color palette
)
To determine the optimal number of clusters, I use the NbClust() function from the NbClust package.
manhattan distance
library("NbClust")
res.nbclust <- file07.data %>%
scale() %>%
NbClust(distance = "manhattan",
min.nc = 2, max.nc = 15,
method = "kmeans", index ="all")
## Among all indices:
## ===================
## * 2 proposed 0 as the best number of clusters
## * 4 proposed 2 as the best number of clusters
## * 6 proposed 3 as the best number of clusters
## * 5 proposed 4 as the best number of clusters
## * 1 proposed 6 as the best number of clusters
## * 1 proposed 9 as the best number of clusters
## * 1 proposed 10 as the best number of clusters
## * 2 proposed 11 as the best number of clusters
## * 3 proposed 13 as the best number of clusters
## * 1 proposed 15 as the best number of clusters
##
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is 3 .
euclidean distance
library("NbClust")
res.nbclust <- file07.data %>%
scale() %>%
NbClust(distance = "euclidean",
min.nc = 2, max.nc = 15,
method = "kmeans", index ="all")
## Among all indices:
## ===================
## * 2 proposed 0 as the best number of clusters
## * 3 proposed 2 as the best number of clusters
## * 6 proposed 3 as the best number of clusters
## * 5 proposed 4 as the best number of clusters
## * 1 proposed 6 as the best number of clusters
## * 1 proposed 9 as the best number of clusters
## * 1 proposed 10 as the best number of clusters
## * 1 proposed 11 as the best number of clusters
## * 5 proposed 13 as the best number of clusters
## * 1 proposed 15 as the best number of clusters
##
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is 3 .
From the results above, the optimal number of clusters is 3 under both the euclidean and manhattan distances.
kmeans(file07.data, centers = 3,nstart = 25) %>%
fviz_cluster(., data = file07.data,
ellipse.type = "convex",
palette = "jco",
ggtheme = theme_minimal())+
labs(subtitle = "K-means algorithm ")
# Compute PAM
file07.data %>%
pam(., 3,metric = "euclidean") %>%
# Visualize
fviz_cluster()+
labs(subtitle = "PAM algorithm with euclidean metric k=3 ")
# Compute PAM
file07.data %>%
pam(., 3,metric = "manhattan") %>%
# Visualize
fviz_cluster()+
labs(subtitle = "PAM algorithm with manhattan metric k=3 ")
To read the “.data” file, I use the R function read.table().
read.table('../datasets/iris.data',header = FALSE,
dec = '.',na.strings = 'NA',sep = ',') -> iris.data
names(iris.data)<-c('Sepal Length','Sepal Width','Petal Length','Petal Width','Class')
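A quick check of the class distribution (a sketch):

table(iris.data$Class)  # 50 observations of each of the three species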
For the iris dataset, the variables are described as follows:
The table below shows the first five rows of the data:
Sepal Length | Sepal Width | Petal Length | Petal Width | Class |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
Agglomerative clustering
# Compute hierarchical clustering
res.hc <- iris.data[1:4] %>%
na.omit() %>% # omit NA value
scale() %>% # Scale the data
dist(method = "euclidean") %>% # Compute dissimilarity matrix
hclust(method = "ward.D2") # Compute hierachical clustering
# Visualize using factoextra
# Cut in 3 groups and color by groups
fviz_dend(res.hc, k=3,# Cut in three groups
cex = 0.5,# label size
color_labels_by_k = TRUE, # color labels by groups
rect = TRUE # Add rectangle around groups
)
Divisive clustering
res.diana <- diana(iris.data[1:4], stand = TRUE)
fviz_dend(res.diana, cex = 0.5,
k = 3, # Cut in three groups
palette = "jco" # Color palette
)
To determine the optimal number of clusters, I use the NbClust() function from the NbClust package.
manhattan distance
library("NbClust")
res.nbclust <- iris.data[1:4] %>%
scale() %>%
NbClust(distance = "manhattan",
min.nc = 2, max.nc = 10,
method = "kmeans", index ="all")
## Among all indices:
## ===================
## * 2 proposed 0 as the best number of clusters
## * 12 proposed 2 as the best number of clusters
## * 9 proposed 3 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 1 proposed 6 as the best number of clusters
## * 1 proposed 9 as the best number of clusters
##
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is 2 .
euclidean distance
library("NbClust")
res.nbclust <- iris.data[1:4] %>%
scale() %>%
NbClust(distance = "euclidean",
min.nc = 2, max.nc = 10,
method = "kmeans", index ="all")
## Among all indices:
## ===================
## * 2 proposed 0 as the best number of clusters
## * 11 proposed 2 as the best number of clusters
## * 10 proposed 3 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 1 proposed 6 as the best number of clusters
## * 1 proposed 9 as the best number of clusters
##
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is 2 .
From the results above, the optimal number of clusters is 2 under both the euclidean and manhattan distances.
kmeans(iris.data[1:4], centers = 2,nstart = 25) %>%
fviz_cluster(., data = iris.data[1:4],
ellipse.type = "convex",
palette = "jco",
ggtheme = theme_minimal())+
labs(subtitle = "K-means algorithm ")
In fact, we know this dataset contains three species of iris, so let us also try \(k=3\).
kmeans(iris.data[1:4], centers = 3,nstart = 25) ->iris.kmeans.res
fviz_cluster(iris.kmeans.res, data = iris.data[1:4],
ellipse.type = "convex",
palette = "jco",
ggtheme = theme_minimal())+
labs(subtitle = "K-means algorithm ")
# Compute PAM
iris.data[1:4] %>%
pam(., 2,metric = "euclidean") %>%
# Visualize
fviz_cluster()+
labs(subtitle = "PAM algorithm with euclidean metric k=2 ")
# Compute PAM
iris.data[1:4] %>%
pam(., 3,metric = "euclidean") %>%
# Visualize
fviz_cluster()+
labs(subtitle = "PAM algorithm with euclidean metric k=3 ")
# Compute PAM
iris.data %>%
pam(., 2,metric = "manhattan") %>%
# Visualize
fviz_cluster()+
labs(subtitle = "PAM algorithm with manhattan metric k=2 ")
# Compute PAM
iris.data %>%
pam(., 3,metric = "manhattan") -> iris.pam.res
# Visualize
fviz_cluster(iris.pam.res)+
labs(subtitle = "PAM algorithm with manhattan metric k=3 ")
It is not difficult to see that the PAM algorithm with the manhattan metric (\(k=3\)) performs slightly better than the k-means algorithm.
To visualize the results, I use the R function ggpairs() from the GGally package. The visualization of the PAM result with the manhattan metric (\(k=3\)) is as follows.
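A call along the following lines (a sketch; the colouring by PAM cluster is an assumption) produces such a matrix of pairwise plots:

library(GGally)
library(ggplot2)
plot.data <- iris.data[1:4]
names(plot.data) <- make.names(names(plot.data))      # syntactic column names for plotting
plot.data$cluster <- factor(iris.pam.res$clustering)  # PAM cluster labels
ggpairs(plot.data, columns = 1:4, mapping = aes(colour = cluster))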
The panels on the diagonal show the density plots of the three iris groups for the corresponding variables.
Confusion matrix of the PAM algorithm with the manhattan metric (\(k=3\)):
##
## Iris-setosa Iris-versicolor Iris-virginica
## 1 50 0 0
## 2 0 49 1
## 3 0 1 49
\[ ACC=\frac{148}{150}\approx 0.987 \]
Confusion matrix of the k-means algorithm with the euclidean metric (\(k=3\)):
##
## Iris-setosa Iris-versicolor Iris-virginica
## 1 0 48 14
## 2 50 0 0
## 3 0 2 36
\[ ACC=\frac{134}{150}\approx 0.893 \]
Obviously, for \(k=3\) the accuracy of the PAM algorithm with the manhattan metric is much higher than that of the k-means algorithm with the euclidean metric.
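For reference, the confusion matrices and accuracies above can be reproduced as follows (a sketch using the iris.pam.res and iris.kmeans.res objects fitted earlier; each cluster is matched to its majority class):

tab.pam    <- table(iris.pam.res$clustering, iris.data$Class)
tab.kmeans <- table(iris.kmeans.res$cluster, iris.data$Class)
tab.pam
tab.kmeans
# accuracy = correctly matched observations / total observations
sum(apply(tab.pam, 1, max)) / nrow(iris.data)     # 148/150
sum(apply(tab.kmeans, 1, max)) / nrow(iris.data)  # 134/150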
Thanks to knitr (Xie 2015).
Thanks also to the websites (DATANOVA 2019) and (Jeff Chang 2019) for providing a lot of helpful information.
Cock, Peter J. A., Tiago Antao, Jeffrey T. Chang, Brad A. Chapman, Cymon J. Cox, Andrew Dalke, Iddo Friedberg, et al. 2009. “Biopython: freely available Python tools for computational molecular biology and bioinformatics.” Bioinformatics 25 (11): 1422–3. https://doi.org/10.1093/bioinformatics/btp163.
DATANOVA. 2019. “TYPES of Clustering Methods: OVERVIEW and Quick Start R Code.” https://www.datanovia.com/en/blog/types-of-clustering-methods-overview-and-quick-start-r-code/.
Jeff Chang, Iddo Friedberg, Brad Chapman. 2019. “Biopython Tutorial and Cookbook.” http://biopython.org/DIST/docs/tutorial/Tutorial.html.
Parilina, Elena. 2020. “Lecture 7. Cluster Analysis. Applied Statistics in R.” Saint Petersburg State University; Lecture.
Theodoridis, Sergios, and Konstantinos Koutroumbas. 2008. Pattern Recognition. 4th ed. USA: Academic Press, Inc.
Xie, Yihui. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman & Hall/CRC. http://yihui.name/knitr/.