Measures

Metric (Dissimilarity measure)

Definition of a metric

A dissimilarity measure is a numerical measure of how different two data objects are.

A metric (dissimilarity measure) on a set \(X\) is a function:

\[d: X \times X \rightarrow[0, \infty)\] where \([0,\infty )\) is the set of non-negative real numbers and for all \(x,y,z\in X\), the following conditions are satisfied:

  1. non-negativity: \(d(x,y)\ge 0\);
  2. identity: \(d(x, y)=0 \Leftrightarrow x=y\);
  3. symmetry: \(d(x,y)=d(y,x)\);
  4. triangle inequality: \(d(x, y) \leq d(x, z)+d(z, y)\).

(Theodoridis and Koutroumbas 2008)

Minkowski distance

Definition:

The Minkowski distance of order \(p\) between two points \(X\) and \(Y\) is defined as \[ d(X,Y)=\bigg(\sum_{i=1}^{n}|x_i-y_i|^p\bigg)^{\frac{1}{p}} \] where:

\[X=\left(x_{1}, x_{2}, \ldots, x_{n}\right) \text { and } Y=\left(y_{1}, y_{2}, \ldots, y_{n}\right) \in \mathbb{R}^{n}\] For \(p\ge 1\), the Minkowski distance is a metric, but the triangle inequality is violated when \(p<1\). For example, when \(p<1\) the distance between \((0,0)\) and \((1,1)\) is \(2^{1/p}>2\), while the distance from \((0,1)\) to each of those points is \(1\). Since this violates the triangle inequality, \(d\) is not a metric for \(p<1\).

Special cases (illustrated in the R sketch after this list):

  1. Manhattan (city block) distance: \(p=1\);
  2. Euclidean distance: \(p=2\);
  3. Chebyshev distance: \(p\rightarrow \infty\).
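As a minimal sketch (not part of the original report), the built-in dist() function can compute Minkowski distances; the two points and the order \(p=3\) below are purely illustrative.

x <- c(1, 2, 3)
y <- c(4, 0, 3)
# Minkowski distance of an arbitrary order p = 3
dist(rbind(x, y), method = "minkowski", p = 3)
# The three special cases
dist(rbind(x, y), method = "manhattan")  # p = 1
dist(rbind(x, y), method = "euclidean")  # p = 2
dist(rbind(x, y), method = "maximum")    # p -> infinity (Chebyshev)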

Similarity measure

A similarity measure is a numerical measure of how alike two data objects are. It often falls in the range \([0,1]\).

A similarity measure \(s\) has the following properties:

  1. \(0\le s(x,y)\le 1\);
  2. \(s(x,y)=1,\ \text{iff}\ x=y\);
  3. Symmetry: for all \(x\) and \(y\) \(s(x,y)=s(y,x)\).

(Parilina 2020)

Popular methods for measuring similarity include:

  1. Pearson correlation coefficient;
  2. Uncentered correlation (cosine of the angle);
  3. Spearman rank correlation;
  4. Kendall’s \(\tau\);

(Cock et al. 2009)

The results obtained by the above methods often fall in the range \([-1,1]\); we have to rescale them to \([0,1]\).

Relationship between \(s\) and \(d\)

The distance \(d\) corresponding to a similarity measure \(s\) is:

\[ d(x,y)=1-s(x,y) \]

The similarity \(s\) corresponding to a dissimilarity measure \(d\) is:

\[ s(x,y)=\frac{1}{1+d(x,y)} \]
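As a hedged illustration (not from the original analysis), a Pearson correlation in \([-1,1]\) can be rescaled to \([0,1]\) and then converted with the formulas above; the vectors and the linear rescaling \((r+1)/2\) are just one common choice.

# Two illustrative numeric vectors
x <- c(1.2, 3.4, 2.2, 5.0)
y <- c(0.9, 3.1, 2.5, 4.8)

r <- cor(x, y, method = "pearson")  # falls in [-1, 1]
s <- (r + 1) / 2                    # rescaled similarity in [0, 1]
d <- 1 - s                          # distance corresponding to s
s_from_d <- 1 / (1 + d)             # similarity corresponding to a dissimilarity d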

Load Packages
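The original chunk is not shown here; a plausible set of library() calls, based on the functions used later in this report, is:

library(NbClust)  # NbClust(): optimal number of clusters
library(cluster)  # pam(), diana(): k-medoids and divisive clustering
library(GGally)   # ggpairs(): pairwise plots
library(ggplot2)  # graphics underlying GGally
library(knitr)    # report generation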

Data “file01.txt”

Read Data

To read “.txt” files, I use the R function read.table().
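A minimal sketch of the call, assuming the file is in the working directory, has a header row, and keeps the planet names in the first column as row names:

file01 <- read.table("file01.txt", header = TRUE, row.names = 1)
head(file01, 5)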

The variables are described as follows:

  1. Name: the year of sighting and the initials of the astronomer.
  2. Node: the angle, in the earth’s plane of orbit, at which the minor planet crosses the earth’s orbit.
  3. Inclination: the angle between the orbits of the earth and the minor planet.
  4. Axis: the maximum distance between the minor planet and the sun, divided by the corresponding quantity for the earth.

The table below shows the first five rows of the data:

        Node     Inclination  Axis
1935RF  130.916  4.659        2.2562
1941FD  132.200  4.700        2.1300
1955QT  130.070  4.790        2.1893
1940YL  338.333  16.773       2.7465
1953NH  339.625  16.067       2.7335

Hierarchical clustering

Hierarchical clustering does not require the number of clusters to be pre-specified. It can be subdivided into two types:

  1. Agglomerative clustering, in which each observation is initially considered a cluster of its own (a leaf). The most similar clusters are then successively merged until there is just one single big cluster (the root).
  2. Divisive clustering, the inverse of agglomerative clustering, begins with the root, in which all objects are included in one cluster. The most heterogeneous clusters are then successively divided until every observation is in its own cluster.

Agglomerative clustering
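The original chunk is not reproduced here; a minimal sketch with hclust(), assuming the file01 data from above, scaled variables and Ward linkage (all assumptions), is:

d01 <- dist(scale(file01), method = "euclidean")  # dissimilarity matrix
hc_agg <- hclust(d01, method = "ward.D2")         # agglomerative clustering
plot(hc_agg, main = "Agglomerative clustering (file01)")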

Divisive clustering
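A corresponding sketch for divisive clustering uses diana() from the cluster package; again this is an illustration under the same assumptions rather than the original code.

library(cluster)
hc_div <- diana(scale(file01), metric = "euclidean")  # divisive clustering
plot(hc_div, which.plots = 2, main = "Divisive clustering (file01)")  # dendrogram only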

Determining the optimal number of clusters

To determine the optimal number of clusters, I use the function NbClust() from the package NbClust.
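A sketch of the call; the scaling, the linkage method and the 2–10 cluster range are assumptions, not taken from the original chunk.

library(NbClust)
res_man <- NbClust(scale(file01), distance = "manhattan",
                   min.nc = 2, max.nc = 10, method = "ward.D2", index = "all")
res_euc <- NbClust(scale(file01), distance = "euclidean",
                   min.nc = 2, max.nc = 10, method = "ward.D2", index = "all")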

manhattan distance

## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 7 proposed  2 as the best number of clusters
## * 6 proposed  3 as the best number of clusters
## * 1 proposed  7 as the best number of clusters
## * 1 proposed  8 as the best number of clusters
## * 2 proposed  9 as the best number of clusters
## * 7 proposed  10 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  2 .

euclidean distance

## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 6 proposed  2 as the best number of clusters
## * 6 proposed  3 as the best number of clusters
## * 1 proposed  7 as the best number of clusters
## * 1 proposed  8 as the best number of clusters
## * 2 proposed  9 as the best number of clusters
## * 8 proposed  10 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  10 .

From the results above, the optimal numbers of clusters under the euclidean and manhattan metrics are 10 and 2, respectively.

PAM (k-medoids) algorithm

The k-medoids algorithm is a clustering approach related to k-means for partitioning a data set into k groups or clusters. However, the pam() function in R allows the metric to be specified as either “euclidean” or “manhattan”, so I use this function to try different distance methods.
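A minimal sketch of the calls; the number of clusters \(k=2\) (suggested by the NbClust result for the manhattan metric) and the scaling step are assumptions.

library(cluster)
pam_man <- pam(scale(file01), k = 2, metric = "manhattan")
pam_euc <- pam(scale(file01), k = 2, metric = "euclidean")
pam_man$medoids            # representative objects (medoids) of each cluster
table(pam_man$clustering)  # cluster sizes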

Data “file02.txt”

Read Data

To read “.txt” files, I use the R function read.table().

The variables are described as follows:

  1. Name: animal.
  2. Water: the percentage of water.
  3. Protein: the percentage of protein.
  4. Fat: the percentage of fat.
  5. Lactose: the percentage of lactose.

The table below shows the first five rows of the data:

        Water (%)  Protein (%)  Fat (%)  Lactose (%)
Horse   90.1       2.6          1.0      6.9
Donkey  90.3       1.7          1.4      6.2
Mule    90.0       2.0          1.8      5.5
Camel   87.7       3.5          3.4      4.8
Llama   86.5       3.9          3.2      5.6

Determining the optimal number of clusters

To determine the optimal number of clusters, I use the function NbClust() from the package NbClust.

manhattan distance

## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 7 proposed  2 as the best number of clusters
## * 6 proposed  3 as the best number of clusters
## * 1 proposed  4 as the best number of clusters
## * 3 proposed  7 as the best number of clusters
## * 3 proposed  9 as the best number of clusters
## * 4 proposed  10 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  2 .

euclidean distance

## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 7 proposed  2 as the best number of clusters
## * 6 proposed  3 as the best number of clusters
## * 1 proposed  4 as the best number of clusters
## * 3 proposed  7 as the best number of clusters
## * 3 proposed  9 as the best number of clusters
## * 4 proposed  10 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  2 .

From the results above, the optimal number of clusters under both the euclidean and manhattan metrics is 2.

Data “file07.txt”

Read Data

To read “.txt” files, I use the R function read.table().

The variables are described as follows:

  1. Country: the name of the country.
  2. Year: the year in which the data was computed.
  3. M0, M25, M50 and M75: the remaining life expectancies for a male of ages 0, 25, 50 and 75.
  4. F0, F25, F50 and F75: the remaining life expectancies for a female of ages 0, 25, 50 and 75.

The table below shows the first five rows of the data:

                 M0  M25  M50  M75  F0  F25  F50  F75
Algeria-1965     63   51   30   13  67   54   34   15
Cameroon-1964    34   29   13    5  38   32   17    6
Madagascar-1966  38   30   17    7  38   34   20    7
Mauritius-1966   59   42   20    6  64   46   25    8
Reunion-1963     56   38   18    7  62   46   25   10

Determining the optimal number of clusters

To determine the optimal number of clusters, I use the function NbClust() from the package NbClust.

manhattan distance

## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 4 proposed  2 as the best number of clusters
## * 6 proposed  3 as the best number of clusters
## * 5 proposed  4 as the best number of clusters
## * 1 proposed  6 as the best number of clusters
## * 1 proposed  9 as the best number of clusters
## * 1 proposed  10 as the best number of clusters
## * 2 proposed  11 as the best number of clusters
## * 3 proposed  13 as the best number of clusters
## * 1 proposed  15 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  3 .

euclidean distance

## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 3 proposed  2 as the best number of clusters
## * 6 proposed  3 as the best number of clusters
## * 5 proposed  4 as the best number of clusters
## * 1 proposed  6 as the best number of clusters
## * 1 proposed  9 as the best number of clusters
## * 1 proposed  10 as the best number of clusters
## * 1 proposed  11 as the best number of clusters
## * 5 proposed  13 as the best number of clusters
## * 1 proposed  15 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  3 .

From the results above, the optimal number of clusters under both the euclidean and manhattan metrics is 3.

Data “iris.data”

Read Data

To read “.data” files, I use the R function read.table().
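The UCI iris.data file is comma-separated and has no header row, so a sketch of the call (the column names are supplied manually and are an assumption about the original chunk) is:

iris_df <- read.table("iris.data", sep = ",",
                      col.names = c("Sepal.Length", "Sepal.Width",
                                    "Petal.Length", "Petal.Width", "Class"))
head(iris_df, 5)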

For the iris dataset, the variables are as follows:

  1. sepal length in cm
  2. sepal width in cm
  3. petal length in cm
  4. petal width in cm
  5. class

The table below shows the first five rows of the data:

Sepal Length  Sepal Width  Petal Length  Petal Width  Class
5.1           3.5          1.4           0.2          Iris-setosa
4.9           3.0          1.4           0.2          Iris-setosa
4.7           3.2          1.3           0.2          Iris-setosa
4.6           3.1          1.5           0.2          Iris-setosa
5.0           3.6          1.4           0.2          Iris-setosa

Determining the optimal number of clusters

To determine the optimal number of clusters, I use the function NbClust() from the package NbClust.

manhattan distance

## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 12 proposed  2 as the best number of clusters
## * 9 proposed  3 as the best number of clusters
## * 1 proposed  4 as the best number of clusters
## * 1 proposed  6 as the best number of clusters
## * 1 proposed  9 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  2 .

euclidean distance

## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 11 proposed  2 as the best number of clusters
## * 10 proposed  3 as the best number of clusters
## * 1 proposed  4 as the best number of clusters
## * 1 proposed  6 as the best number of clusters
## * 1 proposed  9 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  2 .

From the results above, the optimal number of clusters under both the euclidean and manhattan metrics is 2.

PAM (k-medoids) algorithm

It is not difficult to see that the performance of the PAM algorithm with the manhattan metric (\(k=3\)) is slightly better than that of the kmeans algorithm.
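A sketch of how the two fits could be obtained; the use of the four numeric columns, the object names and the random seed are assumptions.

library(cluster)
X <- iris_df[, 1:4]                             # numeric variables only
pam_fit <- pam(X, k = 3, metric = "manhattan")  # k-medoids with manhattan metric
set.seed(1)                                     # kmeans uses random starting centers
km_fit <- kmeans(X, centers = 3, nstart = 25)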

Visualizing Data

To visualize the results, I use the R function ggpairs() from the package GGally (which builds on ggplot2). The result of the PAM algorithm with the manhattan metric (\(k=3\)) is visualized as follows.
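A sketch of the plotting call, assuming the PAM cluster labels are attached as a factor column; the object names follow the earlier sketches and are assumptions.

library(GGally)
library(ggplot2)
plot_df <- cbind(X, Cluster = factor(pam_fit$clustering))
ggpairs(plot_df, columns = 1:4, mapping = aes(colour = Cluster))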

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The elements on the diagonal are the density plots of the three iris classes for the corresponding variables.

Construct the confusion matrix to evaluate the clustering results.
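A sketch of how such a table and the accuracy can be computed; the object names follow the earlier sketches and are assumptions.

conf_pam <- table(pam_fit$clustering, iris_df$Class)
conf_km  <- table(km_fit$cluster, iris_df$Class)
conf_pam
# accuracy: each cluster is labelled with its majority class, and the
# correctly clustered objects are divided by the total
sum(apply(conf_pam, 1, max)) / sum(conf_pam)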

confusion matrix of PAM algorithm with the manhattan metric (\(k=3\))

##    
##     Iris-setosa Iris-versicolor Iris-virginica
##   1          50               0              0
##   2           0              49              1
##   3           0               1             49

\[ ACC=\frac{148}{150}\approx 0.987 \]

confusion matrix of kmeans algorithm with the euclidean metric (\(k=3\))

##    
##     Iris-setosa Iris-versicolor Iris-virginica
##   1           0              48             14
##   2          50               0              0
##   3           0               2             36

\[ ACC=\frac{134}{150}\approx 0.893 \]

Clearly, in the case where \(k=3\), the accuracy of the PAM algorithm with the manhattan metric (148/150) is much higher than that of the kmeans algorithm with the euclidean metric (134/150).

Acknowledgements

Thanks to knitr, designed by (Xie 2015).

Thanks also to the website (DATANOVA 2019) and to (Jeff Chang 2019) for providing a great deal of helpful information.

References

Cock, Peter J. A., Tiago Antao, Jeffrey T. Chang, Brad A. Chapman, Cymon J. Cox, Andrew Dalke, Iddo Friedberg, et al. 2009. “Biopython: freely available Python tools for computational molecular biology and bioinformatics.” Bioinformatics 25 (11): 1422–3. https://doi.org/10.1093/bioinformatics/btp163.

DATANOVA. 2019. “TYPES of Clustering Methods: OVERVIEW and Quick Start R Code.” https://www.datanovia.com/en/blog/types-of-clustering-methods-overview-and-quick-start-r-code/.

Jeff Chang, Iddo Friedberg, Brad Chapman. 2019. “Biopython Tutorial and Cookbook.” http://biopython.org/DIST/docs/tutorial/Tutorial.html.

Parilina, Elena. 2020. “Lecture 7. Cluster Analysis. Applied Statistics in R.” Saint Petersburg State University; Lecture.

Theodoridis, Sergios, and Konstantinos Koutroumbas. 2008. Pattern Recognition, Fourth Edition. 4th ed. USA: Academic Press, Inc.

Xie, Yihui. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman & Hall/CRC. http://yihui.name/knitr/.