Measures

Metric (Dissimilarity measure)

Definition of a metric

A dissimilarity measure is a numerical measure of how different two data objects are.

A metric (dissimilarity measure) on a set \(X\) is a function:

\[d: X \times X \rightarrow[0, \infty)\] where \([0,\infty )\) is the set of non-negative real numbers and for all \(x,y,z\in X\), the following conditions are satisfied:

  1. non-negativity: \(d(x,y)\ge 0\);
  2. identity: \(d(x, y)=0 \Leftrightarrow x=y\);
  3. symmetry: \(d(x,y)=d(y,x)\);
  4. triangle inequality: \(d(x, y) \leq d(x, z)+d(z, y)\).

(Theodoridis and Koutroumbas 2008)

Minkowski distance

Definition:

The Minkowski distance of order \(p\) between two points \(X\) and \(Y\) is defined as \[ d(X,Y)=\bigg(\sum_{i=1}^{n}|x_i-y_i|^p\bigg)^{\frac{1}{p}} \] where:

\[X=\left(x_{1}, x_{2}, \ldots, x_{n}\right) \text { and } Y=\left(y_{1}, y_{2}, \ldots, y_{n}\right) \in \mathbb{R}^{n}\] For \(p\ge 1\), the Minkowski distance is a metric, but the triangle inequality is violated when \(p<1\). For example, when \(p<1\) the distance between \((0,0)\) and \((1,1)\) is \(2^{1/p}>2\), while the distance from \((0,1)\) to each of those points is \(1\). Since this violates the triangle inequality, \(d\) is not a metric for \(p<1\).

Special cases (illustrated in the R sketch after this list):

  1. Manhattan (city block) distance: \(p=1\);
  2. Euclidean distance: \(p=2\);
  3. Chebyshev distance: \(p\rightarrow \infty\).
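As a minimal sketch (not part of the original report), the built-in dist() function can compute Minkowski distances; the two points and the order \(p=3\) below are purely illustrative.

x <- c(1, 2, 3)
y <- c(4, 0, 3)
# Minkowski distance of an arbitrary order p = 3
dist(rbind(x, y), method = "minkowski", p = 3)
# The three special cases
dist(rbind(x, y), method = "manhattan")  # p = 1
dist(rbind(x, y), method = "euclidean")  # p = 2
dist(rbind(x, y), method = "maximum")    # p -> infinity (Chebyshev)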

Similarity measure

A similarity measure is a numerical measure of how alike two data objects are. It often falls in the range \([0,1]\).

A similarity measure \(s\) has the following properties:

  1. \(0\le s(x,y)\le 1\);
  2. \(s(x,y)=1,\ \text{iff}\ x=y\);
  3. Symmetry: for all \(x\) and \(y\) \(s(x,y)=s(y,x)\).

(Parilina 2020)

Popular methods for measuring similarity include:

  1. Pearson correlation coefficient;
  2. Uncentered correlation (cosine of the angle);
  3. Spearman rank correlation;
  4. Kendall’s \(\tau\);

(Cock et al. 2009)

The results obtained by the above methods often fall in the range \([-1,1]\); we have to rescale them to \([0,1]\).

Relationship between \(s\) and \(d\)

The distance \(d\) corresponding to a similarity measure \(s\) is:

\[ d(x,y)=1-s(x,y) \]

The similarity \(s\) corresponding to a dissimilarity measure \(d\) is:

\[ s(x,y)=\frac{1}{1+d(x,y)} \]
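As a hedged illustration (not from the original analysis), a Pearson correlation in \([-1,1]\) can be rescaled to \([0,1]\) and then converted with the formulas above; the vectors and the linear rescaling \((r+1)/2\) are just one common choice.

# Two illustrative numeric vectors
x <- c(1.2, 3.4, 2.2, 5.0)
y <- c(0.9, 3.1, 2.5, 4.8)

r <- cor(x, y, method = "pearson")  # falls in [-1, 1]
s <- (r + 1) / 2                    # rescaled similarity in [0, 1]
d <- 1 - s                          # distance corresponding to s
s_from_d <- 1 / (1 + d)             # similarity corresponding to a dissimilarity d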

Load Packages
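The original chunk is not shown here; a plausible set of library() calls, based on the functions used later in this report, is:

library(NbClust)  # NbClust(): optimal number of clusters
library(cluster)  # pam(), diana(): k-medoids and divisive clustering
library(GGally)   # ggpairs(): pairwise plots
library(ggplot2)  # graphics underlying GGally
library(knitr)    # report generation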

Data “file01.txt”

Read Data

To read “.txt” files, I use the R function read.table().
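A minimal sketch of the call, assuming the file is in the working directory, has a header row, and keeps the planet names in the first column as row names:

file01 <- read.table("file01.txt", header = TRUE, row.names = 1)
head(file01, 5)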

The variables are described as follows:

  1. Name: the year of sighting and the initials of the astronomer.
  2. Node: the angle, in the earth’s plane of orbit, at which the minor planet crosses the earth’s orbit.
  3. Inclination: the angle between the orbits of the earth and the minor planet.
  4. Axis: the maximum distance between the minor planet and the sun, divided by the corresponding quantity for the earth.

The table below shows the first five rows of the data:

        Node     Inclination  Axis
1935RF  130.916  4.659        2.2562
1941FD  132.200  4.700        2.1300
1955QT  130.070  4.790        2.1893
1940YL  338.333  16.773       2.7465
1953NH  339.625  16.067       2.7335

Hierarchical clustering

Hierarchical clustering does not require the number of clusters to be pre-specified. It can be subdivided into two types:

  1. Agglomerative clustering, in which each observation is initially considered a cluster of its own (a leaf). The most similar clusters are then successively merged until there is just one single big cluster (the root).
  2. Divisive clustering, the inverse of agglomerative clustering, begins with the root, in which all objects are included in one cluster. The most heterogeneous clusters are then successively divided until every observation is in its own cluster.

Agglomerative clustering
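The original chunk is not reproduced here; a minimal sketch with hclust(), assuming the file01 data from above, scaled variables and Ward linkage (all assumptions), is:

d01 <- dist(scale(file01), method = "euclidean")  # dissimilarity matrix
hc_agg <- hclust(d01, method = "ward.D2")         # agglomerative clustering
plot(hc_agg, main = "Agglomerative clustering (file01)")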

Divisive clustering
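A corresponding sketch for divisive clustering uses diana() from the cluster package; again this is an illustration under the same assumptions rather than the original code.

library(cluster)
hc_div <- diana(scale(file01), metric = "euclidean")  # divisive clustering
plot(hc_div, which.plots = 2, main = "Divisive clustering (file01)")  # dendrogram only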

Determining the optimal number of clusters

To determine the optimal number of clusters, I use the function NbClust() from the package NbClust.
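A sketch of the call; the scaling, the linkage method and the 2–10 cluster range are assumptions, not taken from the original chunk.

library(NbClust)
res_man <- NbClust(scale(file01), distance = "manhattan",
                   min.nc = 2, max.nc = 10, method = "ward.D2", index = "all")
res_euc <- NbClust(scale(file01), distance = "euclidean",
                   min.nc = 2, max.nc = 10, method = "ward.D2", index = "all")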

manhattan distance

## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 7 proposed  2 as the best number of clusters
## * 6 proposed  3 as the best number of clusters
## * 1 proposed  7 as the best number of clusters
## * 1 proposed  8 as the best number of clusters
## * 2 proposed  9 as the best number of clusters
## * 7 proposed  10 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  2 .

euclidean distance

## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 6 proposed  2 as the best number of clusters
## * 6 proposed  3 as the best number of clusters
## * 1 proposed  7 as the best number of clusters
## * 1 proposed  8 as the best number of clusters
## * 2 proposed  9 as the best number of clusters
## * 8 proposed  10 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  10 .

From the results above, the optimal numbers of clusters under the euclidean and manhattan metrics are 10 and 2, respectively.

PAM (k-medoids) algorithm

The k-medoids algorithm is a clustering approach related to k-means for partitioning a data set into k groups or clusters. However, the pam() function in R allows the metric to be specified as either “euclidean” or “manhattan”, so I use this function to try different distance methods.
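A minimal sketch of the calls; the number of clusters \(k=2\) (suggested by the NbClust result for the manhattan metric) and the scaling step are assumptions.

library(cluster)
pam_man <- pam(scale(file01), k = 2, metric = "manhattan")
pam_euc <- pam(scale(file01), k = 2, metric = "euclidean")
pam_man$medoids            # representative objects (medoids) of each cluster
table(pam_man$clustering)  # cluster sizes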

Data “file02.txt”

Read Data

To read “.txt” files, I use the R function read.table().

The variables are described as follows:

  1. Name: animal.
  2. Water: the percentage of water.
  3. Protein: the percentage of protein.
  4. Fat: the percentage of fat.
  5. Lactose: the percentage of lactose.

The table below shows the first five rows of the data:

        Water (%)  Protein (%)  Fat (%)  Lactose (%)
Horse   90.1       2.6          1.0      6.9
Donkey  90.3       1.7          1.4      6.2
Mule    90.0       2.0          1.8      5.5
Camel   87.7       3.5          3.4      4.8
Llama   86.5       3.9          3.2      5.6

Determining the optimal number of clusters

To determine the optimal number of clusters, I use the function NbClust() from the package NbClust.

manhattan distance

## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 7 proposed  2 as the best number of clusters
## * 6 proposed  3 as the best number of clusters
## * 1 proposed  4 as the best number of clusters
## * 3 proposed  7 as the best number of clusters
## * 3 proposed  9 as the best number of clusters
## * 4 proposed  10 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  2 .

euclidean distance

## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 7 proposed  2 as the best number of clusters
## * 6 proposed  3 as the best number of clusters
## * 1 proposed  4 as the best number of clusters
## * 3 proposed  7 as the best number of clusters
## * 3 proposed  9 as the best number of clusters
## * 4 proposed  10 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  2 .

From the results above, the optimal number of clusters under both the euclidean and manhattan metrics is 2.

Data “file07.txt”

Read Data

To read “.txt” files, I use the R function read.table().

The variables are described as follows:

  1. Country: the name of the country.
  2. Year: the year in which the data was computed.
  3. M0, M25, M50 and M75: the remaining life expectancies for a male of ages 0, 25, 50 and 75.
  4. F0, F25, F50 and F75: the remaining life expectancies for a female of ages 0, 25, 50 and 75.

The table below shows the first five rows of the data:

                 M0  M25  M50  M75  F0  F25  F50  F75
Algeria-1965     63   51   30   13  67   54   34   15
Cameroon-1964    34   29   13    5  38   32   17    6
Madagascar-1966  38   30   17    7  38   34   20    7
Mauritius-1966   59   42   20    6  64   46   25    8
Reunion-1963     56   38   18    7  62   46   25   10

Determining the optimal number of clusters

To determine the optimal number of clusters, I use the function NbClust() from the package NbClust.

manhattan distance

## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 4 proposed  2 as the best number of clusters
## * 6 proposed  3 as the best number of clusters
## * 5 proposed  4 as the best number of clusters
## * 1 proposed  6 as the best number of clusters
## * 1 proposed  9 as the best number of clusters
## * 1 proposed  10 as the best number of clusters
## * 2 proposed  11 as the best number of clusters
## * 3 proposed  13 as the best number of clusters
## * 1 proposed  15 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  3 .

euclidean distance

## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 3 proposed  2 as the best number of clusters
## * 6 proposed  3 as the best number of clusters
## * 5 proposed  4 as the best number of clusters
## * 1 proposed  6 as the best number of clusters
## * 1 proposed  9 as the best number of clusters
## * 1 proposed  10 as the best number of clusters
## * 1 proposed  11 as the best number of clusters
## * 5 proposed  13 as the best number of clusters
## * 1 proposed  15 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  3 .

From the results above, the optimal number of clusters under both the euclidean and manhattan metrics is 3.

Data “iris.data”

Read Data

To read “.data” files, I use the R function read.table().
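The UCI iris.data file is comma-separated and has no header row, so a sketch of the call (the column names are supplied manually and are an assumption about the original chunk) is:

iris_df <- read.table("iris.data", sep = ",",
                      col.names = c("Sepal.Length", "Sepal.Width",
                                    "Petal.Length", "Petal.Width", "Class"))
head(iris_df, 5)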

For the iris dataset, the variables are as follows:

  1. sepal length in cm
  2. sepal width in cm
  3. petal length in cm
  4. petal width in cm
  5. class

The table below shows the first five rows of the data:

Sepal Length  Sepal Width  Petal Length  Petal Width  Class
5.1           3.5          1.4           0.2          Iris-setosa
4.9           3.0          1.4           0.2          Iris-setosa
4.7           3.2          1.3           0.2          Iris-setosa
4.6           3.1          1.5           0.2          Iris-setosa
5.0           3.6          1.4           0.2          Iris-setosa

Determining the optimal number of clusters

To determine the optimal number of clusters, I use the function NbClust() from the package NbClust.

manhattan distance

## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 12 proposed  2 as the best number of clusters
## * 9 proposed  3 as the best number of clusters
## * 1 proposed  4 as the best number of clusters
## * 1 proposed  6 as the best number of clusters
## * 1 proposed  9 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  2 .

euclidean distance

## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 11 proposed  2 as the best number of clusters
## * 10 proposed  3 as the best number of clusters
## * 1 proposed  4 as the best number of clusters
## * 1 proposed  6 as the best number of clusters
## * 1 proposed  9 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  2 .

From the results above, the optimal number of clusters under both the euclidean and manhattan metrics is 2.

PAM (k-medoids) algorithm

It is not difficult to see that the performance of the PAM algorithm with the manhattan metric (\(k=3\)) is slightly better than that of the kmeans algorithm.
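A sketch of how the two fits could be obtained; the use of the four numeric columns, the object names and the random seed are assumptions.

library(cluster)
X <- iris_df[, 1:4]                             # numeric variables only
pam_fit <- pam(X, k = 3, metric = "manhattan")  # k-medoids with manhattan metric
set.seed(1)                                     # kmeans uses random starting centers
km_fit <- kmeans(X, centers = 3, nstart = 25)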

Visualizing Data

To visualize the results, I use the R function ggpairs() from the package GGally (which builds on ggplot2). The result of the PAM algorithm with the manhattan metric (\(k=3\)) is visualized as follows.
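A sketch of the plotting call, assuming the PAM cluster labels are attached as a factor column; the object names follow the earlier sketches and are assumptions.

library(GGally)
library(ggplot2)
plot_df <- cbind(X, Cluster = factor(pam_fit$clustering))
ggpairs(plot_df, columns = 1:4, mapping = aes(colour = Cluster))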

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The elements on the diagonal are the density plots of the three iris classes for the corresponding variables.

Construct the confusion matrix to evaluate the clustering results.
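A sketch of how such a table and the accuracy can be computed; the object names follow the earlier sketches and are assumptions.

conf_pam <- table(pam_fit$clustering, iris_df$Class)
conf_km  <- table(km_fit$cluster, iris_df$Class)
conf_pam
# accuracy: each cluster is labelled with its majority class, and the
# correctly clustered objects are divided by the total
sum(apply(conf_pam, 1, max)) / sum(conf_pam)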

confusion matrix of PAM algorithm with the manhattan metric (\(k=3\))

##    
##     Iris-setosa Iris-versicolor Iris-virginica
##   1          50               0              0
##   2           0              49              1
##   3           0               1             49

\[ ACC=\frac{148}{150}\approx 0.987 \]

confusion matrix of kmeans algorithm with the euclidean metric (\(k=3\))

##    
##     Iris-setosa Iris-versicolor Iris-virginica
##   1           0              48             14
##   2          50               0              0
##   3           0               2             36

\[ ACC=\frac{134}{150}\approx 0.893 \]

Clearly, in the case where \(k=3\), the accuracy of the PAM algorithm with the manhattan metric (148/150) is much higher than that of the kmeans algorithm with the euclidean metric (134/150).

Acknowledgements

Thanks to knitr, designed by (Xie 2015).

Thanks also to the website (DATANOVA 2019) and to (Jeff Chang 2019) for providing a great deal of helpful information.

References

Cock, Peter J. A., Tiago Antao, Jeffrey T. Chang, Brad A. Chapman, Cymon J. Cox, Andrew Dalke, Iddo Friedberg, et al. 2009. “Biopython: freely available Python tools for computational molecular biology and bioinformatics.” Bioinformatics 25 (11): 1422–3. https://doi.org/10.1093/bioinformatics/btp163.

DATANOVA. 2019. “TYPES of Clustering Methods: OVERVIEW and Quick Start R Code.” https://www.datanovia.com/en/blog/types-of-clustering-methods-overview-and-quick-start-r-code/.

Jeff Chang, Iddo Friedberg, Brad Chapman. 2019. “Biopython Tutorial and Cookbook.” http://biopython.org/DIST/docs/tutorial/Tutorial.html.

Parilina, Elena. 2020. “Lecture 7. Cluster Analysis. Applied Statistics in R.” Saint Petersburg State University; Lecture.

Theodoridis, Sergios, and Konstantinos Koutroumbas. 2008. Pattern Recognition, Fourth Edition. 4th ed. USA: Academic Press, Inc.

Xie, Yihui. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman & Hall/CRC. http://yihui.name/knitr/.