Clustering with R

Data

## install.packages("NbClust")
## install.packages("reshape2")
## install.packages("viridis")
## install.packages("factoextra")
## install.packages("cluster")
## install.packages("magrittr")

library(factoextra)
library(cluster)
library(magrittr)
library(Wu)

# Load and prepare the data
data("USArrests")

my_data <- USArrests %>%
  na.omit() %>%        # Remove missing values (NA)
  scale() %>%          # Scale (standardize) the variables
  as.data.table()

my_data %>% DT()

Data Summary

## colnames(my_data)

Vars <- colnames(my_data)

FactorVars <- NULL

t <- Table1n(obj = my_data, Vars = Vars, FactorVars = FactorVars)

t %>% prt()
            Statistic        Overall            Missing
n                            50
Murder      mean (SD)        0.0 (1.0)          0.0
            median [IQR]     -0.1 [-0.9, 0.8]   0.0
            median [range]   -0.1 [-1.6, 2.2]   0.0
Assault     mean (SD)        0.0 (1.0)          0.0
            median [IQR]     -0.1 [-0.7, 0.9]   0.0
            median [range]   -0.1 [-1.5, 2.0]   0.0
UrbanPop    mean (SD)        0.0 (1.0)          0.0
            median [IQR]     0.0 [-0.8, 0.8]    0.0
            median [range]   0.0 [-2.3, 1.8]    0.0
Rape        mean (SD)        0.0 (1.0)          0.0
            median [IQR]     -0.1 [-0.7, 0.5]   0.0
            median [range]   -0.1 [-1.5, 2.6]   0.0

Distance

  • Distances are computed here from the four continuous variables
  • Methods for measuring distance
  • Fred Szabo: Manhattan Distance
  • Euclidean distance: \[d_{euc}(x,y)=\sqrt{\sum_{i=1}^n(x_i - y_i)^2}\]
  • Manhattan distance is the distance a car would drive in a city laid out in a grid (such as Manhattan). It is the sum of absolute differences, also known as the \(L^1\) norm. \[d_{man}(x,y)=\sum_{i=1}^n|x_i - y_i|\]
  • Pearson correlation distance measures the degree of linear relationship between two profiles. It belongs to the correlation-based distances, another type of dissimilarity measurement. Since r ranges from -1 to 1, the resulting distance can be rescaled to the range 0 to 1. \[d_{cor}(x,y) = 1 - \frac{\sum_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n(x_i - \bar{x})^2\sum_{i=1}^n(y_i - \bar{y})^2}}\] \[d = (1 - r)/2\]
  • Spearman correlation distance computes the correlation between the ranks of x and y.
  • Kendall correlation distance: \[d_{kend}(x,y) = 1 - \frac{n_c - n_d}{n(n-1)/2}\] where \(n_c\) is the total number of concordant pairs, \(n_d\) the total number of discordant pairs, and \(n(n-1)/2\) the total number of possible pairs.
library(factoextra)
res.dist <- get_dist(USArrests, stand = TRUE, method = "manhattan")

fviz_dist(res.dist, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
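
As a quick sanity check on the definitions above, a minimal sketch (the toy vectors x and y are invented for illustration) comparing base R's dist() and cor() against the formulas:

# Toy vectors, purely illustrative
x <- c(1, 2, 3, 4)
y <- c(2, 4, 1, 3)

sqrt(sum((x - y)^2))                # Euclidean; matches dist(rbind(x, y), method = "euclidean")
sum(abs(x - y))                     # Manhattan; matches dist(rbind(x, y), method = "manhattan")
1 - cor(x, y)                       # Pearson correlation distance
1 - cor(x, y, method = "spearman")  # Spearman correlation distance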

Gower Dissimilarity

library(cluster)

Dist <- daisy(iris, metric = "gower")  # Gower handles the mixed numeric/factor columns of iris
library(factoextra)
fviz_dist(Dist)

## hclust(Dist)
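
The commented-out hclust() call hints at the natural next step: feeding the Gower dissimilarities into hierarchical clustering. A minimal sketch (the average linkage and k = 3 are our choices, not prescribed above):

hc <- hclust(Dist, method = "average")  # hierarchical clustering on the Gower matrix
grp <- cutree(hc, k = 3)                # cut the tree into 3 groups
table(grp, iris$Species)                # compare groups against the known species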

Silhouette

  • Silhouette measures the consistency of each data point with its cluster.
  • Its value ranges from -1 to +1; a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.
  • a: the mean distance between point i and all other points in its own cluster.
  • b: the smallest mean distance from i to all points in any other cluster. The cluster attaining this smallest mean dissimilarity is called the neighboring cluster.
  • The silhouette value is \(s(i) = \frac{b - a}{\max(a,b)}\); see the sketch after the plot below.
  • The score is defined as zero if the cluster contains only a single point.
  • pam: the partitioning-around-medoids function.
  • Medoids are representative objects within a cluster. They are similar in concept to means or centroids, but medoids are always restricted to be members of the cluster.
## Silhouette plot
pamx <- pam(Dist, 3)                      # partition the Gower dissimilarities into 3 clusters
sil <- silhouette(pamx$clustering, Dist)  # per-observation silhouette widths
plot(sil)
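
To see \(s(i) = (b - a)/\max(a, b)\) at work, a minimal sketch recomputing the silhouette of the first observation by hand (all variable names here are ad hoc):

m <- as.matrix(Dist)                    # full dissimilarity matrix
cl <- pamx$clustering
i <- 1                                  # first observation
own <- cl == cl[i]; own[i] <- FALSE     # the other members of i's cluster
a <- mean(m[i, own])                    # a: mean distance within i's own cluster
b <- min(tapply(m[i, cl != cl[i]], cl[cl != cl[i]], mean))  # b: closest other cluster
(b - a) / max(a, b)                     # matches sil[i, "sil_width"]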

K-means Clustering

library("factoextra")

set.seed(123)
fviz_nbclust(my_data, kmeans, nstart = 10, method = "gap_stat", nboot = 500) + labs(subtitle = "Gap statistic method")

Plot K-means Clusters

set.seed(123)
km.res <- kmeans(my_data, 3, nstart = 25)
# Visualize
library("factoextra")
fviz_cluster(km.res, data = my_data, ellipse.type = "convex", palette = "jco", ggtheme = theme_minimal())
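
Beyond the plot, the fitted kmeans object carries the pieces worth inspecting; a short usage sketch:

km.res$size           # number of states per cluster
km.res$centers        # cluster centers on the scaled variables
head(km.res$cluster)  # cluster assignment of the first few states
km.res$tot.withinss   # total within-cluster sum of squares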

Gap Statistic

The gap statistic estimates the optimal number of clusters, providing a statistical procedure that formalizes the elbow heuristic.

Let \(D_r\) denote the sum of the pairwise distances between all points in cluster r, and let \(W_k\) be the sum over the k clusters of the cluster means of \(D_r\). If d is the squared Euclidean distance, then \(W_k\) is the pooled within-cluster sum of squares around the cluster means.
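
Written out, following Tibshirani, Walther, and Hastie's notation (with \(n_r\) the size of cluster \(C_r\)):

\[D_r = \sum_{i,i' \in C_r} d_{ii'}\] \[W_k = \sum_{r=1}^{k} \frac{1}{2n_r} D_r\]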

The gap statistic compares the total within-cluster variation for different values of k with its expected value under a null reference distribution of the data (see the formula below). The estimated optimal number of clusters is the value of k that maximizes the gap statistic, i.e., that yields the largest gap. A large gap means that the clustering structure is far from a random uniform distribution of points.
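
Formally, with \(E^*_n\) denoting the expectation under a sample of size n from the reference distribution:

\[Gap_n(k) = E^*_n\{\log W_k\} - \log W_k\]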

1. Compute the observed within-cluster variation \(W_k\) by clustering the data for each k.
2. Generate B reference data sets from a random uniform distribution, cluster each, and compute the expected \(W_k\).
3. Compute the estimated gap statistic as the deviation of the observed \(\log W_k\) from its expected value.
4. Choose the smallest k such that the gap statistic is within one standard deviation of the gap at k+1.
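
These four steps are what cluster::clusGap() implements; a minimal sketch (B is kept small here only for speed, and "Tibs2001SEmax" is the one-standard-error rule from step 4):

library(cluster)
set.seed(123)
gap <- clusGap(as.matrix(my_data), FUNcluster = kmeans, nstart = 25, K.max = 10, B = 50)
# Smallest k whose gap is within one standard error of the gap at k + 1
maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "Tibs2001SEmax")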

Hierarchical Clustering

  • Cluster Dendrogram
  • A dendrogram is a diagram that shows the hierarchical relationship between objects.
# Compute hierarchical clustering
library(Wu)
res.hc <- USArrests %>%
  scale() %>%                    # Scale the data
  dist(method = "euclidean") %>% # Compute dissimilarity matrix
  hclust(method = "ward.D2")     # Compute hierarchical clustering

# Visualize using factoextra
# Cut in 4 groups and color by groups
fviz_dend(res.hc, k = 4, # Cut in four groups
          cex = 0.5, # label size
          k_colors = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),
          color_labels_by_k = TRUE, # color labels by groups
          rect = TRUE # Add rectangle around groups
          )
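
The dendrogram colors correspond to cutting the tree into four groups; a short sketch extracting those memberships with cutree() for further use:

grp <- cutree(res.hc, k = 4)  # cluster membership for each state
table(grp)                    # cluster sizes
head(grp)                     # first few states and their clusters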

Determine the Optimal Number of Clusters

set.seed(123)

# Compute
library("NbClust")

res.nbclust <- USArrests %>%
  scale() %>%
  NbClust(distance = "euclidean", min.nc = 2, max.nc = 10,
          method = "complete", index = "all")

*** : The Hubert index is a graphical method of determining the number of clusters. In the plot of Hubert index, we seek a significant knee that corresponds to a significant increase of the value of the measure i.e the significant peak in Hubert index second differences plot.

*** : The D index is a graphical method of determining the number of clusters. In the plot of D index, we seek a significant knee (the significant peak in Dindex second differences plot) that corresponds to a significant increase of the value of the measure.


Among all indices:

  • 9 proposed 2 as the best number of clusters
  • 4 proposed 3 as the best number of clusters
  • 6 proposed 4 as the best number of clusters
  • 2 proposed 5 as the best number of clusters
  • 1 proposed 8 as the best number of clusters
  • 1 proposed 10 as the best number of clusters

Conclusion

  • According to the majority rule, the best number of clusters is 2.


# Visualize
library(factoextra)
fviz_nbclust(res.nbclust, ggtheme = theme_minimal())

Among all indices:

  • 2 proposed 0 as the best number of clusters
  • 1 proposed 1 as the best number of clusters
  • 9 proposed 2 as the best number of clusters
  • 4 proposed 3 as the best number of clusters
  • 6 proposed 4 as the best number of clusters
  • 2 proposed 5 as the best number of clusters
  • 1 proposed 8 as the best number of clusters
  • 1 proposed 10 as the best number of clusters

Conclusion

  • According to the majority rule, the best number of clusters is 2.

Computing Environment

sessionInfo()

R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
 [1] NbClust_3.0 cluster_2.1.0 factoextra_1.0.7
[4] Wu_0.0.0.9000 flexdashboard_0.5.2 lme4_1.1-26
[7] Matrix_1.2-18 mgcv_1.8-33 nlme_3.1-149
[10] png_0.1-7 scales_1.1.1 nnet_7.3-14
[13] labelled_2.7.0 kableExtra_1.3.1 plotly_4.9.3
[16] gridExtra_2.3 ggplot2_3.3.3 DT_0.17
[19] tableone_0.12.0 magrittr_2.0.1 lubridate_1.7.9.2
[22] dplyr_1.0.3 plyr_1.8.6 data.table_1.13.6
[25] rmdformats_0.3.7 knitr_1.30

loaded via a namespace (and not attached):
 [1] webshot_0.5.2 httr_1.4.2 ggsci_2.9 tools_4.0.3
[5] backports_1.2.0 R6_2.5.0 DBI_1.1.1 lazyeval_0.2.2
[9] colorspace_2.0-0 withr_2.4.0 tidyselect_1.1.0 curl_4.3
[13] compiler_4.0.3 rvest_0.3.6 formatR_1.7 xml2_1.3.2
[17] labeling_0.4.2 bookdown_0.21 stringr_1.4.0 digest_0.6.27
[21] foreign_0.8-80 minqa_1.2.4 rmarkdown_2.6 rio_0.5.16
[25] pkgconfig_2.0.3 htmltools_0.5.1 highr_0.8 readxl_1.3.1
[29] htmlwidgets_1.5.3 rlang_0.4.10 rstudioapi_0.13 generics_0.1.0
[33] farver_2.0.3 jsonlite_1.7.2 crosstalk_1.1.1 dendextend_1.14.0
[37] zip_2.1.1 car_3.0-10 Rcpp_1.0.6 munsell_0.5.0
[41] viridis_0.5.1 abind_1.4-5 lifecycle_0.2.0 stringi_1.5.3
[45] yaml_2.2.1 carData_3.0-4 MASS_7.3-53 grid_4.0.3
[49] ggrepel_0.8.2 forcats_0.5.0 crayon_1.3.4 lattice_0.20-41
[53] haven_2.3.1 splines_4.0.3 hms_1.0.0 pillar_1.4.7
[57] ggpubr_0.4.0 boot_1.3-25 ggsignif_0.6.0 reshape2_1.4.4
[61] glue_1.4.2 evaluate_0.14 mitools_2.4 vctrs_0.3.6
[65] nloptr_1.2.2.2 cellranger_1.1.0 gtable_0.3.0 purrr_0.3.4
[69] tidyr_1.1.2 assertthat_0.2.1 openxlsx_4.2.2 xfun_0.20
[73] broom_0.7.1 survey_4.0 e1071_1.7-4 rstatix_0.6.0
[77] class_7.3-17 survival_3.2-7 viridisLite_0.3.0 tibble_3.0.5
[81] statmod_1.4.35 ellipsis_0.3.1