Clustering
Clustering with R
Data
## install.packages("NbClust")
## install.packages("reshape2")
## install.packages("viridis")
## install.packages("factoextra")
## install.packages("cluster")
## install.packages("magrittr")
library(factoextra)
library(cluster)
library(magrittr)
library(Wu)
# Load and prepare the data
library(data.table)  # needed for as.data.table()
data("USArrests")
my_data <- USArrests %>%
    na.omit() %>%      # Remove missing values (NA)
    scale() %>%        # Scale variables to mean 0, SD 1
    as.data.table()
my_data %>% DT()
Data Summary
## colnames(my_data)
Vars <- colnames(my_data)
FactorVars <- NULL
t <- Table1n(obj = my_data, Vars = Vars, FactorVars = FactorVars)
t %>% prt()
|          | level          | Overall          | Missing |
|----------|----------------|------------------|---------|
| n        |                | 50               |         |
| Murder   | mean (SD)      | 0.0 (1.0)        | 0.0     |
|          | median [IQR]   | -0.1 [-0.9, 0.8] | 0.0     |
|          | median [range] | -0.1 [-1.6, 2.2] | 0.0     |
| Assault  | mean (SD)      | 0.0 (1.0)        | 0.0     |
|          | median [IQR]   | -0.1 [-0.7, 0.9] | 0.0     |
|          | median [range] | -0.1 [-1.5, 2.0] | 0.0     |
| UrbanPop | mean (SD)      | 0.0 (1.0)        | 0.0     |
|          | median [IQR]   | 0.0 [-0.8, 0.8]  | 0.0     |
|          | median [range] | 0.0 [-2.3, 1.8]  | 0.0     |
| Rape     | mean (SD)      | 0.0 (1.0)        | 0.0     |
|          | median [IQR]   | -0.1 [-0.7, 0.5] | 0.0     |
|          | median [range] | -0.1 [-1.5, 2.6] | 0.0     |
Distance
- Distance between observations based on the four continuous variables
- Methods for measuring distance (sanity-checked in the sketch after this list):
- Euclidean distance: \[d_{euc}(x,y)=\sqrt{\sum_{i=1}^n(x_i - y_i)^2}\]
- Manhattan distance is the distance a car would drive in a city laid out on a grid (like Manhattan). It is the sum of absolute differences, also known as the \(L^1\) norm (see Fred Szabo: Manhattan Distance). \[d_{man}(x,y)=\sum_{i=1}^n|x_i - y_i|\]
- Pearson correlation distance measures the degree of linear relationship between two profiles. It is a correlation-based dissimilarity measure. r ranges from -1 to 1, so the distance can be rescaled to the range 0 to 1 by \(d = (1 - r)/2\). \[d_{cor}(x,y) = 1 - \frac{\sum_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n(x_i - \bar{x})^2\sum_{i=1}^n(y_i - \bar{y})^2}}\]
- Spearman correlation distance computes the correlation between the ranks of x and y.
- Kendall correlation distance: \[d_{kend}(x,y) = 1 - \frac{n_c - n_d}{n(n-1)/2}\] where \(n_c\) is the total number of concordant pairs, \(n_d\) is the total number of discordant pairs, and \(n(n-1)/2\) is the total number of possible pairings.
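As a quick sanity check of these formulas, a minimal sketch (assuming `my_data` from the Data section above) computing the distances between the first two observations by hand; the Manhattan value agrees with `dist()`:

m <- as.matrix(my_data)
x <- m[1, ]; y <- m[2, ]
sqrt(sum((x - y)^2))                                   # Euclidean
sum(abs(x - y))                                        # Manhattan (L1)
1 - cor(x, y)                                          # Pearson correlation distance
as.matrix(dist(m[1:2, ], method = "manhattan"))[1, 2]  # matches the L1 value above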
library(factoextra)
# Standardize the variables, then compute the Manhattan distance matrix
res.dist <- get_dist(USArrests, stand = TRUE, method = "manhattan")
# Heat map of the pairwise distances
fviz_dist(res.dist, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
Gower Dissimilarity
- Gower's dissimilarity measure handles mixed numeric/categorical data
- Each variable contributes a dissimilarity on [0, 1] (range-scaled absolute difference for numeric variables, simple matching for categorical ones); the overall score is the sum of the per-variable dissimilarities divided by the number of comparable variables, as checked in the toy example below.
library(cluster)
# Gower dissimilarity on the mixed-type iris data
Dist <- daisy(iris, metric = "gower")
library(factoextra)
fviz_dist(Dist)
## hclust(Dist)
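A toy check of the definition, using a hypothetical two-variable mixed data frame (the names `height` and `smoker` are made up for illustration); `daisy()` reproduces the hand computation:

library(cluster)
# Hypothetical mixed data: one numeric and one categorical variable
toy <- data.frame(height = c(170, 180, 175),
                  smoker = factor(c("yes", "no", "yes")))
# Rows 1 and 3: numeric part |170 - 175| / range = 5/10 = 0.5; categorical part 0
mean(c(5 / 10, 0))                             # 0.25
as.matrix(daisy(toy, metric = "gower"))[1, 3]  # also 0.25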
Silhouette
- Silhouette measures the consistency within clusters of data
- Its value ranges from -1 to +1; a high value indicates the object is well matched to its own cluster and poorly matched to neighboring clusters.
- a(i): mean distance between i and all other points in its own cluster.
- b(i): smallest mean distance from i to all points in any other cluster. The cluster attaining this smallest mean dissimilarity is called the neighboring cluster.
- The silhouette value is \(s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}\) (recomputed by hand after the plot below)
- The score is defined as zero when the cluster contains a single observation
- pam: partitioning around medoids
- Medoids are representative objects within a cluster, similar in concept to means or centroids, but medoids are always restricted to be members of the cluster.
## silhouette plot
pamx <- pam(Dist, 3)                      # PAM on the Gower dissimilarity
sil <- silhouette(pamx$clustering, Dist)  # silhouette width per observation
plot(sil)
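To connect the plot back to the definition, s(i) for a single observation can be recomputed by hand; a sketch assuming `Dist`, `pamx`, and `sil` from the chunk above:

# Recompute s(i) for observation 1 from the definition
d   <- as.matrix(Dist)
cl  <- pamx$clustering
i   <- 1
own <- setdiff(which(cl == cl[i]), i)
a   <- mean(d[i, own])                                        # mean intra-cluster distance
b   <- min(tapply(d[i, cl != cl[i]], cl[cl != cl[i]], mean))  # nearest other cluster
(b - a) / max(a, b)
sil[i, "sil_width"]                                           # matches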
Kmeans Clustering
library("factoextra")
set.seed(123)
fviz_nbclust(my_data, kmeans, nstart = 10, method = "gap_stat", nboot = 500) +
    labs(subtitle = "Gap statistic method")
Plot Kmeans Clusters
set.seed(123)
km.res <- kmeans(my_data, 3, nstart = 25)
# Visualize
library("factoextra")
fviz_cluster(km.res, data = my_data,
             ellipse.type = "convex",
             palette = "jco",
             ggtheme = theme_minimal())
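A quick inspection of the fitted object (assuming `km.res` from the chunk above) shows what the clustering produced:

km.res$size           # number of observations per cluster
km.res$centers        # cluster centers on the scaled variables
head(km.res$cluster)  # cluster membership for the first rows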
Gap Statistic
- [[https://statweb.stanford.edu/~gwalther/gap][Gap Statistic]]
The gap statistic estimates the optimal number of clusters by formalizing the elbow heuristic as a statistical procedure.
Let \(D_r\) be the sum of the pairwise distances for all points in cluster r, and \(W_k = \sum_{r=1}^{k} D_r / (2 n_r)\) the pooled within-cluster variation. If the distance is squared Euclidean, \(W_k\) is the pooled within-cluster sum of squares around the cluster means.
- [[https://www.datanovia.com/en/lessons/determining-the-optimal-number-of-clusters-3-must-know-methods/][Gap Statistic Method]]
The gap statistic compares the total within-cluster variation for different values of k with its expected value under a null reference distribution of the data. The estimated optimal number of clusters is the value of k that maximizes the gap statistic, i.e. the clustering structure that is furthest from a random uniform distribution of points. The procedure is:
1. Cluster the data for varying numbers of clusters k and compute the observed within-cluster variation \(W_k\).
2. Generate B reference data sets with a random uniform distribution and compute the expected \(W_k\) under the null.
3. Compute the gap statistic as the deviation of the observed \(\log W_k\) from its expected value.
4. Choose the smallest k such that the gap statistic is within one standard deviation of the gap at k+1: \(\mathrm{Gap}(k) \ge \mathrm{Gap}(k+1) - s_{k+1}\).
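This procedure is available as `cluster::clusGap`; a minimal sketch on the scaled USArrests data (B = 100 reference sets here to keep it quick):

library(cluster)
set.seed(123)
# Gap statistic for k = 1..10 with kmeans as the clustering function
gap <- clusGap(scale(USArrests), FUNcluster = kmeans, nstart = 25,
               K.max = 10, B = 100)
# Step 4 above is the "Tibs2001SEmax" rule:
# smallest k with Gap(k) >= Gap(k+1) - SE(k+1)
maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "Tibs2001SEmax")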
Hierarchical Clustering
- Cluster Dendrogram
- A dendrogram is a diagram that shows the hierarchical relationship between objects.
# Compute hierarchical clustering
library(Wu)
res.hc <- USArrests %>%
scale() %>% # Scale the data
dist(method = "euclidean") %>% # Compute dissimilarity matrix
hclust(method = "ward.D2") # Compute hierarchical clustering
# Visualize using factoextra
# Cut in 4 groups and color by groups
fviz_dend(res.hc, k = 4, # Cut in four groups
cex = 0.5, # label size
k_colors = c("#2E9FDF", "#00AFBB", "#E7B800", "#FC4E07"),
color_labels_by_k = TRUE, # color labels by groups
rect = TRUE # Add rectangle around groups
)
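To work with the four groups shown in the dendrogram, cut the tree (assuming `res.hc` from above):

# Cut the tree into the same four groups as the dendrogram
grp <- cutree(res.hc, k = 4)
table(grp)    # group sizes
head(grp, 8)  # group membership for the first few states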
Determine the Optimal Number of Clusters
set.seed(123)
# Compute
library("NbClust")
res.nbclust <- USArrests %>%
    scale() %>%
    NbClust(distance = "euclidean", min.nc = 2,
            max.nc = 10, method = "complete", index = "all")
*** : The Hubert index is a graphical method of determining the number of clusters. In the plot of Hubert index, we seek a significant knee that corresponds to a significant increase of the value of the measure i.e the significant peak in Hubert index second differences plot.
*** : The D index is a graphical method of determining the number of clusters. In the plot of D index, we seek a significant knee (the significant peak in Dindex second differences plot) that corresponds to a significant increase of the value of the measure.
- Among all indices:
- 9 proposed 2 as the best number of clusters
- 4 proposed 3 as the best number of clusters
- 6 proposed 4 as the best number of clusters
- 2 proposed 5 as the best number of clusters
- 1 proposed 8 as the best number of clusters
- 1 proposed 10 as the best number of clusters
***** Conclusion *****
According to the majority rule, the best number of clusters is 2
# Visualize
library(factoextra)
fviz_nbclust(res.nbclust, ggtheme = theme_minimal())
Among all indices:
- 2 proposed 0 as the best number of clusters
- 1 proposed 1 as the best number of clusters
- 9 proposed 2 as the best number of clusters
- 4 proposed 3 as the best number of clusters
- 6 proposed 4 as the best number of clusters
- 2 proposed 5 as the best number of clusters
- 1 proposed 8 as the best number of clusters
- 1 proposed 10 as the best number of clusters
Conclusion
- According to the majority rule, the best number of clusters is 2.
Computing Environment
sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
other attached packages:
 [1] NbClust_3.0         cluster_2.1.0       factoextra_1.0.7
[4] Wu_0.0.0.9000 flexdashboard_0.5.2 lme4_1.1-26
[7] Matrix_1.2-18 mgcv_1.8-33 nlme_3.1-149
[10] png_0.1-7 scales_1.1.1 nnet_7.3-14
[13] labelled_2.7.0 kableExtra_1.3.1 plotly_4.9.3
[16] gridExtra_2.3 ggplot2_3.3.3 DT_0.17
[19] tableone_0.12.0 magrittr_2.0.1 lubridate_1.7.9.2
[22] dplyr_1.0.3 plyr_1.8.6 data.table_1.13.6
[25] rmdformats_0.3.7 knitr_1.30
loaded via a namespace (and not attached):
 [1] webshot_0.5.2     httr_1.4.2        ggsci_2.9         tools_4.0.3
[5] backports_1.2.0 R6_2.5.0 DBI_1.1.1 lazyeval_0.2.2
[9] colorspace_2.0-0 withr_2.4.0 tidyselect_1.1.0 curl_4.3
[13] compiler_4.0.3 rvest_0.3.6 formatR_1.7 xml2_1.3.2
[17] labeling_0.4.2 bookdown_0.21 stringr_1.4.0 digest_0.6.27
[21] foreign_0.8-80 minqa_1.2.4 rmarkdown_2.6 rio_0.5.16
[25] pkgconfig_2.0.3 htmltools_0.5.1 highr_0.8 readxl_1.3.1
[29] htmlwidgets_1.5.3 rlang_0.4.10 rstudioapi_0.13 generics_0.1.0
[33] farver_2.0.3      jsonlite_1.7.2    crosstalk_1.1.1   dendextend_1.14.0
[37] zip_2.1.1         car_3.0-10        Rcpp_1.0.6        munsell_0.5.0
[41] viridis_0.5.1 abind_1.4-5 lifecycle_0.2.0 stringi_1.5.3
[45] yaml_2.2.1 carData_3.0-4 MASS_7.3-53 grid_4.0.3
[49] ggrepel_0.8.2 forcats_0.5.0 crayon_1.3.4 lattice_0.20-41
[53] haven_2.3.1 splines_4.0.3 hms_1.0.0 pillar_1.4.7
[57] ggpubr_0.4.0 boot_1.3-25 ggsignif_0.6.0 reshape2_1.4.4
[61] glue_1.4.2 evaluate_0.14 mitools_2.4 vctrs_0.3.6
[65] nloptr_1.2.2.2 cellranger_1.1.0 gtable_0.3.0 purrr_0.3.4
[69] tidyr_1.1.2 assertthat_0.2.1 openxlsx_4.2.2 xfun_0.20
[73] broom_0.7.1 survey_4.0 e1071_1.7-4 rstatix_0.6.0
[77] class_7.3-17 survival_3.2-7 viridisLite_0.3.0 tibble_3.0.5
[81] statmod_1.4.35 ellipsis_0.3.1