Clustering Example: 4 Steps You Should Know

This article describes a k-means clustering example and provides a step-by-step guide summarizing the steps to follow when conducting a cluster analysis on a real data set using R.

We’ll use mainly two R packages:

cluster: for cluster analyses and
factoextra: for the visualization of the analysis results.

Install these packages as follows:

```
install.packages(c("cluster", "factoextra"))
```

A rigorous cluster analysis can be conducted in the steps listed below:

Data preparation
Assessing clustering tendency (i.e., the clusterability of the data)
Defining the optimal number of clusters
Computing partitioning cluster analyses (e.g., k-means, PAM) or hierarchical clustering
Validating clustering analyses: silhouette plot

Here, we provide quick R scripts to perform all these steps.

Data preparation

We’ll use the demo data set USArrests. We start by standardizing the data using the scale() function:
```
# Load the data set
data(USArrests)

# Standardize the variables
df <- scale(USArrests)
```
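To see what standardization does here, the short check below (a sketch using only base R) confirms that each column of the scaled matrix has mean 0 and standard deviation 1, so that no single variable dominates the Euclidean distances used by k-means. The names `col_means` and `col_sds` are illustrative and not part of the original script.

```r
# Load and standardize the demo data
data(USArrests)
df <- scale(USArrests)

# After scaling, every column should have mean ~0 and standard deviation 1
col_means <- colMeans(df)
col_sds <- apply(df, 2, sd)

print(round(col_means, 10))
print(col_sds)
```

Without this step, a variable measured on a large scale (such as Assault) would contribute far more to the distances than the others.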
Assessing the clusterability

The function get_clust_tendency() [factoextra package] can be used to assess clustering tendency. It computes the Hopkins statistic and provides a visual approach.

```
library("factoextra")

# n = 40 sampled points is an assumed value
res <- get_clust_tendency(df, n = 40, graph = TRUE)

# Hopkins statistic
res$hopkins_stat
## [1] 0.656

# Visualize the dissimilarity matrix
print(res$plot)
```

Estimate the number of clusters in the data

As k-means clustering requires specifying the number of clusters to generate, we’ll use the function clusGap() [cluster package] to compute the gap statistic for estimating the optimal number of clusters. The function fviz_gap_stat() [factoextra] is used to visualize the gap statistic plot.

```
library("cluster")
set.seed(123)

# Compute the gap statistic
# (nstart, K.max and B are assumed values)
gap_stat <- clusGap(df, FUNcluster = kmeans, nstart = 25, K.max = 10, B = 100)

# Plot the result
fviz_gap_stat(gap_stat)
```

The gap statistic suggests a 4-cluster solution.

It’s also possible to use the function NbClust() [in the NbClust package].

Compute k-means clustering

```
# Compute k-means with k = 4
# (nstart = 25 is an assumed value)
set.seed(123)
km.res <- kmeans(df, centers = 4, nstart = 25)

# Cluster assignment of the first 20 states
head(km.res$cluster, 20)
##     Alabama      Alaska     Arizona    Arkansas  California    Colorado
##           4           3           3           4           3           3
## Connecticut    Delaware     Florida     Georgia      Hawaii       Idaho
##           2           2           3           4           2           1
##    Illinois     Indiana        Iowa      Kansas    Kentucky   Louisiana
##           3           2           1           2           1           4
##       Maine    Maryland
##           1           3

# Visualize clusters using factoextra
fviz_cluster(km.res, USArrests)
```

Cluster validation statistics: Inspect cluster silhouette plot

Recall that the silhouette width ($S_i$) measures how similar an object $i$ is to the other objects in its own cluster versus those in the neighbor cluster. $S_i$ values range from 1 to -1:

- A value of $S_i$ close to 1 indicates that the object is well clustered. In other words, the object $i$ is similar to the other objects in its group.
- A value of $S_i$ close to -1 indicates that the object is poorly clustered, and that assignment to some other cluster would probably improve the overall results.
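The definition of $S_i$ above can be checked numerically. The sketch below computes the silhouette width of a single state by hand, using the standard formula $S_i = (b_i - a_i) / \max(a_i, b_i)$, where $a_i$ is the mean distance from $i$ to the other members of its own cluster and $b_i$ is the smallest mean distance from $i$ to the members of any other cluster, and compares it with the value returned by silhouette() [cluster package]. The helper `sil_by_hand()` is hypothetical and written only for this illustration; `nstart = 25` is an assumed value.

```r
library(cluster)

data(USArrests)
df <- scale(USArrests)

set.seed(123)
km <- kmeans(df, centers = 4, nstart = 25)  # k = 4; nstart is an assumption
d <- as.matrix(dist(df))                    # Euclidean distance matrix

# Silhouette width of observation i, computed directly from the definition
# (hypothetical helper; assumes the cluster of i has at least two members)
sil_by_hand <- function(i, cl, d) {
  own <- cl[i]
  same <- setdiff(which(cl == own), i)
  a <- mean(d[i, same])                     # mean distance within own cluster
  others <- setdiff(unique(cl), own)
  # mean distance to each other cluster; the smallest one is the neighbor
  b <- min(sapply(others, function(k) mean(d[i, cl == k])))
  (b - a) / max(a, b)
}

i <- which(rownames(df) == "Alabama")
manual <- sil_by_hand(i, km$cluster, d)
reference <- silhouette(km$cluster, dist(df))[i, "sil_width"]

print(c(manual = manual, reference = reference))
```

The two numbers should agree, which shows that a negative width simply means the object is, on average, closer to the members of its neighbor cluster than to those of its own.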
```
# Compute the silhouette widths from the k-means clusters
sil <- silhouette(km.res$cluster, dist(df))
rownames(sil) <- rownames(USArrests)
head(sil[, 1:3])
##            cluster neighbor sil_width
## Alabama          4        3    0.4858
## Alaska           3        4    0.0583
## Arizona          3        2    0.4155
## Arkansas         4        2    0.1187
## California       3        2    0.4356
## Colorado         3        2    0.3265

fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1   13          0.37
## 2       2   16          0.34
## 3       3   13          0.27
## 4       4    8          0.39
```

It can be seen that some samples have negative silhouette values. Some natural questions are: Which samples are these? To what cluster are they closer?

This can be determined from the output of the function silhouette() as follows:

```
# Rows with a negative silhouette width
neg_sil_index <- which(sil[, "sil_width"] < 0)
sil[neg_sil_index, , drop = FALSE]
##          cluster neighbor sil_width
## Missouri       3        2   -0.0732
```

eclust(): Enhanced clustering analysis

The function eclust() [factoextra package] provides several advantages compared to the standard packages used for clustering analysis:

- It simplifies the workflow of clustering analysis.
- It can be used to compute hierarchical clustering and partitioning clustering in a single function call.
- It automatically computes the gap statistic for estimating the right number of clusters.
- It automatically provides silhouette information.
- It draws beautiful graphs using ggplot2.

K-means clustering using eclust()

```
# Compute k-means
res.km <- eclust(df, "kmeans")

# Gap statistic plot
fviz_gap_stat(res.km$gap_stat)

# Silhouette plot
fviz_silhouette(res.km)
```

Hierarchical clustering using eclust()

```
# Enhanced hierarchical clustering
res.hc <- eclust(df, "hclust")
## Clustering k = 1,2,..., K.max (= 10): .. done
## Bootstrapping, b = 1,2,..., B (= 100)  [one "." per sample]:
## .................................................. 50
## .................................................. 100

fviz_dend(res.hc, rect = TRUE) # dendrogram
```

The R code below generates the silhouette plot and the scatter plot for hierarchical clustering.

```
fviz_silhouette(res.hc) # silhouette plot
fviz_cluster(res.hc) # scatter plot
```
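For comparison with the eclust() call for hierarchical clustering, the same kind of analysis can be sketched with base R only. This is a minimal illustration, not a description of what eclust() does internally: the Ward linkage (`"ward.D2"`), the Euclidean distance, and the choice of k = 4 are all assumptions made for this example.

```r
data(USArrests)
df <- scale(USArrests)

# Agglomerative clustering on Euclidean distances
# (Ward linkage is an assumed choice for this sketch)
hc <- hclust(dist(df), method = "ward.D2")

# Cut the tree into k groups; k = 4 is assumed, mirroring the k-means solution
groups <- cutree(hc, k = 4)

# Number of states in each group
print(table(groups))
```

Calling plot(hc) followed by rect.hclust(hc, k = 4) draws a plain dendrogram with the four groups outlined, a base-graphics counterpart to fviz_dend().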
