Definition

A type of quantitative analysis of categorical or numerical data in which a set of objects is divided into groups (each termed a 'cluster') so that objects in the same cluster are more similar to one another in their features than to objects in other clusters. The groupings are determined through statistical analysis.

For example, cluster analysis is frequently used in epidemiology to determine the distribution of disease or illness in a population by specific grouped features (e.g. household income, dietary composition). In archaeology it is often used to group artifacts based on specific features (e.g. presence or absence of food residue on cooking vessels).

Relevant Characteristics

In effect, cluster analysis is about finding the most useful groupings (called 'clusters') of data within a data set in order to reveal patterns to the researcher that may not have been evident through other means such as descriptive statistics. For this reason it is considered to be well-suited for exploratory data analysis, which involves uncovering patterns in data sets through statistical techniques.

"Method Made Easy"

Cluster analysis typically requires the use of computer software at all stages. It is usually carried out in statistical software packages such as SPSS and R, both of which offer a wide variety of cluster models. Commonly used examples include centroid, connectivity, and density models.

  • Connectivity model: typically used in hierarchical clustering to build models based on the notion that objects are more closely related to other objects nearby than they are to objects further away. Observations are assigned to clusters based on their distances to one another. Clusters and observations are frequently represented using a dendrogram (i.e. a taxonomic tree diagram).


Figure 1. Example of a dendrogram, from Souza et al. (2003)
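
The merging process behind a connectivity model can be sketched in a few lines. The following is a minimal illustration only, assuming Euclidean distance and single linkage (the distance between two clusters is the distance between their closest members). The function name `single_linkage`, the parameter `n_clusters`, and the sample points are hypothetical and do not come from any particular software package. The recorded merges are the information that a dendrogram displays.

```python
from itertools import combinations
from math import dist


def single_linkage(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Returns (clusters, merges). clusters is a list of sorted index lists;
    merges records each merge as (cluster_a, cluster_b, distance).
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Single linkage: distance between the closest pair of members.
            d = min(dist(points[i], points[j])
                    for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return sorted(clusters), merges


# Hypothetical 2-D observations: two tight groups that lie far apart.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0)]
clusters, merges = single_linkage(points, n_clusters=2)
print(clusters)  # indices of the observations in each cluster
```

Other linkage rules (for example, the distance between the farthest members or the average distance) can give different trees for the same data, which is one reason the choice of model matters.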

  • Centroid models: while a number of different centroid models exist, the most frequently used one is the k-means model, in which k initial means are randomly generated for a data set of n observations. Next, k clusters are created by assigning every observation to its nearest mean, and the centroid of each of these clusters is computed to become the new mean. These two steps are repeated until the cluster assignments no longer change. The final means partition the data space into the cells of a Voronoi diagram, whose boundaries indicate the divisions between clusters.

Figure 2. Voronoi diagram superimposed on a map, from Thia (2010)
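
The k-means steps can be sketched as follows. This is a minimal illustration, not the implementation used by SPSS or R. It assumes Euclidean distance and, as one common choice, draws the k initial means from the observations themselves. The names `k_means`, `max_iter`, and `seed`, and the sample points, are hypothetical.

```python
import random
from math import dist


def k_means(points, k, max_iter=100, seed=0):
    """Return (means, labels) after alternating assignment and update steps."""
    rng = random.Random(seed)
    # Assumption: initial means are k observations picked at random.
    means = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each observation joins its nearest mean.
        new_labels = [min(range(k), key=lambda c: dist(p, means[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments no longer change, so stop
        labels = new_labels
        # Update step: each mean moves to the centroid of its cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous mean
                means[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return means, labels


# Hypothetical 2-D observations: two tight groups that lie far apart.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0)]
means, labels = k_means(points, k=2)
print(means, labels)
```

Because the starting means are random, different seeds can lead to different final clusters on less clearly separated data. Running the procedure several times and comparing the results is a common precaution.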


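  • Density models: clusters are defined as regions of the data space in which observations are densely concentrated, separated from one another by sparser regions. (Assigning observations to clusters probabilistically, i.e. using a Gaussian distribution to estimate the likelihood that a data point belongs to a specific cluster, is characteristic of the closely related distribution models.) Objects that fall in the sparse regions between clusters are treated as outliers or 'noise'.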


Figure 3. Example of a density map (vehicle availability in Buffalo, NY), from Lu (2000)
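
One widely used density-based procedure is DBSCAN, which is used here purely as an illustration of how observations between clusters end up labelled as noise. The sketch below is a simplified version under stated assumptions: Euclidean distance, a neighbourhood radius `eps`, and a minimum neighbour count `min_pts`. The function name, both parameter values, and the sample points are hypothetical.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or NOISE for sparse-region points."""
    labels = [None] * len(points)

    def neighbours(i):
        # All points within eps of point i, including i itself.
        return [j for j in range(len(points))
                if dist(points[i], points[j]) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            nearby = neighbours(j)
            if len(nearby) >= min_pts:
                queue.extend(nearby)  # j is a core point, so keep expanding
        cluster_id += 1
    return labels


# Hypothetical observations: two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # the isolated point (5, 5) is labelled NOISE (-1)
```

Unlike k-means, this approach does not need the number of clusters in advance, but the results depend heavily on the chosen radius and neighbour count.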


Advantages
Cluster analysis is a mostly automated process undertaken through computer software, which makes it particularly well suited to data mining and the analysis of large data sets, such as those used in epidemiology, bioinformatics, or resource planning (Lu and Liang 2008). Clusters can be represented visually through a number of means, including dendrograms ('tree diagrams'). Clusters can also be easily overlaid as layers onto maps, making cluster analysis a useful tool for geographic information systems (GIS).

Limitations

The quality and usefulness of cluster analysis depend strongly on how the researcher defines clusters and which algorithms are selected. Because these choices rest with the researcher, definitions of clusters can be quite arbitrary. Failure to select an appropriate algorithm or model may produce misleading outcomes: if the input is garbage, or if the algorithm used is inappropriate for the data set, then the output will be useless or even misleading.

Analysis

Cluster analysis is an inherently analytic method. Most commonly, clusters generated from data sets are visualized (through scatterplots, dendrograms, or other types of visualizations). Results may be either internally or externally validated. With internal validation, a number of models appropriate to the data type may be used to compare the differing cluster assignments. Clusters can be further evaluated by assigning a 'best score' to the algorithm that produces clusters with both the highest similarity within and the lowest similarity between clusters.
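
One way to put a number on "high similarity within and low similarity between clusters" is the silhouette coefficient, offered here as one illustrative internal validation score among several. The sketch below assumes Euclidean distance and at least two clusters. The function name `mean_silhouette` and the sample data are hypothetical.

```python
from math import dist


def mean_silhouette(points, labels):
    """Average silhouette width; values near +1 indicate tight, separate clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for single-member clusters
            continue
        # a: mean distance from p to the other members of its own cluster.
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance from p to the members of the nearest other cluster.
        b = min(sum(dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical observations with two candidate cluster assignments.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0)]
good = mean_silhouette(points, [0, 0, 0, 1, 1])  # matches the visible groups
bad = mean_silhouette(points, [0, 1, 0, 1, 0])   # mixes the two groups
print(good, bad)
```

The assignment with the higher score would be preferred, although a high internal score does not by itself show that the clusters are meaningful for the research question.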

With external validation, results are evaluated against data not included in the analysis. These outside data are typically referred to as 'class labels' or 'external benchmarks' and are often created by human experts, a process that may itself be limited by its arbitrary nature.
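
A simple external validation score is the Rand index, shown here as one possible measure. It counts the share of observation pairs that the clustering and the class labels treat in the same way (both together or both apart). The function name `rand_index` and the labels below are hypothetical, and the sketch assumes at least two observations.

```python
from itertools import combinations


def rand_index(cluster_labels, class_labels):
    """Fraction of observation pairs on which the two labellings agree."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(cluster_labels)), 2):
        same_cluster = cluster_labels[i] == cluster_labels[j]
        same_class = class_labels[i] == class_labels[j]
        if same_cluster == same_class:
            agree += 1
        total += 1
    return agree / total


# Hypothetical expert-assigned class labels for five observations.
class_labels = ["a", "a", "a", "b", "b"]
perfect = rand_index([0, 0, 0, 1, 1], class_labels)
partial = rand_index([0, 0, 1, 1, 1], class_labels)
print(perfect, partial)  # 1.0 0.6
```

The cluster numbers themselves do not need to match the class names, since only the pairing of observations is compared.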

Method in Context

An example of cluster analysis used in conjunction with spatial data would be the use of a k-means model to suggest potential locations for the source, vector, or reservoir of a disease outbreak by plotting all observations spatially and calculating the location of the centroids in order to create a list of likely sites for investigation. This is in fact not entirely dissimilar to John Snow's now-canonical plotting of cholera cases in London to suggest the location of water pumps that might have contributed to the 1854 cholera outbreak.

Locations and other details relating to the disease outbreak would be collected and plotted onto a map of the area under investigation. To determine the locations of potential clusters of disease cases, the data set would be imported into a statistical software package with cluster analysis support (e.g. SPSS or R), and one or more clustering models would then be run to suggest some likely case clusters.

If one is only interested in the clusters themselves, and not the location of these clusters, then a connectivity model can be run and a dendrogram created. If location of potential case clusters is needed, then the locations of the centroids of these clusters could be plotted onto a map, and boundaries between zones can be created through a Voronoi diagram. The resulting map would show not only potential locations of case clusters, but also demarcate zones for intervention to optimize limited public health resources.
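
The zoning step can be sketched without drawing the diagram itself, because a location lies in the Voronoi cell of whichever centroid is nearest to it. The example below assumes planar map coordinates and Euclidean distance. The centroids, case locations, and the function name `voronoi_zone` are hypothetical and are not taken from any real outbreak.

```python
from math import dist


def voronoi_zone(location, centroids):
    """Index of the nearest centroid, i.e. the Voronoi cell containing location."""
    return min(range(len(centroids)),
               key=lambda c: dist(location, centroids[c]))


# Hypothetical centroids from an earlier clustering run (map coordinates).
centroids = [(2.0, 3.0), (8.0, 1.0), (5.0, 9.0)]
# Hypothetical case locations to allocate to intervention zones.
cases = [(1.0, 2.0), (7.5, 2.0), (5.0, 7.0), (4.0, 4.0)]

zones = [voronoi_zone(case, centroids) for case in cases]
cases_per_zone = [zones.count(z) for z in range(len(centroids))]
print(zones)           # [0, 1, 2, 0]
print(cases_per_zone)  # [2, 1, 1]
```

Counting cases per zone in this way gives a rough basis for deciding where limited resources might be directed first.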

Online Resources and Further Reading


Souza, J. et al. (2003) A quantum chemical and statistical study of flavonoid compounds (flavones) with anti-HIV activity. http://www.sciencedirect.com/science/article/pii/S0223523403001429 --- example shown in Figure 1 of dendrograms applied to medical research.

Kai Xin, Thia (2011) Catchment area of Junrong General Hospital. Singapore Management University wiki.
https://wiki.smu.edu.sg/1011t2is415g1/IS415_2010-11_Term2_Assign2_THIA_KAI_XIN --- example shown in Figure 2 (Voronoi diagrams in spatial mapping)

Lu, Yongmei. (2000) "Spatial cluster analysis for point data: Location quotients versus kernel density." Presentation at Information Science Summer Assembly, Portland, Oregon. http://dusk.geo.orst.edu/ucgis/web/oregon/papers/lu.htm --- example shown in Figure 3 (density map of vehicle availability in Buffalo, NY)


The R Project --- open source statistical analysis package similar in functionality to SPSS and other commercial software suites.