Cluster analysis is a type of quantitative analysis of categorical or numerical data in which a set of objects is sorted into groups (termed 'clusters') based on similarities in their features, as determined through statistical analysis.
For example, cluster analysis is frequently used in epidemiology to determine the distribution of disease or illness in a population by specific grouped features (e.g. household income, dietary composition). In archaeology it is often used to group artifacts based on specific features (e.g. presence or absence of food residue on cooking vessels).
In effect, cluster analysis is about finding the most useful groupings (called 'clusters') of data within a data set in order to reveal patterns to the researcher that may not have been evident through other means such as descriptive statistics.
For this reason it is considered to be well-suited for exploratory data analysis, which involves uncovering patterns in data sets through statistical techniques.
"Method Made Easy"
Cluster analysis typically requires the use of computer software at all stages.
Cluster analysis is usually done through statistical software packages like SPSS and R, both of which offer a wide variety of cluster models, examples of which include the commonly-used centroid, connectivity, and density models.
Connectivity models: typically used in hierarchical clustering to build models based on the notion that objects are more closely related to other objects nearby than they are to objects further away. Observations are assigned to clusters based on their distances to one another. Clusters and observations are frequently represented using a dendrogram (i.e. a taxonomic tree diagram).
Example of dendrogram from Souza et al (2003)
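The distance-based merging described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses single linkage (the distance between two clusters is the distance between their closest members), which is one of several possible linkage rules, and the function name and sample points are hypothetical. The recorded merges are the information a dendrogram would display.

```python
from itertools import combinations
from math import dist


def single_linkage(points, n_clusters):
    """Merge the two closest clusters until only n_clusters remain."""
    # Start with every observation in its own cluster (stored as index lists).
    clusters = [[i] for i in range(len(points))]
    merges = []  # (members_a, members_b, distance): the data behind a dendrogram

    def gap(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        a, b = min(combinations(clusters, 2), key=lambda pair: gap(*pair))
        merges.append((a, b, gap(a, b)))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(sorted(a + b))
    return sorted(clusters), merges


# Hypothetical observations with two numerical features each.
points = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.1), (5.2, 4.8), (9.0, 1.0)]
clusters, merges = single_linkage(points, n_clusters=3)
print(clusters)
```

In practice a statistical package would handle this step and draw the dendrogram; the sketch is meant only to show what "assigned to clusters based on their distances" amounts to.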
Centroid models: while a number of different centroid models exist, the most frequently used one is the k-means algorithm, in which k initial statistical means are randomly generated, k clusters are created by grouping every object with the nearest mean, and the centroid of each of these clusters is computed to create the new mean. These steps are repeated until the cluster assignments no longer change. The data space is then partitioned into Voronoi cells, which indicate the divisions between clusters.
Voronoi diagram superimposed on a map from Thia (2010)
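The k-means steps just described can be written out as a short Python sketch. It assumes the initial means are drawn at random from the observations themselves (one common choice, not the only one), and the function name, seed, and sample points are hypothetical.

```python
import random
from math import dist


def k_means(points, k, seed=0, max_iter=100):
    """Alternate between grouping objects by nearest mean and updating the means."""
    rng = random.Random(seed)
    # Initial means: k distinct observations chosen at random (an assumption).
    means = rng.sample(points, k)
    for _ in range(max_iter):
        # Assignment step: group every object with its nearest mean.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, means[c]))
            groups[nearest].append(p)
        # Update step: the centroid of each group becomes the new mean.
        new_means = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else means[c]
            for c, g in enumerate(groups)
        ]
        if new_means == means:  # nothing moved, so the assignments are stable
            break
        means = new_means
    return means, groups


# Hypothetical observations forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
means, groups = k_means(points, k=2)
print(means)
```

Because the starting means are random, k-means can settle on different groupings for less clearly separated data, so it is commonly run several times with different seeds.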
Density models: observations are assigned to clusters probabilistically (i.e. using a Gaussian distribution to estimate the likelihood that a data point belongs to a specific cluster). Objects that fall between clusters are discarded as outliers or 'noise'.
Example of density map
Cluster analysis is a mostly automated process undertaken through computer software, which makes it particularly well suited to data mining and the analysis of large data sets, such as those used in epidemiology, bioinformatics, or resource planning (Lu and Liang 2008). Clusters can be represented visually through a number of means, including dendrograms ('tree diagrams'). Clusters can also be easily transposed as layers onto maps, making cluster analysis a useful tool for geographic information systems (GIS).
The quality and usefulness of cluster analysis depend strongly on the definition of clusters and the selection of algorithms by the researcher. Definitions of clusters can therefore be quite arbitrary. Failure to select optimal algorithms or models may result in misleading outcomes. In effect, with cluster analysis, if the input is garbage or if the algorithm used is inappropriate for the data set, then the output will be useless or even misleading.
Cluster analysis is an inherently analytic method. Most commonly, clusters generated from data sets are visualized (through scatterplots, dendrograms, or other types of visualizations). Results may be either internally or externally validated. With internal validation, a number of models appropriate to the data type may be used to compare the differing cluster assignments. Clusters can be further evaluated by assigning a 'best score' to the algorithm that produces clusters with both the highest similarity within and the lowest similarity between clusters.
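One widely used internal score of this kind is the silhouette coefficient, which rewards high similarity within clusters and low similarity between them. The sketch below is a plain-Python illustration, not the scoring rule of any particular package; the function name, sample points, and the two candidate labelings are hypothetical, and it assumes at least two clusters.

```python
from math import dist


def silhouette(points, labels):
    """Mean silhouette coefficient: near 1 is well separated, near -1 is poor."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    def mean_dist(p, members):
        return sum(dist(p, q) for q in members) / len(members)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for single-member clusters
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(mean_dist(p, m) for other, m in clusters.items() if other != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical observations and two competing cluster assignments.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
compact = silhouette(points, [0, 0, 0, 1, 1, 1])
mixed = silhouette(points, [0, 1, 0, 1, 0, 1])
print(compact > mixed)
```

Comparing the scores of several candidate assignments in this way is one route to picking a 'best' model, though a high score does not by itself show that the clusters are meaningful.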
With external validation, results are evaluated against data not included in the analysis. These outside data are typically referred to as 'class labels' or 'external benchmarks' and are often created by human experts (a process that may itself be limited by its arbitrary nature).
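A simple way to compare cluster assignments with class labels is the Rand index, the share of observation pairs that the two groupings treat the same way (both together, or both apart). The sketch below is illustrative only; the function name and the example labels are hypothetical.

```python
from itertools import combinations


def rand_index(predicted, benchmark):
    """Share of observation pairs on which the two labelings agree."""
    pairs = list(combinations(range(len(predicted)), 2))
    agree = sum(
        # A pair agrees if both labelings put it together or both keep it apart.
        (predicted[i] == predicted[j]) == (benchmark[i] == benchmark[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical expert-assigned class labels and two sets of cluster assignments.
benchmark = ["a", "a", "a", "b", "b", "b"]
exact = rand_index([0, 0, 0, 1, 1, 1], benchmark)
partial = rand_index([0, 0, 1, 1, 1, 1], benchmark)
print(exact, partial)
```

Only the grouping matters here, not the names of the clusters, so cluster 0 does not need to correspond to label "a".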
Method in Context
An example of cluster analysis used in conjunction with spatial data would be the use of a k-means model to suggest potential locations for the source, vector, or reservoir of a disease outbreak by plotting all observations spatially and calculating the location of the centroids in order to create a list of likely sites for investigation. This is in fact not entirely dissimilar to John Snow's now-canonical plotting of cholera cases in London to suggest the location of water pumps that might have contributed to the 1854 cholera outbreak.
Locations and other details relating to the disease outbreak would be collected and plotted onto a map of the area under investigation. To determine the locations of potential clusters of disease cases, the data set would be imported into a statistical software package with cluster analysis support (e.g. SPSS or R), and one or more clustering models would then be run to suggest some likely case clusters.
If one is only interested in the clusters themselves, and not the location of these clusters, then a connectivity model can be run and a dendrogram created. If location of potential case clusters is needed, then the locations of the centroids of these clusters could be plotted onto a map, and boundaries between zones can be created through a Voronoi diagram. The resulting map would show not only potential locations of case clusters, but also demarcate zones for intervention to optimize limited public health resources.
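The zoning step can be illustrated with a small Python sketch: once centroids are known, each location belongs to the Voronoi cell of its nearest centroid. The centroid and household coordinates below are hypothetical map coordinates invented for the example, and straight-line distance is assumed.

```python
from math import dist


def assign_zone(location, centroids):
    """Return the index of the nearest centroid, i.e. the Voronoi cell."""
    return min(range(len(centroids)), key=lambda c: dist(location, centroids[c]))


# Hypothetical centroids of three case clusters, in map coordinates.
centroids = [(2.0, 3.0), (7.5, 1.0), (6.0, 8.0)]
# Hypothetical household locations to be allocated to intervention zones.
households = [(1.0, 2.5), (7.0, 2.0), (5.5, 6.0), (4.0, 4.0)]
zones = [assign_zone(h, centroids) for h in households]
print(zones)
```

A GIS package would draw the cell boundaries themselves; the nearest-centroid rule above is what those boundaries encode.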
Online Resources and Further Reading
Souza, J. et al. (2003) A quantum chemical and statistical study of flavonoid compounds (flavones) with anti-HIV activity.
--- example shown in Figure 1 of dendrograms applied to medical research.
Kai Xin, Thia (2011) Catchment area of Junrong General Hospital. Singapore Management University wiki.
--- example shown in Figure 2 (Voronoi diagrams in spatial mapping)
Lu, Yongmei. (2000) Spatial cluster analysis for point data: Location quotients versus kernel density.
Presentation at Information Science Summer Assembly, Portland, Oregon
--- example shown in Figure 3 (density map of vehicle availability in Buffalo, NY)
The R Project
--- open source statistical analysis package similar in functionality to SPSS and other commercial software suites.