How to Do a K-Means Cluster Analysis

Introduction

This article describes how to do a K-Means cluster analysis in Q. The K-Means cluster analysis algorithm is a method for grouping similar cases into groups, or clusters. The final clusters will be different from each other, while the cases within a cluster are broadly similar to each other.

Requirements

A data set containing the variables that you want to use as inputs to the cluster analysis segmentation. The variables need to be Numeric or Dichotomous.

Method

From the toolbar menu, select Create > Segments > Cluster > K-Means Cluster Analysis.
From the object inspector on the right, select the inputs (clustering variables) from the Variables drop-down in the Inputs section. For this example, we'll select the 11 behavioral/attitudinal statements on mobile technology. The questions are dichotomous (Selected/Not Selected).
You can also use any other numeric variables as clustering variables if they are likely to differentiate between respondents and therefore help define the clusters.
Note that if the variables are grouped in a Question, you can select the Question instead, which is more convenient than selecting multiple variables.
Enter the number of clusters that you want to create in the Number of clusters text box. We've selected 3 clusters for this example, but you can choose any value you want here.
Optional: Modify any of the other input settings as desired. For this example, we'll leave the default values selected.
Click the Calculate button (or tick the Automatic checkbox so that the analysis will re-run automatically if any changes are made).

The following output is generated:

Interpreting the Results

The standard table of means output shown above lists each of the clustering variables in the rows and shows the mean Top 2 Box percentage for each of the clusters.

The size of each cluster (n) is shown in the column header.
The red and blue highlights indicate whether the Top 2 Box score is higher (blue) or lower (red) than the overall mean. The colors are also scaled to provide additional differentiation: darker shades of red or blue are farther from the mean.
Means in bold font are significantly higher/lower than the mean score.
The R-Squared value shows the proportion of variance in the cluster assignment that is explained by each of the clustering variables. In the example above, we can see that 4 statements have a greater impact on the segment/cluster predictions than the remaining variables.
The p-value shows which statement variables are significant in the model.
Where weights are provided, the percentages show weighted data but the n does not.
The Variance Explained is a multivariate R-squared statistic, which is sometimes known as omega-squared in the cluster analysis literature.
The Calinski-Harabasz statistic can be useful when selecting the number of segments (higher is better). However, it should not be relied upon as the ultimate arbiter of the number of segments, as it is not particularly scientific.

Saving Cluster Membership

Individual respondents can be assigned to the individual clusters in Displayr by first selecting the K-Means Cluster Analysis output and then selecting Inputs > Save Variable(s) > Cluster Membership. A new categorical variable called "Segment/Cluster memberships from r.output" is added to the top of the data set. Locate the new variable in the Variables and Output tab and hover over it to preview the respondent-level membership data, or drag the variable onto the page to create a table.

This segment/cluster variable can be used for profiling against your demographic variables. Once you've identified the key differences between your clusters, try to come up with names that describe each cluster. You can then add these names to the cluster variable by first selecting the variable in the Variables and Output tab, clicking the button, and entering the cluster names in the Label column. Click OK to save the cluster names.
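
If you export the saved membership variable, the profiling described above amounts to a simple crosstab. The following is a rough Python sketch of that idea, not a description of what Q does internally; the `profile` function, the `age_group` variable, and all values are hypothetical.

```python
from collections import Counter


def profile(memberships, demographic):
    """Percentage of each demographic category within each cluster."""
    counts = Counter(zip(memberships, demographic))
    sizes = Counter(memberships)
    return {
        (cluster, category): 100.0 * n / sizes[cluster]
        for (cluster, category), n in counts.items()
    }


# Invented example: cluster membership and an age band for 9 respondents.
memberships = [1, 1, 1, 2, 2, 3, 3, 3, 3]
age_group = ["18-34", "18-34", "35+", "35+", "35+", "18-34", "35+", "35+", "35+"]

table = profile(memberships, age_group)
for (cluster, category), pct in sorted(table.items()):
    print(f"Cluster {cluster}, {category}: {pct:.1f}%")
```

Large differences between clusters in such a table are the kind of pattern that can suggest descriptive cluster names.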

Technical Details

The Batch algorithm works as follows:

The Hartigan-Wong k-means algorithm is used to find clusters, with Missing data set to Exclude cases with missing data (that is, only cases with complete data are used at this stage).
Cases are assigned to the most similar cluster. Where Missing data is set to use partial data (the default), this means that cases that were ignored by Hartigan-Wong are now included in the analysis.
The cluster centers are updated. Where weights have been applied, this means that the cluster centers now reflect weights (they were ignored by Hartigan-Wong).
The previous two steps are repeated until either the maximum number of iterations, iter.max (which defaults to 100), has been exceeded, or the Omega-Squared does not increase.
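
The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not Q's implementation: step 1 uses a plain Lloyd-style k-means on complete cases as a stand-in for Hartigan-Wong, the partial-data distance is taken to be the mean squared difference over the observed variables, and "Omega-Squared does not increase" is approximated by "the weighted within-cluster sum of squares does not decrease". All function names are hypothetical, and missing values are represented by None.

```python
def distance(row, center):
    """Mean squared difference over the variables observed in `row`.

    Assumption of this sketch; cases with no observed values are not handled.
    """
    pairs = [(x, c) for x, c in zip(row, center) if x is not None]
    return sum((x - c) ** 2 for x, c in pairs) / len(pairs)


def assign(data, centers):
    """Assign each case to its most similar (nearest) cluster center."""
    return [min(range(len(centers)), key=lambda j: distance(row, centers[j]))
            for row in data]


def update_centers(data, labels, weights, centers):
    """Weighted mean of the observed values per cluster and variable."""
    new_centers = []
    for j, old in enumerate(centers):
        center = []
        for v in range(len(old)):
            obs = [(row[v], w) for row, lab, w in zip(data, labels, weights)
                   if lab == j and row[v] is not None]
            total_w = sum(w for _, w in obs)
            # Keep the old value if nothing is observed for this cell.
            center.append(sum(x * w for x, w in obs) / total_w if total_w else old[v])
        new_centers.append(center)
    return new_centers


def within_ss(data, labels, weights, centers):
    """Weighted within-cluster sum of squares over observed values."""
    return sum(w * (x - c) ** 2
               for row, lab, w in zip(data, labels, weights)
               for x, c in zip(row, centers[lab]) if x is not None)


def batch_kmeans(data, weights, k, iter_max=100):
    # Step 1 (stand-in for Hartigan-Wong): unweighted k-means on complete cases,
    # seeded with the first k distinct complete cases.
    complete = [row for row in data if None not in row]
    centers = []
    for row in complete:
        if list(row) not in centers:
            centers.append(list(row))
        if len(centers) == k:
            break
    if len(centers) < k:
        raise ValueError("need at least k distinct complete cases")
    ones = [1.0] * len(complete)
    labels = None
    for _ in range(iter_max):
        new_labels = assign(complete, centers)
        if new_labels == labels:
            break
        labels = new_labels
        centers = update_centers(complete, labels, ones, centers)

    # Steps 2-4: include partial cases and weights, repeat until no improvement.
    best = float("inf")
    for _ in range(iter_max):
        labels = assign(data, centers)                                # step 2
        new_centers = update_centers(data, labels, weights, centers)  # step 3
        wss = within_ss(data, labels, weights, new_centers)
        if wss >= best:                                               # step 4
            break
        best, centers = wss, new_centers
    return assign(data, centers), centers


# Invented example: two cases have a missing value (None).
data = [
    [1, 1, 0, 0], [1, 1, 0, 0], [1, None, 0, 0],
    [0, 0, 1, 1], [0, 0, 1, 1], [0, 0, None, 1],
]
weights = [1.0, 2.0, 1.0, 1.0, 0.5, 1.5]
labels, centers = batch_kmeans(data, weights, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

In this toy example the two cases with a missing value are left out of step 1 and then assigned in step 2 using only the variables they have data for.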

How to Use the Segments Option in versions of Q prior to Q5.0

How To Standardize/Normalize Variables When Creating Segments

How to Assign Respondents to Clusters/Segments in a New Data File
