Introduction
This article describes how to fit a linear discriminant analysis (LDA) to predict a categorical variable from two or more numeric variables.
Requirements
A data set containing:
 A categorical outcome variable.
 Two or more numeric predictor variables.
This method is only available in Q5.
Method
Usage
To run a Linear Discriminant Analysis:
1. Select Create > Classifier > Linear Discriminant Analysis.
2. Under Inputs > Linear Discriminant Analysis > Outcome select your outcome variable.
3. Under Inputs > Linear Discriminant Analysis > Predictor(s) select your predictor variables.
4. Make any other selections as required.
Ordered categorical predictors are coerced to numeric values. Unordered categorical predictors are converted to binary dummy variables.
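The two conversions described above can be illustrated with a minimal Python sketch. It assumes that ordered categories are mapped to 1, 2, 3, ... in level order and that the first level of an unordered predictor is dropped as the reference category; the software may use different values or a different reference level. The function names and data are hypothetical.

```python
def coerce_ordered(values, levels):
    """Map ordered categories to 1..k by their position in `levels` (assumed coding)."""
    position = {level: i + 1 for i, level in enumerate(levels)}
    return [position[v] for v in values]


def dummy_code(values):
    """Expand an unordered categorical into 0/1 columns.

    The first level (alphabetically) is dropped as the reference; this choice
    is an assumption made for illustration.
    """
    levels = sorted(set(values))
    return {level: [1 if v == level else 0 for v in values] for level in levels[1:]}


# Hypothetical predictors
satisfaction = ["Low", "High", "Medium", "Low"]  # ordered
region = ["North", "South", "West", "South"]     # unordered

satisfaction_numeric = coerce_ordered(satisfaction, ["Low", "Medium", "High"])
region_dummies = dummy_code(region)

print(satisfaction_numeric)
print(region_dummies)
```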
The parameters of the discriminant functions can be extracted with Machine Learning - Diagnostic - Table of Discriminant Function Coefficients.
Example
The table below shows the results of a linear discriminant analysis predicting brand preference based on the attributes of the brand.
Interpretation
 The subtitle (Correct predictions) shows the predictive accuracy of the model, which in this case is extremely poor, at approximately 7%.
 The colored shading shows the differences between the means by group. It shows, for example, that the 1,799 Coca-Cola drinkers in the sample have significantly lower ratings of health-conscious and older (these are the only significant differences when compared to the mean, which is why they are in bold).
 We can also see that there are some significant differences relating to Diet Pepsi.
 The R-Squared column shows the proportion of variance within each row that is explained by the groups; in all cases it is very poor.
 See Analysis of Variance - One-Way MANOVA for more detail on the interpretation of the table.
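As a rough illustration of what the R-Squared column measures, the sketch below computes the proportion of a single predictor's variance explained by the groups, as between-group sum of squares divided by total sum of squares. It assumes unweighted data and this standard definition; the software's exact computation (for example, with weights) may differ. The function name and data are hypothetical.

```python
def r_squared_by_group(values, groups):
    """Proportion of variance in `values` explained by group membership."""
    grand_mean = sum(values) / len(values)
    total_ss = sum((v - grand_mean) ** 2 for v in values)

    # Collect the values belonging to each group
    by_group = {}
    for v, g in zip(values, groups):
        by_group.setdefault(g, []).append(v)

    # Between-group sum of squares: group size times squared distance
    # of the group mean from the grand mean
    between_ss = sum(
        len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2
        for vs in by_group.values()
    )
    return between_ss / total_ss


# Hypothetical ratings for two brands
ratings = [1, 2, 3, 4, 5, 6]
brand = ["A", "A", "A", "B", "B", "B"]

r2 = r_squared_by_group(ratings, brand)
print(round(r2, 3))
```

A value near 0, as in the example table, means the group means barely differ relative to the spread of the ratings.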
There are two reasons why this model is particularly poor:
 The relationship between the predictors and the outcome is weak.
 The Prior is set to Equal, which assumes that the group sizes in the population are equal. In this example, Coca-Cola is by far the biggest group, so the prior causes the predictive accuracy to be poor.
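The effect of the prior can be seen in a small one-predictor sketch. It is not the procedure's implementation: it assumes a single numeric predictor, unweighted data, and the textbook linear discriminant score with a pooled within-group variance. As in this procedure, the priors are used only when classifying, not when fitting. All names and data are hypothetical.

```python
import math


def fit_lda_1d(xs, groups):
    """Estimate group means and the pooled within-group variance."""
    by_group = {}
    for x, g in zip(xs, groups):
        by_group.setdefault(g, []).append(x)
    means = {g: sum(v) / len(v) for g, v in by_group.items()}
    within_ss = sum((x - means[g]) ** 2 for x, g in zip(xs, groups))
    variance = within_ss / (len(xs) - len(by_group))
    return means, variance


def predict(x, means, variance, priors):
    """Return the group with the largest linear discriminant score."""
    def score(g):
        return (math.log(priors[g])
                + x * means[g] / variance
                - means[g] ** 2 / (2 * variance))
    return max(means, key=score)


def share_correct(xs, groups, means, variance, priors):
    """Proportion of cases whose predicted group matches the observed group."""
    hits = sum(predict(x, means, variance, priors) == g for x, g in zip(xs, groups))
    return hits / len(xs)


# Hypothetical data: one large group, one small group, weak separation
xs = [1, 2, 3, 4, 5, 2, 3, 4, 3, 5]
groups = ["Big"] * 8 + ["Small"] * 2

means, variance = fit_lda_1d(xs, groups)
equal = {g: 1 / len(means) for g in means}
observed = {g: groups.count(g) / len(groups) for g in means}

acc_equal = share_correct(xs, groups, means, variance, equal)
acc_observed = share_correct(xs, groups, means, variance, observed)
print(acc_equal, acc_observed)
```

With these made-up numbers, equal priors classify 60% of cases correctly and observed priors 80%, because the observed prior pushes borderline cases toward the larger group.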
Options
The inputs used to generate the Linear Discriminant Analysis are shown below.
Outcome - The variable to be predicted by the predictor variables.
Predictors - The numeric variable(s) used to predict the outcome.
Algorithm - The machine learning algorithm. Defaults to Linear Discriminant Analysis but may be changed to other machine learning methods.
Output

 Means - Produces a table showing the means by category, and assorted statistics to evaluate the LDA.
 Detail - More detailed diagnostics, from the lda function in the R MASS package.
 Prediction-Accuracy Table - Produces a table relating the observed and predicted outcome. This is also known as a confusion matrix.
 Scatterplot - A two-dimensional scatterplot of the group centroids in the space of the first two discriminant function variables. This shows which groups are separated by the first two discriminant function variables. Also plotted are the correlations between the predictor variables and the first two discriminant function variables. The group centroids are scaled to appear on the same scale as the correlations.
 Moonplot - A two-dimensional moonplot, using the same assumptions as the scatterplot.
Outcome color - Color of group centroids in Scatterplot output.
Predictors color - Color of variable correlations in Scatterplot output.
Missing data - See Missing Data Options.
Variable names - Displays variable names in the output instead of labels.
Prior - The prior probabilities used in computing the probabilities of group membership of the Outcome (Machine Learning - Save Variable(s) - Probabilities of Each Response). Note that in the main R function for discriminant analysis (MASS::lda), the priors are also used in fitting the model, so results from this procedure differ from those of the standard R discriminant analysis. This procedure matches the results from SPSS.

 Equal - The prior probabilities are assumed to be equal for each group of the Outcome.
 Observed - Prior computed based on the current (weighted) group sizes. This is the default.
Random seed - Seed used to initialize the (pseudo) random number generator for the model fitting algorithm. Different seeds may lead to slightly different answers, but should normally not make a large difference.
Increase allowed output size - Check this box if you encounter a warning message "The R output had size XXX MB, exceeding the 128 MB limit" and you need to reference the output elsewhere in your document; e.g., to save predicted values to a Data Set or examine diagnostics.
Maximum allowed size for output (MB) - This control only appears if Increase allowed output size is checked. Use it to set the maximum allowed size for the output in megabytes. The warning about the R output size referred to above states the minimum size you need to set in order to return the full output. Note that having many large outputs in one document or page may slow down your document and increase load times.
Weight - Where a weight has been set for the R output, it will be automatically applied when the model is estimated. By default, the weight is assumed to be a sampling weight, and the standard errors are estimated using Taylor series linearization (by contrast, in the Legacy Regression, weight calibration is used). See Weights, Effective Sample Size and Design Effects.
Filter - The data is automatically filtered using any filters applied prior to estimating the model.
Additional options are available by editing the code.
DIAGNOSTICS
Coefficients of discriminant functions - Creates a table of coefficients of linear discriminant functions for each class.
Prediction-Accuracy Table - Creates a table showing the observed and predicted values, as a heatmap.
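To show how a prediction-accuracy table (confusion matrix) relates to the share of correct predictions, here is a minimal sketch. It assumes unweighted counts, with rows as observed categories and columns as predicted categories; the function names and data are hypothetical and are not taken from the software.

```python
from collections import Counter


def confusion_matrix(observed, predicted):
    """Cross-tabulate observed (rows) against predicted (columns) categories."""
    counts = Counter(zip(observed, predicted))
    labels = sorted(set(observed) | set(predicted))
    matrix = [[counts[(o, p)] for p in labels] for o in labels]
    return labels, matrix


def share_correct(matrix):
    """Correct predictions lie on the diagonal of the table."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical observed and predicted categories for six cases
observed = ["A", "A", "B", "B", "C", "C"]
predicted = ["A", "B", "B", "B", "C", "A"]

labels, matrix = confusion_matrix(observed, predicted)
for label, row in zip(labels, matrix):
    print(label, row)
print(share_correct(matrix))
```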
SAVE VARIABLE(S)
Discriminant Variables - Creates a new question containing the discriminant variables.
Predicted Values - Creates a new variable containing predicted values for each case in the data.
Probabilities of Each Response - Creates new variables containing predicted probabilities of each response.
Acknowledgements
The algorithm used for fitting the LDA is a modification of MASS::lda, generalized to accommodate weights. The multcomp package is used to test comparisons (see also Regression - Generalized Linear Model, which describes the models that are used by multcomp). The survey package is used to compute the p-value for each of the variables in Means; a Wald test is used (regTermTest).
More information
See this post for a description of LDA.
See this post for a practical guide on how to run LDA in Displayr.