This article describes how to fit a linear discriminant analysis (LDA) model that predicts a categorical variable from two or more numeric variables.
Requirements
A data set containing:
- A categorical outcome variable.
- Two or more numeric predictor variables.
This method is only available in Q5.
Method
- Select Create > Classifier > Linear Discriminant Analysis.
- Under Inputs > Linear Discriminant Analysis > Outcome select your outcome variable.
- Under Inputs > Linear Discriminant Analysis > Predictor(s) select your predictor variables.
- Make any other selections as required.
Ordered categorical predictors are coerced to numeric values. Un-ordered categorical predictors are converted to binary dummy variables.
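The coercion described above can be illustrated with a minimal Python sketch. The data and the function names (encode_ordered, encode_unordered) are hypothetical. Treating the first category as a reference level with no dummy column is an assumption made for illustration; the coding the software applies internally may differ.

```python
def encode_ordered(values, levels):
    """Map ordered categories to 1, 2, 3, ... following the given level order."""
    rank = {level: i + 1 for i, level in enumerate(levels)}
    return [rank[v] for v in values]


def encode_unordered(values):
    """Create one 0/1 dummy column per category.

    Assumption: the first category (alphabetically) is the reference
    level and gets no column of its own.
    """
    levels = sorted(set(values))
    return {level: [1 if v == level else 0 for v in values] for level in levels[1:]}


# Hypothetical data, for illustration only.
satisfaction = ["Low", "High", "Medium", "Low"]
region = ["North", "South", "East", "North"]

satisfaction_numeric = encode_ordered(satisfaction, ["Low", "Medium", "High"])
region_dummies = encode_unordered(region)

print(satisfaction_numeric)  # [1, 3, 2, 1]
print(region_dummies)        # {'North': [1, 0, 0, 1], 'South': [0, 1, 0, 0]}
```

With three regions, two dummy columns are enough to identify every case, which is why one level is dropped in this sketch.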
The parameters of the discriminant functions can be extracted with Machine Learning - Diagnostic - Table of Discriminant Function Coefficients.
Example
The table below shows the results of a linear discriminant analysis predicting brand preference based on the attributes of the brand.
Interpretation
- The sub-title (Correct predictions) shows the predictive accuracy of the model, which in this case is extremely poor, at approximately 7%.
- The colored shading shows the differences between the means by group. It shows, for example, that the 1,799 Coca-Cola drinkers in the sample have significantly lower ratings of health-conscious and older (these are the only significant differences when compared to the mean, which is why they are in bold).
- We can also see that there are some significant differences relating to Diet Pepsi.
- The R-Squared column shows the proportion of variance within each row that is explained by the groups; in all cases, it is very poor.
- See Analysis of Variance - One-Way MANOVA for more detail on the interpretation of the table.
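As a rough illustration of the R-Squared column, the proportion of a predictor's variance explained by the groups can be computed as the between-group sum of squares divided by the total sum of squares. The sketch below is unweighted, uses hypothetical data, and is not necessarily the exact calculation performed by the software.

```python
def r_squared_by_group(values, groups):
    """Share of the variance in `values` explained by group membership."""
    grand_mean = sum(values) / len(values)
    total_ss = sum((v - grand_mean) ** 2 for v in values)

    # Collect the values belonging to each group.
    by_group = {}
    for v, g in zip(values, groups):
        by_group.setdefault(g, []).append(v)

    # Between-group sum of squares: group size times the squared distance
    # of the group mean from the grand mean.
    between_ss = sum(
        len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2
        for vs in by_group.values()
    )
    return between_ss / total_ss


# Hypothetical ratings for two well-separated groups.
ratings = [1, 2, 3, 7, 8, 9]
brand = ["A", "A", "A", "B", "B", "B"]

r2 = r_squared_by_group(ratings, brand)
print(round(r2, 3))  # 0.931
```

A value near 1 means the group means are far apart relative to the spread within groups; values near 0, as in the example table, mean the groups barely differ on that predictor.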
There are two reasons why this model is particularly poor:
- The relationship between the predictors and the outcome is weak.
- The Prior is set to Equal, which assumes that the group sizes in the population are equal. In this example, Coca-Cola is by far the biggest group, so the equal prior reduces the predictive accuracy.
Options
The inputs used to generate the Linear Discriminant Analysis are shown below.
Outcome - The variable to be predicted by the predictor variables.
Predictors - The numeric variable(s) to predict the outcome.
Algorithm - The machine learning algorithm. Defaults to Linear Discriminant Analysis but may be changed to other machine learning methods.
Output
- Means - Produces a table showing the means by category, and assorted statistics to evaluate the LDA.
- Detail - More detailed diagnostics, from the lda function in the R MASS package.
- Prediction-Accuracy Table - Produces a table relating the observed and predicted outcome. This is also known as a confusion matrix.
- Scatterplot - A two-dimensional scatterplot of the group centroids in the space of the first two discriminant function variables. This shows which groups are separated by the first two discriminant function variables. Also plotted are the correlations between the predictor variables and the first two discriminant function variables. The group centroids are scaled to appear on the same scale as the correlations.
- Moonplot - A two-dimensional moonplot, using the same assumptions as the scatterplot.
Outcome color - Color of group centroids in Scatterplot output.
Predictors color - Color of variable correlations in Scatterplot output.
Missing data - See Missing Data Options.
Variable names - Displays variable names in the output instead of labels.
Prior - The prior probabilities used in computing the probabilities of group membership of the Outcome (Machine Learning - Save Variable(s) - Probabilities of Each Response). Note that in the main R function for discriminant analysis (MASS::lda), the priors are also used in fitting the model, so the results of a standard R discriminant analysis differ from the results of this procedure. This procedure matches the results from SPSS.
- Equal - The prior probabilities are assumed to be equal for each group of the Outcome.
- Observed - Prior computed based on the current (weighted) group sizes. This is the default.
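The effect of the two Prior settings on group-membership probabilities can be sketched in Python as follows. This is an illustration of Bayes' rule only, not the procedure's actual implementation. The function names, the group sizes, and the likelihood values are hypothetical.

```python
def equal_priors(groups):
    """Equal: every group of the outcome gets the same prior probability."""
    levels = sorted(set(groups))
    return {g: 1.0 / len(levels) for g in levels}


def observed_priors(groups, weights=None):
    """Observed: priors proportional to the (weighted) group sizes."""
    weights = weights or [1.0] * len(groups)
    totals = {}
    for g, w in zip(groups, weights):
        totals[g] = totals.get(g, 0.0) + w
    grand_total = sum(totals.values())
    return {g: t / grand_total for g, t in totals.items()}


def posterior(likelihoods, priors):
    """Bayes' rule: posterior is proportional to likelihood times prior."""
    unnormalized = {g: likelihoods[g] * priors[g] for g in likelihoods}
    total = sum(unnormalized.values())
    return {g: u / total for g, u in unnormalized.items()}


# Hypothetical sample in which one group is four times larger than the other.
outcome = ["Coca-Cola"] * 8 + ["Diet Pepsi"] * 2

# Hypothetical likelihoods of one respondent's predictor values in each group.
likelihoods = {"Coca-Cola": 0.3, "Diet Pepsi": 0.6}

post_equal = posterior(likelihoods, equal_priors(outcome))
post_observed = posterior(likelihoods, observed_priors(outcome))

print(max(post_equal, key=post_equal.get))        # Diet Pepsi
print(max(post_observed, key=post_observed.get))  # Coca-Cola
```

In this sketch the same respondent is assigned to a different group depending on the prior, which is how a prior that does not match the real group sizes can reduce predictive accuracy.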
Random seed - Seed used to initialize the (pseudo) random number generator for the model fitting algorithm. Different seeds may lead to slightly different answers, but should normally not make a large difference.
Increase allowed output size - Check this box if you encounter a warning message "The R output had size XXX MB, exceeding the 128 MB limit" and you need to reference the output elsewhere in your document; e.g., to save predicted values to a Data Set or examine diagnostics.
Maximum allowed size for output (MB) - This control only appears if Increase allowed output size is checked. Use it to set the maximum allowed size for the output in megabytes. The warning about the R output size referred to above states the minimum size you need to set to return the full output. Note that having many large outputs in one document or page may slow down your document and increase load times.
Weight - Where a weight has been set for the R output, it will be automatically applied when the model is estimated. By default, the weight is assumed to be a sampling weight, and the standard errors are estimated using Taylor series linearization (by contrast, in the Legacy Regression, weight calibration is used). See Weights, Effective Sample Size and Design Effects.
Filter - The data is automatically filtered using any filters applied prior to estimating the model.
Additional options are available by editing the code.
DIAGNOSTICS
Coefficients of discriminant functions - Creates a table of coefficients of linear discriminant functions for each class.
Prediction-Accuracy Table - Creates a table showing the observed and predicted values, as a heatmap.
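A prediction-accuracy table and the share of correct predictions can be built from observed and predicted values along the following lines. This is a minimal, unweighted Python sketch with hypothetical data and a hypothetical function name.

```python
from collections import Counter


def prediction_accuracy_table(observed, predicted):
    """Return (table, accuracy), where table[o][p] counts the cases
    observed in group o and predicted to be in group p."""
    counts = Counter(zip(observed, predicted))
    labels = sorted(set(observed) | set(predicted))
    table = {o: {p: counts[(o, p)] for p in labels} for o in labels}
    # Correct predictions sit on the diagonal of the table.
    correct = sum(counts[(label, label)] for label in labels)
    return table, correct / len(observed)


# Hypothetical observed and predicted groups for five cases.
observed = ["A", "A", "B", "B", "C"]
predicted = ["A", "B", "B", "B", "A"]

table, accuracy = prediction_accuracy_table(observed, predicted)

for row_label, row in table.items():
    print(row_label, row)
print(accuracy)  # 0.6
```

Rows are observed groups and columns are predicted groups, so off-diagonal cells show which groups the model confuses with one another.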
SAVE VARIABLE(S)
Discriminant Variables - Creates a new question containing the discriminant variables.
Predicted Values - Creates a new variable containing predicted values for each case in the data.
Probabilities of Each Response - Creates new variables containing predicted probabilities of each response.
Acknowledgments
The algorithm used for fitting the LDA is a modification of MASS::lda, generalized to accommodate weights. The multcomp package is used to test comparisons (see also Regression - Generalized Linear Model, which describes the models that are used by multcomp). The survey package is used to compute the p-value for each of the variables in Means; a Wald test is used (regTermTest).
More information
See this post for a description of LDA.
See this post for a practical guide on how to run LDA in Displayr.