Introduction
A Classification and Regression Tree (CART) is a predictive model that explains how an outcome variable's values can be predicted from the values of other variables. The CART output is a decision tree in which each fork is a split on a predictor variable and each end node contains a prediction for the outcome variable.
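As an illustration of how a fitted tree turns predictor values into a prediction, here is a minimal sketch in Python. The tree, its split points, and the "Cola A"/"Cola B" predictions are hypothetical and hand-written for the example; they are not output from this feature, and the nested-dictionary layout is only an assumption made for readability.

```python
# Hypothetical hand-written tree; splits and predictions are made up for illustration.
tree = {
    "predictor": "Age",
    "threshold": 35,                 # cases with Age < 35 go left, otherwise right
    "left": {"prediction": "Cola A"},
    "right": {
        "predictor": "Gender",
        "categories": {"Female"},    # cases whose category is in this set go left
        "left": {"prediction": "Cola B"},
        "right": {"prediction": "Cola A"},
    },
}


def predict(node, case):
    """Follow the splits from the root until an end node is reached."""
    while "prediction" not in node:
        value = case[node["predictor"]]
        if "threshold" in node:
            go_left = value < node["threshold"]      # numeric split
        else:
            go_left = value in node["categories"]    # categorical split
        node = node["left"] if go_left else node["right"]
    return node["prediction"]


print(predict(tree, {"Age": 42, "Gender": "Female"}))  # prints "Cola B"
```

Each fork tests a single predictor, and the case ends up in exactly one end node, whose stored value is the prediction.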
Requirements
A data set containing an outcome variable and the predictor variables to be used in the predictive model.
Method
Example
Select Create > Classifier > Machine Learning > Classification and Regression Trees (CART).
An interactive tree created with the Sankey output option, with 'Preferred Cola' as the Outcome variable and 'Age', 'Gender' and 'Exercise Frequency' as the Predictor variables.
Options
Outcome - Variable to be predicted.
Predictors - Variables which will be considered as predictors of the Outcome. Predictors that are considered to be uninformative will be automatically excluded from the model.
Algorithm - The machine learning algorithm. Defaults to CART but may be changed to other machine learning methods.
Output - How the tree should be displayed. The choices are:
- Sankey: An interactive tree. This is the default.
- Tree: A greyscale tree plot.
- Text: A text representation of the tree.
- Prediction-Accuracy Table: A table relating the observed and predicted outcomes, also known as a confusion matrix.
- Cross Validation: A plot of the cross-validation accuracy versus the size of the tree in terms of the number of leaves.
Missing data - Method for dealing with missing data. See Missing Data Options.
Pruning - The type of post-pruning applied to the tree. Choices are:
- Minimum error: Prune back leaf nodes to create the tree with the smallest cross validation error.
- Smallest tree: Prune to create the smallest tree with cross validation error at most 1 standard error greater than the minimum error.
- None: Retain the tree as it has been built. Note that choosing this option without Early stopping is prone to overfitting.
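The difference between the Minimum error and Smallest tree choices can be sketched as follows. The cross-validation figures are hypothetical, and the sketch assumes that the "1 standard error" is the standard error of the minimum-error tree; it illustrates the selection rule only, not the software's internal code.

```python
# Hypothetical cross-validation results: (number of leaves, CV error, standard error).
cv_results = [
    (1, 0.60, 0.03),
    (2, 0.45, 0.03),
    (4, 0.375, 0.02),
    (7, 0.36, 0.02),
    (12, 0.37, 0.02),
]


def minimum_error(results):
    """Pick the tree size with the smallest cross-validation error."""
    return min(results, key=lambda r: r[1])


def smallest_tree(results):
    """Pick the smallest tree whose error is within 1 SE of the minimum error."""
    _, best_error, best_se = minimum_error(results)
    eligible = [r for r in results if r[1] <= best_error + best_se]
    return min(eligible, key=lambda r: r[0])


print(minimum_error(cv_results)[0])  # 7 leaves
print(smallest_tree(cv_results)[0])  # 4 leaves
```

With these numbers the 4-leaf tree has a slightly higher error than the 7-leaf tree but falls within one standard error of it, so Smallest tree prefers the simpler model.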
Early stopping - Whether to stop splitting nodes before the fit stops improving. Setting this may decrease the time to build the tree, potentially at the cost of not finding the tree with the best accuracy. For more detail, see the blog post on the Pruning and Early stopping options listed under More information.
Variable names - Displays variable names in the output.
Predictor category labels - Whether to shorten category labels from categorical predictor variables. The choices are:
- Full labels: The complete labels.
- Abbreviated labels: Labels that have been shortened by taking the first few letters from each word.
- Letters: Letters from the alphabet where "a" corresponds to the first category, "b" to the second category, and so on.
Outcome category labels - Same as predictor category labels but for the outcome variable.
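The two shortened label styles can be approximated with the sketch below. The number of letters kept per word is not stated above, so `n_letters=3` is an assumption, as are the function names and the example categories.

```python
import string


def abbreviate(label, n_letters=3):
    """Keep the first n_letters of each word (n_letters=3 is an assumption)."""
    return " ".join(word[:n_letters] for word in label.split())


def letter_labels(categories):
    """Map the first category to "a", the second to "b", and so on."""
    # zip stops at 26 categories; this sketch does not handle longer lists.
    return dict(zip(categories, string.ascii_lowercase))


categories = ["Every day", "Once a week", "Never"]  # hypothetical categories
print([abbreviate(c) for c in categories])  # ['Eve day', 'Onc a wee', 'Nev']
print(letter_labels(categories))
```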
Allow long-running calculations - A categorical predictor with m categories can require evaluating up to 2^(m - 1) - 1 candidate splits, which may cause calculations to run for a long time. Checking this box allows categorical variables with more than 30 categories to be included in Predictors.
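To see why the number of categories matters, the sketch below enumerates every way of dividing a set of categories into two non-empty groups. Once the trivial split with an empty group is excluded, there are 2^(m - 1) - 1 such splits, so the count roughly doubles with each extra category. This is a brute-force illustration only; the function name is hypothetical and the underlying software may search more efficiently.

```python
from itertools import combinations


def binary_splits(categories):
    """Yield each way of dividing the categories into two non-empty groups."""
    first, rest = categories[0], categories[1:]
    # Fixing the first category on the left avoids counting mirror images twice.
    for size in range(len(rest) + 1):
        for extra in combinations(rest, size):
            left = (first,) + extra
            right = tuple(c for c in rest if c not in extra)
            if right:  # skip the split that sends every category left
                yield left, right


for m in (3, 5, 10):
    count = sum(1 for _ in binary_splits(list(range(m))))
    print(m, count, 2 ** (m - 1) - 1)  # the two counts agree

print(2 ** (30 - 1) - 1)  # 536870911 candidate splits for 30 categories
```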
Random seed - Seed used to initialize the (pseudo) random number generator for the model fitting algorithm. Different seeds may lead to slightly different answers, but should normally not make a large difference.
Increase allowed output size - Check this box if you encounter a warning message "The R output had size XXX MB, exceeding the 128 MB limit..." and you need to reference the output elsewhere in your document; e.g., to save predicted values to a Data Set or examine diagnostics.
Maximum allowed size for output (MB) - This control only appears if Increase allowed output size is checked. Use it to set the maximum allowed size for the output in megabytes. The warning about the R output size referred to above states the minimum size you need to set to return the full output. Note that having many large outputs in one project may slow down the overall performance of your document and increase load times.
DIAGNOSTICS
Prediction-Accuracy Table - Creates a table showing the observed and predicted values, as a heatmap.
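The table underlying this diagnostic is a cross-tabulation of observed against predicted values. A minimal sketch, using hypothetical observed and predicted values and a hypothetical function name, could look like this:

```python
from collections import Counter


def prediction_accuracy_table(observed, predicted):
    """Cross-tabulate observed (rows) against predicted (columns) values."""
    counts = Counter(zip(observed, predicted))
    labels = sorted(set(observed) | set(predicted))
    table = {o: {p: counts[(o, p)] for p in labels} for o in labels}
    # Overall accuracy is the share of cases on the diagonal.
    accuracy = sum(counts[(label, label)] for label in labels) / len(observed)
    return table, accuracy


# Hypothetical data, not taken from any real data set.
observed = ["Cola A", "Cola A", "Cola B", "Cola B", "Cola A"]
predicted = ["Cola A", "Cola B", "Cola B", "Cola B", "Cola A"]

table, accuracy = prediction_accuracy_table(observed, predicted)
print(table)
print(accuracy)  # 0.8
```

Cells on the diagonal count correct predictions; off-diagonal cells show which outcomes are confused with each other.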
SAVE VARIABLE(S)
Predicted Values - Creates a new variable containing predicted values for each case in the data.
Probabilities of Each Response - Creates new variables containing predicted probabilities of each response.
Acknowledgements
Uses the R packages rpart and partykit.
More information
For an introduction to decision trees, see this blog post.
This blog post discusses the Pruning and Early stopping options.
This post describes how trees are built.