Introduction
Boosting is a method for combining a series of simple individual models to create a more powerful model. An initial model (either a tree or a linear regression) is fitted to the data. A second model is then built that focuses on accurately predicting the cases where the first model performs poorly relative to its target outcomes. The combination of these two models is expected to be better than either model alone. The process is then repeated, with each successive model attempting to correct for the shortcomings of the combined ensemble of all previous models.
The best possible next model, when combined with previous models, minimizes the overall prediction error. The key idea behind gradient boosting is to set the target outcomes for this next model in order to minimize the error. The target outcome for each case depends on how much a change in that case's prediction impacts the overall prediction error.
If a small increase in the prediction causes a large drop in error for a case, then the next target outcome is a high value. This means that if the new model predicts close to its target, then the error is reduced.
If a small increase in the prediction causes no change in error for a case, then the next target outcome is zero, because changing this prediction does not decrease the error.
The name gradient boosting arises because target outcomes are set based on the gradient of the error with respect to the prediction of each case. Each new model takes a step in the direction that minimizes prediction error in the space of predictions for each training case.
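As a concrete illustration, for a numeric outcome with squared-error loss the negative gradient for each case is simply its residual (observed minus predicted), so each new model is fitted to the current residuals. The sketch below is a minimal, illustrative R implementation of this idea using small rpart trees; the function name, variable names, and settings are made up for the example and are not what the Gradient Boosting feature uses internally.

    # Minimal gradient boosting sketch for a numeric outcome with squared-error loss.
    # The negative gradient is the residual, so each new tree is fitted to the
    # residuals of the current ensemble. Illustrative only.
    library(rpart)

    boost <- function(x, y, n.trees = 100, shrinkage = 0.1) {
      pred <- rep(mean(y), length(y))            # initial model: the overall mean
      trees <- vector("list", n.trees)
      for (i in seq_len(n.trees)) {
        residual <- y - pred                     # negative gradient of 0.5 * (y - pred)^2
        dat <- data.frame(x = x, residual = residual)
        trees[[i]] <- rpart(residual ~ x, data = dat,
                            control = rpart.control(maxdepth = 2))
        pred <- pred + shrinkage * predict(trees[[i]])   # small step that reduces the error
      }
      list(trees = trees, fitted = pred)
    }

    fit <- boost(mtcars$hp, mtcars$mpg)          # toy example on a built-in data set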
Requirements
A data set containing an outcome variable and the predictor variables to use in the predictive model.
Method
To run a Gradient Boosting model:
1. Select Create > Classifier > Gradient Boosting.
2. Under Inputs > Gradient Boosting > Outcome select your outcome variable.
3. Under Inputs > Gradient Boosting > Predictor(s) select your predictor variables.
4. Make any other selections as required.
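These menu selections fit the model for you; no code is required. For readers who want to see roughly what happens under the hood, the sketch below fits a comparable model by calling the underlying xgboost package directly. The data set (mtcars), variables, and settings are purely illustrative and are not the defaults used by the menu option.

    # Illustrative only: fitting a boosted model directly with the xgboost package.
    library(xgboost)

    X <- as.matrix(mtcars[, c("wt", "hp", "disp")])   # predictors as a numeric matrix
    y <- mtcars$mpg                                   # numeric outcome

    dtrain <- xgb.DMatrix(data = X, label = y)
    fit <- xgb.train(params = list(objective = "reg:squarederror"),  # squared-error loss
                     data = dtrain,
                     nrounds = 50)                    # number of boosting iterations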
Example
With inputs set as follows:
The chart below shows the relative importance of the predictor variables. The most important variable has an importance of 1. Note that categorical predictors with more than 2 levels are split into individual binary variables. The variables are grouped into clusters of similar importance.
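If you want to inspect importances programmatically rather than from the chart, the xgboost package provides xgb.importance for tree boosters. The sketch below is illustrative only and assumes a model fitted as in the Method section above.

    # Illustrative only: extracting variable importance from a tree-based xgboost model.
    library(xgboost)

    X <- as.matrix(mtcars[, c("wt", "hp", "disp")])
    dtrain <- xgb.DMatrix(data = X, label = mtcars$mpg)
    fit <- xgb.train(params = list(objective = "reg:squarederror"),
                     data = dtrain, nrounds = 50)

    imp <- xgb.importance(model = fit)   # one row per predictor, with Gain, Cover, Frequency
    xgb.plot.importance(imp)             # bar chart of relative importance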
Options
Outcome - The variable to be predicted by the predictor variables. It may be either a numeric or categorical variable.
Predictors - The variable(s) used to predict the outcome.
Algorithm - The machine learning algorithm. Defaults to Gradient Boosting but may be changed to other machine learning methods.
Output - The type of output to display:
- Accuracy - Produces measures of the goodness of model fit. For categorical outcomes the breakdown by category is shown.
- Importance - Produces a chart showing the importance of the predictors in determining the outcome. Only available for the gbtree booster.
- Prediction-Accuracy Table - Produces a table relating the observed and predicted outcome. Also known as a confusion matrix.
- Detail - Text output from the underlying xgboost package.
Missing data - See Missing Data Options.
Variable names - Displays variable names in the output instead of labels.
Booster - The underlying model to be boosted. Choice between gbtree and gblinear; a sketch of how this maps to the xgboost booster parameter follows these options.
Grid search - Whether to search the parameter space in order to tune the model. If not checked, the default parameters of xgboost are used. Searching will usually produce a more accurate predictor, at the cost of taking longer to run.
Random seed - Seed used to initialize the (pseudo)random number generator for the model fitting algorithm. Different seeds may lead to slightly different answers, but should normally not make a large difference.
Increase allowed output size - Check this box if you encounter a warning message "The R output had size XXX MB, exceeding the 128 MB limit..." and you need to reference the output elsewhere in your document; e.g., to save predicted values to a Data Set or examine diagnostics.
Maximum allowed size for output (MB) - This control only appears if Increase allowed output size is checked. Use it to set the maximum allowed size for the regression output in megabytes. The warning referred to above about the R output size will state the minimum size you need to increase to in order to return the full output. Note that having many large outputs in one document or page may slow down the performance of your document and increase load times.
Weight - Where a weight has been set for the R output, a new data set is generated via resampling, and this new data set is used in the estimation.
Filter - The data is automatically filtered using any applied filters prior to estimating the model.
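As noted for the Booster and Random seed options above, the sketch below illustrates how those choices correspond to parameters of the underlying xgboost package. The data and settings are made up for the example; in the R package the seed is controlled with set.seed rather than an xgboost parameter.

    # Illustrative only: choosing the booster and fixing the random seed.
    library(xgboost)
    set.seed(123)                                    # reproducible model fitting

    X <- as.matrix(mtcars[, c("wt", "hp", "disp")])
    dtrain <- xgb.DMatrix(data = X, label = mtcars$mpg)

    fit <- xgb.train(params = list(booster = "gblinear",          # or "gbtree" (the default)
                                   objective = "reg:squarederror"),
                     data = dtrain, nrounds = 50)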
DIAGNOSTICS
Prediction-Accuracy Table - Creates a table showing the observed and predicted values, as a heatmap.
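The feature produces this table for you; the sketch below simply illustrates the underlying idea of a confusion matrix, using made-up observed and predicted labels.

    # Illustrative only: a prediction-accuracy (confusion) table for a categorical outcome.
    observed  <- factor(c("Yes", "No", "Yes", "No", "Yes", "No"))
    predicted <- factor(c("Yes", "No", "No",  "No", "Yes", "Yes"))

    table(Observed = observed, Predicted = predicted)   # rows = observed, columns = predicted
    mean(observed == predicted)                         # overall accuracy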
SAVE VARIABLE(S)
Predicted Values - Creates a new variable containing predicted values for each case in the data.
Probabilities of Each Response - Creates new variables containing predicted probabilities of each response.
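These options compute the values for you within the document. For reference, the sketch below shows how comparable predicted probabilities and predicted values could be obtained directly from an xgboost model for a binary outcome; the data, outcome definition, and cut-off are illustrative.

    # Illustrative only: predicted probabilities and predicted values for a binary outcome.
    library(xgboost)

    X <- as.matrix(mtcars[, c("wt", "hp")])
    y <- as.numeric(mtcars$mpg > 20)                     # made-up binary outcome
    dtrain <- xgb.DMatrix(data = X, label = y)

    fit <- xgb.train(params = list(objective = "binary:logistic"),
                     data = dtrain, nrounds = 25)

    prob <- predict(fit, X)                              # probability of the positive class
    pred <- ifelse(prob > 0.5, "Yes", "No")              # predicted values at a 0.5 cut-off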
Acknowledgments
Uses the xgboost algorithm from the xgboost package by Tianqi Chen.
More information
This blog post explains gradient boosting.
This post describes an example of predicting customer churn.
Next