Introduction
Boosting is a method for combining a series of simple individual models to create a more powerful model. An initial model (either tree or linear regression) is fitted to the data. A second model is then built that focuses on accurately predicting for the cases where the first model performs poorly relative to its target outcomes. The combination of these two models is expected to be better than either model alone. The process is then repeated with each successive model attempting to correct for the shortcomings of the combined ensemble of all previous models.
The best possible next model, when combined with previous models, minimizes the overall prediction error. The key idea behind gradient boosting is to set the target outcomes for this next model in order to minimize the error. The target outcome for each case depends on how much a change in that case's prediction impacts the overall prediction error.
If a small increase in the prediction causes a large drop in error for a case, then the next target outcome is a high value. This means that if the new model predicts close to its target, then the error is reduced.
If a small increase in the prediction causes no change in error for a case, then the next target outcome is zero, because changing this prediction does not reduce the error.
The name gradient boosting arises because target outcomes are set based on the gradient of the error with respect to the prediction of each case. Each new model takes a step in the direction that minimizes prediction error in the space of predictions for each training case.
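The residual-fitting loop described above can be sketched from scratch. The snippet below is a minimal illustration for squared-error loss, where the negative gradient of the error with respect to each prediction is simply the residual (target minus prediction), so each new model is fit to the residuals. It is an illustration only, not how the xgboost package is implemented:

```python
# Minimal gradient boosting sketch for squared-error loss (illustration only).
# Each round fits a one-split "stump" to the residuals (the negative gradient)
# and adds a damped version of it to the running prediction.

def fit_stump(x, targets):
    """Fit a one-split stump: pick the threshold on x that minimizes
    squared error, predicting the mean target on each side."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, targets) if xi <= t]
        right = [r for xi, r in zip(x, targets) if xi > t]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - ml) ** 2 for r in left)
               + sum((r - mr) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, ml, mr)
    _, t, ml, mr = best
    return lambda xi: ml if xi <= t else mr

def boost(x, y, n_rounds=50, learning_rate=0.3):
    pred = [0.0] * len(y)
    models = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]  # negative gradient
        stump = fit_stump(x, residuals)
        models.append(stump)
        pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]
    return lambda xi: sum(learning_rate * m(xi) for m in models)

x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 4.1, 3.9, 4.2]
model = boost(x, y)
# After boosting, predictions track the two plateaus in y closely.
```

Each individual stump is a weak model; the ensemble of fifty damped stumps approximates the data well, which is the essence of the boosting idea.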
Requirements
A data set containing an outcome variable and predictor variables to use in the predictive model.
Method
To run a Gradient Boosting model:
1. Select Create > Classifier > Gradient Boosting.
2. Under Inputs > Gradient Boosting > Outcome select your outcome variable.
3. Under Inputs > Gradient Boosting > Predictor(s) select your predictor variables.
4. Make any other selections as required.
Example
With inputs set as follows:
The chart below shows the relative importance of the predictor variables. The most important variable has an importance of 1. Note that categorical predictors with more than 2 levels are split into individual binary variables. The variables are grouped into clusters of similar importance.
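A gain-based importance score of the kind shown in the chart can be sketched as follows. This is an assumption about the general mechanism, not xgboost's exact computation: the drop in squared error achieved by splitting on a feature is credited to that feature, and the totals are normalized so the top feature scores 1.

```python
# Sketch of a gain-based importance score (illustrative; xgboost's actual
# gain accounting differs in detail). Error reduction from the best single
# split on each feature is the feature's "gain"; gains are normalized so
# the most important feature has importance 1.

def stump_gain(col, y):
    """Best squared-error reduction achievable by one split on this column."""
    mean_y = sum(y) / len(y)
    base = sum((yi - mean_y) ** 2 for yi in y)
    best = 0.0
    for t in sorted(set(col)):
        left = [yi for c, yi in zip(col, y) if c <= t]
        right = [yi for c, yi in zip(col, y) if c > t]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        err = (sum((yi - ml) ** 2 for yi in left)
               + sum((yi - mr) ** 2 for yi in right))
        best = max(best, base - err)
    return best

# x1 separates the outcome cleanly; x2 is nearly noise.
x1 = [1, 1, 1, 2, 2, 2]
x2 = [5, 9, 6, 7, 5, 8]
y = [0.9, 1.1, 1.0, 3.9, 4.1, 4.0]

gains = {"x1": stump_gain(x1, y), "x2": stump_gain(x2, y)}
top = max(gains.values())
importance = {k: g / top for k, g in gains.items()}
# importance["x1"] is 1.0; importance["x2"] is much smaller.
```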
Options
Outcome  The variable to be predicted by the predictor variables. It may be either a numeric or categorical variable.
Predictors  The variable(s) to predict the outcome.
Algorithm  The machine learning algorithm. Defaults to Gradient Boosting but may be changed to other machine learning methods.
Output

 Accuracy  Produces measures of the goodness of model fit. For categorical outcomes the breakdown by category is shown.
 Importance  Produces a chart showing the importance of the predictors in determining the outcome. Only available for gbtree booster.
 Prediction-Accuracy Table  Produces a table relating the observed and predicted outcomes. Also known as a confusion matrix.
 Detail  Text output from the underlying xgboost package.
Missing data  See Missing Data Options.
Variable names  Displays variable names in the output instead of labels.
Booster  The underlying model to be boosted. Choice between gbtree and gblinear.
Grid search  Whether to search the parameter space in order to tune the model. If not checked, the default parameters of xgboost are used. Checking this option will usually produce a more accurate predictor, at the cost of a longer run time.
Random seed  Seed used to initialize the (pseudo)random number generator for the model fitting algorithm. Different seeds may lead to slightly different answers, but should normally not make a large difference.
Increase allowed output size  Check this box if you encounter a warning message "The R output had size XXX MB, exceeding the 128 MB limit..." and you need to reference the output elsewhere in your document; e.g., to save predicted values to a Data Set or examine diagnostics.
Maximum allowed size for output (MB)  This control only appears if Increase allowed output size is checked. Use it to set the maximum allowed size for the output in megabytes. The warning about the R output size referred to above states the minimum size you need to allow in order to return the full output. Note that having very many large outputs in one document or page may slow down the performance of your document and increase load times.
Weight  Where a weight has been set for the R output, a new data set is generated via resampling, and this new data set is used in the estimation.
Filter  The data is automatically filtered using any filters prior to estimating the model.
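The Grid search option above amounts to refitting the model under each combination of candidate parameters, scoring each fit on held-out data, and keeping the best combination. A minimal sketch of that idea, using a toy stump booster rather than xgboost's actual tuner (the parameter grid and data here are illustrative assumptions):

```python
# Sketch of a parameter grid search (illustration only): fit once per
# parameter combination, score on held-out cases, keep the best.
import itertools

def boost(xs, ys, rounds, lr):
    """Boosted one-split stumps for squared error; returns a predictor."""
    pred = [0.0] * len(ys)
    stumps = []
    for _ in range(rounds):
        res = [y - p for y, p in zip(ys, pred)]  # residuals
        best = None
        for t in sorted(set(xs))[:-1]:           # skip max so both sides stay non-empty
            l = [r for x, r in zip(xs, res) if x <= t]
            r_ = [r for x, r in zip(xs, res) if x > t]
            ml, mr = sum(l) / len(l), sum(r_) / len(r_)
            err = (sum((v - ml) ** 2 for v in l)
                   + sum((v - mr) ** 2 for v in r_))
            if best is None or err < best[0]:
                best = (err, t, ml, mr)
        _, t, ml, mr = best
        stumps.append((t, ml, mr))
        pred = [p + lr * (ml if x <= t else mr) for x, p in zip(xs, pred)]
    return lambda x: sum(lr * (ml if x <= t else mr) for t, ml, mr in stumps)

train_x, train_y = [1, 2, 3, 4, 5, 6], [1.0, 1.1, 0.9, 3.9, 4.1, 4.0]
hold_x, hold_y = [1.5, 5.5], [1.0, 4.0]

grid = itertools.product([5, 20, 80], [0.1, 0.3])  # rounds x learning rate
best_params, best_err = None, float("inf")
for rounds, lr in grid:
    model = boost(train_x, train_y, rounds, lr)
    err = sum((model(x) - y) ** 2 for x, y in zip(hold_x, hold_y))
    if err < best_err:
        best_params, best_err = (rounds, lr), err
```

This is why grid search takes longer to run: the model is refit once for every cell of the grid.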
DIAGNOSTICS
Prediction-Accuracy Table  Creates a table showing the observed and predicted values, as a heatmap.
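A table of this kind can be sketched as a simple cross-tabulation of observed against predicted categories (illustrative data):

```python
# Sketch of a prediction-accuracy (confusion) table: counts of each
# observed/predicted category pair; accuracy is the diagonal share.
from collections import Counter

observed = ["a", "a", "b", "b", "b", "a"]
predicted = ["a", "b", "b", "b", "a", "a"]

table = Counter(zip(observed, predicted))  # e.g. table[("a", "a")] == 2
accuracy = sum(n for (o, p), n in table.items() if o == p) / len(observed)
```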
SAVE VARIABLE(S)
Predicted Values  Creates a new variable containing predicted values for each case in the data.
Probabilities of Each Response  Creates new variables containing predicted probabilities of each response.
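For a categorical outcome, the saved probabilities come from mapping the model's raw per-category scores onto values that sum to 1, and the saved predicted value is the highest-probability category. The sketch below uses a softmax for that mapping; this is an assumption about the general mechanism, not xgboost's exact code, and the scores shown are hypothetical:

```python
# Sketch: map raw per-category boosted scores to probabilities (softmax),
# then take the highest-probability category as the predicted value.
import math

def probabilities(scores):
    """Softmax: exponentiate each score and normalize to sum to 1."""
    exps = {c: math.exp(s) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}

raw = {"churn": 1.2, "stay": -0.4}     # hypothetical scores for one case
probs = probabilities(raw)             # saved as Probabilities of Each Response
predicted = max(probs, key=probs.get)  # saved as the Predicted Value
```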
Acknowledgments
Uses the xgboost algorithm from the xgboost package by Tianqi Chen.
More information
This blog post explains gradient boosting.
This post describes an example of predicting customer churn.