Fits a support vector machine for classification or regression.
Requirements
A data set containing an outcome variable and predictor variables to use in the predictive model.
Method
To create a Support Vector Machine:
1. In Q, select Create > Classifier > Support Vector Machine.
2. Under Inputs > Support Vector Machine > Outcome select your outcome variable.
3. Under Inputs > Support Vector Machine > Predictor(s) select your predictor variables.
Examples
Categorical outcome
The table below shows the Accuracy as computed by a Support Vector Machine. The Overall Accuracy is the percentage of instances that are correctly categorized by the model. The accuracies of each individual class are also displayed. In this example, the model is best at correctly predicting Group C.
The Prediction-Accuracy Table gives a more complete picture of the output, showing the number of observed examples for each class that were predicted to be in each class. In this example, 33 instances of Group B are wrongly predicted to be Group A.
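The two outputs described above can be sketched in a few lines of Python. This is a minimal illustration, not the code Q runs: the function names and the toy data are hypothetical, and the per-class accuracy is assumed to be the share of observed instances of a class that were predicted correctly.

```python
from collections import Counter

def prediction_accuracy_table(observed, predicted):
    """Count (observed, predicted) pairs; rows are observed classes."""
    classes = sorted(set(observed) | set(predicted))
    counts = Counter(zip(observed, predicted))
    table = {o: {p: counts[(o, p)] for p in classes} for o in classes}
    return classes, table

def accuracies(observed, predicted):
    """Return overall accuracy and an assumed per-class accuracy."""
    classes, table = prediction_accuracy_table(observed, predicted)
    # Overall accuracy: correctly categorized instances / all instances.
    correct = sum(table[c][c] for c in classes)
    overall = correct / len(observed)
    # Per-class accuracy (assumption): diagonal cell / row total.
    per_class = {}
    for c in classes:
        row_total = sum(table[c].values())
        per_class[c] = table[c][c] / row_total if row_total else float("nan")
    return overall, per_class

# Hypothetical toy data, not the data behind the tables above.
observed = ["A", "A", "B", "B", "B", "C", "C", "C"]
predicted = ["A", "B", "A", "B", "B", "C", "C", "C"]

classes, table = prediction_accuracy_table(observed, predicted)
overall, per_class = accuracies(observed, predicted)
print(overall)          # share of all instances predicted correctly
print(per_class)        # accuracy for each observed class
print(table["B"]["A"])  # Group B instances wrongly predicted as Group A
```

Reading an off-diagonal cell such as table["B"]["A"] is how a statement like "instances of Group B are wrongly predicted to be Group A" is obtained.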
Numeric outcome
The tables below show the Support Vector Machine outputs for a numeric outcome variable. Accuracy displays two measures of performance: Root Mean Square Error (the square root of the average squared difference between the predicted and target outcomes) and R-squared (a measure of the fraction of variation in the data that is explained by the model).
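As a rough sketch of the two measures, the Python below computes them from lists of observed and predicted values. The function names and data are hypothetical, and R-squared is assumed here to be one minus the ratio of residual to total sum of squares; the software may use a different but related definition.

```python
import math

def rmse(observed, predicted):
    """Square root of the average squared difference."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def r_squared(observed, predicted):
    """Assumed definition: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical toy data.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]

error = rmse(observed, predicted)
fit = r_squared(observed, predicted)
print(error, fit)
```

A lower Root Mean Square Error and an R-squared closer to 1 both indicate predictions that sit closer to the target outcomes.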
For a numeric outcome variable, the Prediction-Accuracy Table is generated by bucketing the predicted and target outcomes and indicating when the bucket of a predicted example does or does not match its observed bucket.
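The bucketing idea can be illustrated as follows. The bucketing scheme is an assumption made for this sketch (three equal-width buckets spanning the combined range of observed and predicted values); the buckets the software actually uses are not documented here, and the function names are hypothetical.

```python
from collections import Counter

def bucket_index(value, low, high, n_buckets):
    """Map a value to an equal-width bucket between low and high."""
    if high == low:
        return 0
    width = (high - low) / n_buckets
    idx = int((value - low) / width)
    # Clamp so the maximum value falls in the last bucket.
    return min(max(idx, 0), n_buckets - 1)

def bucketed_table(observed, predicted, n_buckets=3):
    """Rows are observed buckets, columns are predicted buckets."""
    low = min(min(observed), min(predicted))
    high = max(max(observed), max(predicted))
    pairs = Counter(
        (bucket_index(o, low, high, n_buckets),
         bucket_index(p, low, high, n_buckets))
        for o, p in zip(observed, predicted)
    )
    return [[pairs[(r, c)] for c in range(n_buckets)] for r in range(n_buckets)]

# Hypothetical toy data.
observed = [1.0, 2.0, 5.0, 6.0, 9.0, 10.0]
predicted = [1.5, 4.5, 5.5, 6.5, 6.0, 9.5]

table = bucketed_table(observed, predicted)
# Diagonal cells are cases whose predicted bucket matches the observed bucket.
matches = sum(table[i][i] for i in range(len(table)))
for row in table:
    print(row)
print(matches)
```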
Options
Outcome - The variable to be predicted by the predictor variables. It may be either a numeric or categorical variable.
Predictors - The variable(s) to predict the outcome.
Algorithm - The machine learning algorithm. Defaults to Support Vector Machine but may be changed to other machine learning methods.
Output
- Accuracy - Produces measures of the goodness of model fit, as illustrated above.
- Prediction-Accuracy Table - Produces a table relating the observed and predicted outcome. Also known as a confusion matrix.
- Detail - This returns the default output from svm in the e1071 package.
Missing data - See Missing Data Options.
Variable names - Displays variable names in the output instead of labels.
Cost - Controls the extent to which the model correctly predicts the outcome for each training example. Low values of cost maximize the margin between the classes when searching for a separating hyperplane, with the trade-off that certain examples may be misclassified (i.e. lie on the wrong side of the hyperplane). High values of cost result in a smaller margin of separation between the classes and fewer misclassifications. Lowering the cost has the impact of increasing the regularization, which implies higher bias / lower variance and thus controls overfitting. Raising the cost increases the flexibility of the model but for extreme values will decrease the ability to generalize predictions to unseen data. A typical range of cost to explore would be 0.0001 to 10000.
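The trade-off can be made concrete with the standard soft-margin objective for a linear SVM, which adds a margin term to a cost-weighted penalty for examples on or beyond the wrong side of the margin. The sketch below is illustrative only: it assumes the Cost option corresponds to the cost in this textbook formulation, it ignores kernels, and the two candidate hyperplanes and the data are hypothetical.

```python
def soft_margin_objective(w, b, points, labels, cost):
    """0.5 * ||w||^2 + cost * sum of hinge losses (labels are +1 or -1)."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    hinge = 0.0
    for x, y in zip(points, labels):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        hinge += max(0.0, 1.0 - y * score)
    return margin_term + cost * hinge

# Hypothetical data: the third point sits close to the positive class.
points = [(2.0, 0.0), (-2.0, 0.0), (0.5, 0.0)]
labels = [1, -1, -1]

wide = ((0.5, 0.0), 0.0)     # large margin, misclassifies the third point
narrow = ((2.0, 0.0), -2.0)  # small margin, classifies every point correctly

# Candidate costs covering the range 0.0001 to 10000 in powers of ten.
cost_grid = [10.0 ** k for k in range(-4, 5)]

preferred = {}
for cost in cost_grid:
    wide_value = soft_margin_objective(*wide, points, labels, cost)
    narrow_value = soft_margin_objective(*narrow, points, labels, cost)
    preferred[cost] = "wide" if wide_value < narrow_value else "narrow"
    print(cost, preferred[cost])
```

With these numbers the low costs favor the wide-margin hyperplane despite its misclassification, and the high costs favor the narrow-margin hyperplane that fits every training example.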
Random seed - Seed used to initialize the (pseudo) random number generator for the model fitting algorithm. Different seeds may lead to slightly different answers, but should normally not make a large difference.
Increase allowed output size - Check this box if you encounter a warning message "The R output had size XXX MB, exceeding the 128 MB limit..." and you need to reference the output elsewhere in your document; e.g., to save predicted values to a Data Set or examine diagnostics.
Maximum allowed size for output (MB) - This control only appears if Increase allowed output size is checked. Use it to set the maximum allowed size for the output in megabytes. The warning referred to above about the R output size states the minimum size you need to set to return the full output. Note that having many large outputs in one document or page may slow down your document and increase load times.
Weight - Where a weight has been set for the R output, a new data set is generated via resampling and this new data set is used in the estimation.
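A minimal sketch of weight-based resampling is shown below. The exact scheme the software uses is not described here, so this assumes sampling with replacement, with probability proportional to each case's weight, to produce a data set of the same size; the function name, seed value, and data are hypothetical.

```python
import random

def resample_by_weight(rows, weights, seed=123):
    """Draw len(rows) cases with replacement, proportional to weight."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    return rng.choices(rows, weights=weights, k=len(rows))

# Hypothetical cases and weights; a zero weight means the case is never drawn.
rows = ["case1", "case2", "case3", "case4"]
weights = [1.0, 2.0, 0.0, 1.0]

resampled = resample_by_weight(rows, weights)
print(resampled)
```

Because the resampling is random, a fixed seed is what makes the resulting data set, and therefore the fitted model, repeatable.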
Filter - Any filters that have been set are automatically applied to the data before the model is estimated.
DIAGNOSTICS
Prediction-Accuracy Table - Creates a table showing the observed and predicted values, as a heatmap.
SAVE VARIABLE(S)
Predicted Values - Creates a new variable containing predicted values for each case in the data.
Probabilities of Each Response - Creates new variables containing predicted probabilities of each response.
Acknowledgments
Uses the svm algorithm from the e1071 package.
Cortes, C.; Vapnik, V. (1995). "Support-vector networks". Machine Learning. 20 (3): 273–297.
More information
This blog post explains the concept of support vector machines.
The process of determining the cost parameter is described here.