This article describes how to do a driver analysis in Q and create the outputs. Driver analysis, which is also known as key driver analysis, importance analysis, and relative importance analysis, uses the data from questions to work out the relative importance of each of the predictor variables in predicting the outcome variable. Various driver analysis methods are available; see How to Select the Regression Type for Driver Analysis. Our webinar and eBook provide more in-depth information on the considerations and steps involved in driver analysis; this article covers the same material at a higher level.
Requirements
- A data set containing the outcome and predictor variables that you want to use as inputs to the driver analysis.
- Predictor variables are numeric or binary (Number, Number-Multi, Number-Grid)
Before you begin, take some time to review the 10 steps of driver analysis outlined in the eBook that you will work through while performing driver analysis:
- Check that driver analysis is appropriate (see Chapter 1).
- Transform the predictors so that they are either numeric or binary (see Chapter 2 of the eBook, and the Variable Sets article for how to confirm and modify this).
- Recode, reorder, and merge categories so that they are ordered from lowest to highest (see Chapter 3).
- Often driver analysis is performed using data for multiple brands at the same time. Traditionally, this is addressed by creating a new data file that stacks the data from each brand on top of each other (see What is Data Stacking?). If you want to do this or have other repeated measures data, choose the Stack data option (see Chapter 4).
- Choose the appropriate Regression type (see Chapter 5).
- If you have numeric data, choose Output as Shapley Regression (see Chapter 6 - Use Shapley Regression or Johnson’s Relative Weights).
- Choose the appropriate Missing data option (see Chapter 7).
- Review the various technical diagnostics, such as testing for outliers and heteroscedasticity (see Chapter 8).
- Review the p-values and standard errors (see Chapter 9).
- Present the results in the best way (see Chapter 10).
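The stacking mentioned in the steps above can be illustrated outside Q. The following is a minimal Python sketch of the idea, not Q's implementation: the `stack_by_brand` helper, the `<measure>_<brand>` column naming, and the example data are all hypothetical and chosen only for illustration.

```python
def stack_by_brand(rows, brands, measures, id_key="id"):
    """Turn one row per respondent into one row per respondent-brand pair."""
    stacked = []
    for row in rows:
        for brand in brands:
            record = {id_key: row[id_key], "brand": brand}
            for measure in measures:
                # Assumption: wide columns are named "<measure>_<brand>".
                record[measure] = row[f"{measure}_{brand}"]
            stacked.append(record)
    return stacked


# Hypothetical wide data: each respondent rates two brands.
wide = [
    {"id": 1, "nps_A": 9, "price_A": 4, "nps_B": 6, "price_B": 2},
    {"id": 2, "nps_A": 7, "price_A": 3, "nps_B": 8, "price_B": 5},
]

long_rows = stack_by_brand(wide, brands=["A", "B"], measures=["nps", "price"])
for record in long_rows:
    print(record)
```

After stacking, each respondent contributes one case per brand, so a single outcome column (`nps` here) and a single set of predictor columns (`price` here) cover all brands at once.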
Method
- Load a data set that contains the variables that you will use as inputs for the driver analysis.
- From the toolbar, select Create > Regression > Driver Analysis.
- Decide whether you want to stack your data using the Stack data option. See How To Stack Data for Driver Analysis for how to do this and which variables to pick for steps 4 and 5 below.
- From the object inspector > Inputs, select the Outcome variable.
- Select the Predictor(s) variables.
- Select the Algorithm > Regression.
- Select the Regression type from the drop-down (see: How to Select the Regression Type for Driver Analysis).
- Select the Output in the object inspector.
- Summary: intercept, unstandardized regression coefficients, t-values, and significance tests.
- Detail: Typical R output, with some additional information compared to Summary but without the formatted presentation.
- ANOVA: Analysis of variance table containing the results of Chi-squared likelihood ratio tests for each predictor.
- Relative Importance Analysis: The results of a relative importance analysis. See here and the references for more information. This option is not available for Multinomial Logit. Note that categorical predictors are not converted to be numeric, unlike in Driver (Importance) Analysis - Relative Importance Analysis.
- Correlation: This method is appropriate when you are unconcerned about correlations between predictor variables. It computes the relative importance of the predictor variables against the outcome variable via the bivariate Pearson product moment correlations. See Driver (Importance) Analysis - Correlation and references therein for more information.
- Jaccard Coefficient: Note that Jaccard Coefficient is only available when Regression type is set to Linear and requires binary variables for the outcome variable and the predictor variables. It computes the relative importance of the predictor variables against the outcome variable with the Jaccard Coefficients. See Driver (Importance) Analysis - Jaccard Coefficient.
- Shapley Regression: Note that Shapley Regression is only available when Regression type is set to Linear. This is a regularized regression, designed for situations where linear regression results are unreliable due to high correlations between predictors. See here and the references for more information. Note that categorical predictors are not converted to be numeric, unlike in Driver (Importance) Analysis - Shapley.
- Effects Plot: Plots the relationship between each of the Predictors and the Outcome. Not available for Multinomial Logit.
- By default, all the driver analysis methods exclude all cases with missing data from their analysis (this occurs after any stacking has been performed). However, there are additional Missing data options that can be relevant. You can read more about the missing data options in Chapter 7 of the eBook.
- If using Correlation, Jaccard Coefficient, or Linear Regression, you can select Use partial data (pairwise correlations), in which case the data is analyzed using all the available data. Even when not all the predictors have data, partial information is used for each case.
- If using Shapley Regression, Johnson's Relative Weights (Relative Importance Analysis), or any of the GLMs and quasi-GLMs, Multiple imputation can be used. This is generally the best method for dealing with missing data, except for situations where Dummy variable adjustment is appropriate.
- If using Shapley Regression, Johnson's Relative Weights (Relative Importance Analysis), or any of the GLMs and quasi-GLMs, Dummy variable adjustment can be used. This method is appropriate when the data is missing because it cannot exist. For example, if the predictors are ratings of satisfaction with a bank's call centers, branches, and website, and data is missing for people who have not used one of these, then this setting is appropriate. By contrast, if the data is missing because the person didn't feel like providing an answer, Multiple imputation is preferable.
- Select Correction. This is the multiple comparisons correction applied when computing the p-values of the post-hoc comparisons. The default is None.
- OPTIONAL: Check Variable names. This will display variable names in the output instead of labels.
- OPTIONAL: Check Absolute importance scores. This determines whether the absolute value of Relative Importance Analysis scores should be displayed.
These calculations assume that all predictors have a positive effect (conceptually, it’s similar to the idea that a variable can only have a positive impact on R-square). However, unless you check this box, it is possible for you to get a negative sign instead. When we estimate standard ordered logit models within each of the sub-groups, we use this to determine the sign. It’s really just a diagnostic to warn you that the assumption of positive effect may be inappropriate.
- OPTIONAL: Select Auxiliary variables, which are the variables to be used (in addition to all the other variables in the model) when Multiple imputation is selected for Missing data.
- OPTIONAL: Check Robust standard errors, which computes standard errors that are robust to violations of the assumption of constant variance (i.e., heteroscedasticity). See Robust Standard Errors. This is only available when Regression type is Linear.
- OPTIONAL: Select a categorical variable in Crosstab interaction. The result is a crosstab that shows the importance scores by each unique value of the categorical variable, with bold showing significant differences and color-coding showing relativities. Coefficients in the table are computed by creating separate regressions for each level of the interaction variable. To evaluate whether a coefficient is significantly higher (blue) or lower (red), we perform a t-test of the coefficient compared to the coefficient using the remaining data. P-values are corrected for multiple comparisons across the whole table (excluding the NET column). The P-value in the sub-title is calculated using a likelihood ratio test between the pooled model with no interaction variable and a model where all predictors interact with the interaction variable.
- OPTIONAL: Update Automated outlier removal percentage. See How To Automatically Remove Outliers from Regression and GLMs for important details on this feature.
- OPTIONAL: Adjust Random seed. Seed used to initialize the (pseudo)random number generator for the model fitting algorithm. Different seeds may lead to slightly different answers, but should normally not make a large difference.
- OPTIONAL: Apply a filter if you want to run the driver analysis for a specific subgroup. The data is filtered prior to estimating the model.
- OPTIONAL: Select a weight if you want the input variables weighted. It will automatically be applied when the model is estimated. By default, the weight is assumed to be a sampling weight, and the standard errors are estimated using Taylor series linearization. See Weights, Effective Sample Size and Design Effects.
- OPTIONAL: Tick Increase allowed output size if you encounter a warning message "The R output had size XXX MB, exceeding the 128 MB limit..." and you need to reference the output elsewhere in your document; e.g., to save predicted values to a Data Set or examine diagnostics.
- OPTIONAL: Maximum allowed size for output (MB). This control only appears if Increase allowed output size is checked. Use it to set the maximum allowed size for the regression output in megabytes. The warning referred to above about the R output size will state the minimum size you need to set in order to return the full output. Note that having many large outputs in one document or page may slow down the performance of your document and increase load times.
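To make the Correlation and Jaccard Coefficient outputs in the list above more concrete, here is a minimal Python sketch of the two underlying statistics. It is an illustration only, not Q's code: the function names and example data are hypothetical, and rescaling the absolute scores to sum to 100 is an assumption about presentation rather than a documented detail.

```python
from math import sqrt


def pearson(x, y):
    """Bivariate Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def jaccard(x, y):
    """Jaccard coefficient of two binary (0/1) lists: |both 1| / |either 1|."""
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    return both / either if either else 0.0


def rescale_to_100(scores):
    """Assumption: express absolute scores as shares that sum to 100."""
    total = sum(abs(s) for s in scores.values())
    return {name: 100 * abs(s) / total for name, s in scores.items()}


# Hypothetical binary data: one outcome and two predictors, six cases.
outcome = [1, 1, 0, 1, 0, 0]
predictors = {
    "fast": [1, 1, 0, 1, 0, 1],
    "cheap": [1, 0, 0, 0, 1, 0],
}

correlations = {name: pearson(x, outcome) for name, x in predictors.items()}
jaccards = {name: jaccard(x, outcome) for name, x in predictors.items()}

print(correlations)
print(jaccards)
print(rescale_to_100(jaccards))
```

Both statistics look at each predictor on its own against the outcome, which is why these outputs suit situations where correlations between the predictors are not a concern.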
Furthermore, you can review various diagnostics of the driver analysis in the DIAGNOSTICS section of the object inspector, including reviewing diagnostic plots. See details of how to run these in the Next section below and further details in the eBook.
Technical Details
There are two main approaches offered to determine the importance of variables in a driver analysis: Shapley regression and Relative Importance Analysis. Both techniques decompose the contribution that each predictor variable (driver) makes towards the outcome variable. Shapley regression applies only to linear regression. It fits every possible linear regression model that can be built from subsets of the predictors and uses these to compute the contribution of each predictor variable to the R-square statistic. Relative importance analysis takes a different approach: it uses an orthogonal representation of the predictor variables and reconciles this with the original variables to determine their contribution. For further details, please read the Technical Details section of Regression - Driver Analysis.
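The exhaustive-search idea behind Shapley regression can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used by Q: the helper names and data are hypothetical, the OLS fit uses plain normal equations, and the predictors are assumed not to be perfectly collinear. With p predictors it fits on the order of 2^p models, so it is only practical for a small number of predictors.

```python
from itertools import combinations
from math import factorial


def r_squared(columns, y):
    """R-square of an OLS fit of y on the given columns plus an intercept."""
    n = len(y)
    if not columns:
        return 0.0  # intercept-only model explains no variance
    y_mean = sum(y) / n
    total_ss = sum((v - y_mean) ** 2 for v in y)
    # Design matrix with a leading column of ones for the intercept.
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Augmented normal equations [X'X | X'y], solved by Gauss-Jordan elimination.
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    for i in range(k):
        pivot = max(range(i, k), key=lambda r: abs(A[r][i]))
        A[i], A[pivot] = A[pivot], A[i]
        for r in range(k):
            if r != i:
                factor = A[r][i] / A[i][i]
                A[r] = [a - factor * b for a, b in zip(A[r], A[i])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    resid_ss = sum((y[r] - sum(b * x for b, x in zip(beta, X[r]))) ** 2
                   for r in range(n))
    return 1.0 - resid_ss / total_ss


def shapley_importance(predictors, y):
    """Average each predictor's gain in R-square over all subsets of the others."""
    names = list(predictors)
    p = len(names)
    scores = {}
    for name in names:
        others = [m for m in names if m != name]
        value = 0.0
        for size in range(p):
            # Standard Shapley weight for a subset of this size.
            weight = factorial(size) * factorial(p - size - 1) / factorial(p)
            for subset in combinations(others, size):
                base = [predictors[m] for m in subset]
                gain = r_squared(base + [predictors[name]], y) - r_squared(base, y)
                value += weight * gain
        scores[name] = value
    return scores


# Hypothetical data: three numeric predictors and one outcome, eight cases.
predictors = {
    "service": [1, 2, 3, 4, 5, 6, 7, 8],
    "price": [2, 1, 4, 3, 6, 5, 8, 7],
    "speed": [5, 3, 6, 2, 7, 1, 8, 4],
}
outcome = [3, 2, 6, 4, 8, 5, 10, 7]

scores = shapley_importance(predictors, outcome)
full_r2 = r_squared(list(predictors.values()), outcome)
print(scores)
print(full_r2)
```

The scores add up to the R-square of the model containing all predictors, which is what makes them interpretable as a decomposition of explained variance.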
Next
How to Run and Interpret Shapley Regression
How to Select the Regression Type for Driver Analysis
How To Automatically Remove Outliers from Regression and GLMs