This article describes how to run Shapley Regression in Q. Shapley Regression is also known as Shapley Value Regression and is a leading method for driver analysis. It calculates the importance of different predictors in explaining an outcome variable and is prized for its ability to address multicollinearity.
Requirements
A data set containing variables you want to use as inputs in the driver analysis. Often driver analysis is performed using data for multiple brands at the same time. Traditionally this is addressed by creating a new data file that stacks the data from each brand on top of each other (see What is Data Stacking?). However, when performing driver analysis in Displayr, the data can be automatically stacked.
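To make the idea of stacking concrete, here is a minimal sketch in plain Python. The column names (`nps_A`, `coverage_A`, and so on), the brand labels, and the `stack` helper are hypothetical and exist only for illustration. This is not how the software stacks data internally; it only shows the shape of the result.

```python
# Hypothetical wide-format survey data: one row per respondent,
# with one column per brand and measure (all names are illustrative).
wide_rows = [
    {"id": 1, "nps_A": 9, "coverage_A": 5, "nps_B": 6, "coverage_B": 3},
    {"id": 2, "nps_A": 7, "coverage_A": 4, "nps_B": 8, "coverage_B": 5},
]
brands = ["A", "B"]
measures = ["nps", "coverage"]


def stack(rows, brands, measures):
    """Return one row per respondent-brand pair, with brand blocks stacked."""
    stacked = []
    for brand in brands:
        for row in rows:
            record = {"id": row["id"], "brand": brand}
            for measure in measures:
                # Pull the brand-specific column into a shared column name.
                record[measure] = row[f"{measure}_{brand}"]
            stacked.append(record)
    return stacked


stacked_rows = stack(wide_rows, brands, measures)
for record in stacked_rows:
    print(record)
```

After stacking, each measure is a single column (here `nps` and `coverage`), so one outcome and one set of predictors can be analyzed across all brands at once.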
Method
To compute Shapley Regression in Displayr:
- Go to Create > Regression > Driver Analysis.
- In the object inspector on the right of the screen, select the Outcome and Predictor(s). These should be structured as follows (see Question Types for how to confirm and modify):
  - As numeric (e.g. Numeric or Number - Multi in the Variables and Questions tab).
  - With higher levels of performance/satisfaction coded as higher numbers/values (this isn't a technical requirement, but it makes interpretation easier).
- Change Output to Shapley Regression.
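- The model output is a table that lists, for each predictor, an Importance score and a Raw score, with the R-squared statistic shown in the footer.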
To interpret this output, start with the first column, which shows the estimated Importance of each driver. Here, 'Network Coverage' is the most important. The absolute values of the importance scores add to 100.
Note the negative value for 'Cancel your subscription/plan'. This is a special feature of our Shapley Regression: in the background, a traditional linear regression is run, and the signs of its coefficients are applied to the Shapley importance scores as a way of alerting the user to the possibility that some of the effects may be negative. You can turn this feature off by selecting the option Absolute importance scores from object inspector > Inputs > Linear Regression.
The second column shows the Raw score, which is the same as Importance except that, rather than adding up to 100, the scores add up to the R-squared statistic, which in this case is 0.3871 (shown in the footer). Thus, we can say that 'Network Coverage', for example, explains 7.3% of the variance in Net Promoter Score (the outcome variable).
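The relationship between the Raw score, the Importance score, and the signs can be sketched in Python with NumPy. This is a textbook Shapley decomposition of R-squared run on simulated data. It is an illustration under stated assumptions, not the software's actual implementation: the function names (`r_squared`, `shapley_raw_scores`, `signed_importance`) and the data are hypothetical. The sketch enumerates every subset of predictors, so it is only practical for a small number of predictors.

```python
from itertools import combinations
from math import factorial

import numpy as np


def r_squared(X, y, cols):
    """R-squared of an OLS fit of y on the chosen columns plus an intercept."""
    if not cols:
        return 0.0
    A = np.column_stack([np.ones(len(y)), X[:, list(cols)]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    centered = y - y.mean()
    return 1.0 - (resid @ resid) / (centered @ centered)


def shapley_raw_scores(X, y):
    """Average marginal gain in R-squared for each predictor over all subsets."""
    p = X.shape[1]
    raw = np.zeros(p)
    for j in range(p):
        others = [k for k in range(p) if k != j]
        for size in range(p):
            # Standard Shapley weight for a subset of this size.
            weight = factorial(size) * factorial(p - size - 1) / factorial(p)
            for subset in combinations(others, size):
                gain = r_squared(X, y, subset + (j,)) - r_squared(X, y, subset)
                raw[j] += weight * gain
    return raw


def signed_importance(X, y):
    """Rescale raw scores to sum to 100 in absolute value, signed by OLS."""
    raw = shapley_raw_scores(X, y)
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    signs = np.sign(beta[1:])  # signs of the full linear regression coefficients
    return raw, 100 * signs * raw / raw.sum()


# Simulated data (an assumption for illustration): three correlated predictors,
# the third with a negative effect on the outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 1] += 0.6 * X[:, 0]
y = 1.0 * X[:, 0] + 0.5 * X[:, 1] - 0.8 * X[:, 2] + rng.normal(size=200)

raw, importance = signed_importance(X, y)
print("Raw scores:", raw, "sum:", raw.sum())
print("R-squared: ", r_squared(X, y, (0, 1, 2)))
print("Importance:", importance)
```

In this sketch the raw scores sum to the R-squared of the full model, and the importance scores are the raw scores rescaled so that their absolute values sum to 100, with a negative sign wherever the linear regression coefficient is negative.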