This article describes how to impute missing data and add new imputed variables to your data set.
A popular approach to dealing with missing data is imputation, which replaces each missing value with an estimate of what it would have been.
Click here for more information on when to use imputation.
Requirements
A data file loaded into Q with at least one instance of missing data.
Method
- Select one or more variables or questions in the Variables and Questions tab that contain missing data.
- Select Automate > Browse Online Library > Create New Variables > Impute Missing Data.
This adds one imputed variable for each variable selected in the first step. Each new variable has "imputed" in its Name and Question.
To change how the imputation is performed:
- Select one of the new imputed Questions in the Variables and Questions tab.
- Right-click and select Edit R Variable.
- Choose the desired options in the Inputs section on the right. These options are explained below.
- OPTIONAL: Auxiliary variables: You can add additional variables to this drop-box to use the data from those variables in the imputation. These variables' data are used to inform the imputation, but are not themselves added to the data set.
- OPTIONAL: Seed: This is the random number seed used in the imputation. Changing this number will result in a different solution.
- OPTIONAL: Method:
  - Try mice: The imputation will initially try to use the mice (Multivariate Imputation by Chained Equations in R) algorithm, and if this is not successful it will attempt to use the hotdeck algorithm.
  - Hot Deck: Force the imputation to only use the hotdeck algorithm.
  - Mice: Force the imputation to only use the mice algorithm.
- Click Update R Variable.
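The Seed and Method options can be pictured with a small sketch. This is an illustration in Python, not the R code that Q runs: `hot_deck` is a simplified random hot deck (the hot.deck package chooses donors more carefully), and `failing_primary` is a hypothetical stand-in for a model-based imputer that hits a technical error. All function and option names here are assumptions made for the example.

```python
import random

def hot_deck(values, seed):
    """Simplified hot deck: fill each None with a randomly chosen observed value."""
    rng = random.Random(seed)  # the seed fixes which donors are drawn
    donors = [v for v in values if v is not None]
    if not donors:
        raise ValueError("no observed values available as donors")
    return [v if v is not None else rng.choice(donors) for v in values]

def failing_primary(values, seed):
    """Hypothetical model-based imputer that always hits a technical error."""
    raise RuntimeError("model could not be fitted")

def impute(values, seed, method="try_primary", primary=failing_primary):
    """Mimic the three Method choices: try-then-fall-back, or force one algorithm."""
    if method == "hot_deck":
        return hot_deck(values, seed)
    if method == "primary":
        return primary(values, seed)  # no fallback: errors are raised to the caller
    try:
        return primary(values, seed)
    except RuntimeError:
        return hot_deck(values, seed)  # fall back when the primary method fails

data = [1, None, 3, None, 5]
first = impute(data, seed=42)
second = impute(data, seed=42)   # same seed, same result
print(first == second)
```

Running the imputation twice with the same seed gives identical results, while a different seed may pick different donors. Forcing the primary method removes the fallback, so a technical error stops the imputation instead of switching algorithms.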
Technical Details
By default, data are imputed using the default settings of the mice R package, which employs Multivariate Imputation by Chained Equations (predictive mean matching) [1]. Take care that each variable has the correct variable type, because mice's default imputation method depends on the type of each variable, so a wrongly typed variable can materially change the imputed values. If mice fails with a technical error, the imputation is instead performed using hot-decking, via the hot.deck package in R [2].
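To give a feel for predictive mean matching, here is a minimal sketch in Python with a single numeric predictor. It is a simplification and not the mice implementation: mice also adds random variation to the regression coefficients and cycles through all variables with missing data, and the function names and the `k` parameter below are assumptions made for the example.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def pmm_impute(xs, ys, k=3, seed=0):
    """Fill missing ys with the observed y of a donor whose prediction is close."""
    rng = random.Random(seed)
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    intercept, slope = fit_line([x for x, _ in observed], [y for _, y in observed])

    def predict(x):
        return intercept + slope * x

    result = []
    for x, y in zip(xs, ys):
        if y is not None:
            result.append(y)  # observed values are never altered
            continue
        # Rank donors by how close their predicted value is to this case's prediction.
        ranked = sorted(observed, key=lambda d: abs(predict(d[0]) - predict(x)))
        result.append(rng.choice(ranked[:k])[1])  # borrow the donor's actual y
    return result

xs = [1.0, 2.0, 3.0, 4.5, 5.0]
ys = [2.0, 4.0, 6.0, None, 10.0]
print(pmm_impute(xs, ys, k=1))
```

Because the imputed value is borrowed from a donor, it is always a value that actually occurs in the data, never a prediction outside the observed range.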
When applied with regression, missing values in the outcome variable are excluded from the analysis after the imputation has been performed.[3]
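The order of operations described above (impute everything, then drop cases whose outcome was originally missing) can be sketched as follows. This is a Python illustration only: `mean_fill` is a placeholder imputer chosen to keep the example short, not the method Q uses, and all names are assumptions.

```python
def mean_fill(rows):
    """Placeholder imputer: replace each None with the mean of that column's observed values."""
    columns = list(rows[0].keys())
    means = {}
    for col in columns:
        observed = [row[col] for row in rows if row[col] is not None]
        means[col] = sum(observed) / len(observed)
    return [
        {col: (row[col] if row[col] is not None else means[col]) for col in columns}
        for row in rows
    ]

def impute_then_delete(rows, outcome, impute):
    """Impute all variables, then drop rows whose outcome was missing before imputation."""
    was_missing = [row[outcome] is None for row in rows]  # record before imputing
    completed = impute(rows)
    return [row for row, missing in zip(completed, was_missing) if not missing]

rows = [
    {"x": 1.0, "y": 2.0},
    {"x": None, "y": 4.0},   # predictor missing: imputed and kept
    {"x": 3.0, "y": None},   # outcome missing: imputed, then excluded
]
analysis_rows = impute_then_delete(rows, outcome="y", impute=mean_fill)
print(analysis_rows)
```

The case with a missing predictor stays in the analysis with its imputed value, while the case with a missing outcome takes part in the imputation but is left out of the regression.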
Note that although imputation can reduce the bias of parameter estimates, it can create misleading statistical inference (e.g., as the simulated sample size is assumed to be the actual sample size in calculations).
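The sample size problem can be shown with a small numerical sketch in Python. The data are made up for illustration, and missing values are filled with the observed mean purely to keep the arithmetic simple (this is not what mice or hot-decking does).

```python
import math
import statistics

def standard_error(values):
    """Standard error of the mean, treating every value as a real observation."""
    return statistics.stdev(values) / math.sqrt(len(values))

observed = [4.0, 6.0, 5.0, 7.0, 3.0, 5.0]   # six real responses (made-up data)
n_missing = 4                                # four respondents did not answer

# Fill the gaps with the observed mean, then treat all ten values as real data.
filled = observed + [statistics.mean(observed)] * n_missing

se_observed = standard_error(observed)
se_filled = standard_error(filled)
print(se_observed, se_filled)
```

The standard error computed from the filled data is smaller, even though no new information was collected. Any test or confidence interval built on it will overstate the precision of the estimate.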
The new Questions are imputed jointly. This means that if you make changes to one of them then the others will also change.
There are some technical limitations on how you can change the new variables:
- You cannot add or remove variables from the Variables drop-box.
- You cannot change the order of variables in the Variables drop-box.
- If you wish to delete any of the imputed variables you must delete them all together because they are linked.
1. Stef van Buuren and Karin Groothuis-Oudshoorn (2011). "mice: Multivariate Imputation by Chained Equations in R". Journal of Statistical Software, 45(3), 1-67.
2. Skyler J. Cranmer and Jeff Gill (2013). "We Have to Be Discrete About This: A Non-Parametric Imputation Technique for Missing Categorical Data". British Journal of Political Science, 43, 425-449.
3. Paul T. von Hippel (2007). "Regression With Missing Y's: An Improved Strategy for Analyzing Multiply Imputed Data". Sociological Methodology, 37(1), 83-117.
See Also
How to Recode Missing Values in Q
How to Hide and Remove Categories (Missing Values)