This article describes how to:
- Identify problems and bad data by checking sample sizes, screening criteria, routing and filtering instructions, duplicates
- Clean your data from recording and rebasing values to fixing metadata to deleting dodgy respondents
Start with a high-quality dataset
A more difficult dataset to work with would not include metadata. Some common culprits of being metadata-poor are fixed column text files, comma-delimited text files, Excel files, CSV files, SQL databases, XML files, JSON text files, and HTML.
If at all possible, use a metadata-rich file like SPSS .sav files and Q packs.
This will prevent having to look up and cross-reference information and help reduce errors in analysis.
Inspect the data
Step one is to check our sample size. If your sample size is bigger than expected, then you probably have respondents with incomplete data polluting your survey.
The next is to check your screening criteria worked as intended, and the fastest way to do so is with a Sankey Diagram. To create a Sankey Diagram, use Create > Charts > Visualization > Exotic > Sankey
This Sankey Diagram lets us quickly distinguish between respondents who have been flagged as excluded from the quota and those that are complete and in quota – meaning that they were aged over 18, had a known gender, and consumed cola at least once a month.
What about the data quality for each question and variable? Here are some things to check for: poor-metadata, unusual sample sizes, outliers and other funky values, too small categories, incorrect variable types, and incorrect variable sets. Check out the eBook for more detail.
Routing and filtering instructions
You’ll also want to check your routing and filtering instructions. You could scan through your raw data, but it’s time-consuming and easy to miss exceptions – especially if you have a large dataset. Here’s where the Sankey diagram comes in handy again.
Our table shows that those of a full-time work status were asked for their occupation. However, our Sankey diagram immediately clues us into the fact that some students and part-time workers were also asked for their occupation – indicating a problem with our survey routing.
Missing data patterns
A useful way for checking for missing data patterns is visually and can best be done with a heatmap. Q has some integrated features to plot missing data and can be created using Automate > Browse Online Library > Missing Data > Plot by Case.
Patterns can be found in the dataset using Automate > Browse Online Library > Missing Data > Plot of Patterns.
Different lines or clusters of lines indicate different problems with missing data. For example, long horizontal lines (highlighted on the left) can indicate observations with severe missing data issues. See the eBook for more details.
Duplicates can be a serious problem. You can create a table of the ID variable and sort by descending values, so any duplicates will show at the top of your table.
It’s also super useful to set up some unit tests to automate the process of checking for errors. This is an especially good investment when it comes to longitudinal or tracking projects as unit tests can automatically check key things whenever new data is uploaded. Some common things to test for in unit tests include: out-of-range errors, flow errors, variable non-response, variable consistency errors, lie tests, and sum constraint errors.
Clean the Data
Recoding and rebasing
Often you may want to set inconsequentially small values of “don’t know” to be missing values. This tells your analysis software to automatically filter the table and recompute the values with the “don’t knows” excluded.
You can easily do this by selecting the value for a variable and choosing the labels you'd like to remove from the base of your question:
This is also known as rebasing. Other common ways of recoding values include capping and re-aligning values with your labels.
Example of capping values at 30:
You can also merge small categories. In some software, this is regarded as another example of recoding. In Q or Displayr, merging is considered a separate process, since merging does not affect the underlying values of data.
Typically, you can fix metadata by adding or correcting labels.
When a respondent’s data is deemed non-useful (whether it’s because they flat-lined or had too many missing values etc.) we delete that row of the data file. The most useful way to do so is via filters.