Basic Workflow For Checking and Cleaning a Project

The basic workflow for checking and cleaning a project involves working through the activities described in the sections on this page. Other than needing to begin by importing data, the steps can be done in any order.

Importing the data

Checking the data file

Checking the data

Hiding irrelevant data

Checking Question Types

Reviewing the Value Attributes

Tidying Names and Labels

Checking questionnaire skips

Changing data (i.e., changing a respondent's values)

Back-coding 'other specifies'

Deleting cases

Method

Importing the data

To start a project, select File > Data Sets > Add to Project > From File, and select a data file you wish to analyze in Q. If you already have a project open, you should first select File > New Project.

You will then be shown the Data Import window which gives options for how the data file will be imported. By default, the option Automatically detect data file structure, e.g. group variables into "questions" (recommended) will be chosen, and you should use this option in most cases. In the Advanced section, the option Tidy up Variable Labels will be selected. If you keep this option, which is usually appropriate, Q will strip out any repetitive text that appears in labels within a multiple-response question (e.g., if the labels were Satisfaction: Citibank and Satisfaction: Bank of America, Q will replace these with Citibank and Bank of America). The Advanced option Strip HTML from Labels will remove HTML tags which are sometimes present in labels from online surveys. You can read more about each of these options in How to Import Datasets in Q.
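To make the effect of these two label options concrete, here is a minimal Python sketch of the general idea. It is not Q's actual implementation: the function names strip_html and tidy_labels, and the choice of ": " as the separator, are assumptions made purely for illustration.

```python
import os
import re


def strip_html(label: str) -> str:
    """Remove anything that looks like an HTML tag, then trim whitespace."""
    return re.sub(r"<[^>]+>", "", label).strip()


def tidy_labels(labels: list[str], separator: str = ": ") -> list[str]:
    """Drop the text shared by every label, up to and including the last separator.

    Cutting at the separator (an assumption) avoids trimming part of a word
    when the labels happen to share extra leading characters.
    """
    if len(labels) < 2:
        return list(labels)
    shared = os.path.commonprefix(labels)
    cut = shared.rfind(separator)
    if cut == -1:
        return list(labels)  # nothing repetitive to strip
    prefix_length = cut + len(separator)
    return [label[prefix_length:] for label in labels]


raw = ["<b>Satisfaction: Citibank</b>", "<b>Satisfaction: Bank of America</b>"]
cleaned = tidy_labels([strip_html(label) for label in raw])
print(cleaned)  # ['Citibank', 'Bank of America']
```

Labels with no shared prefix are returned unchanged, which mirrors the idea that only repetitive text is stripped.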

Many of the QScripts that are mentioned in the remaining sections of this article appear as options in the Advanced section. You can select them to run at the time of the file import, or you can run them later on by choosing them from Automate > Browse Online Library > Preliminary Project Setup, or by typing the word "setup" into the Search features and data box in the top right corner of the Q window and clicking in the QScripts and Rules section of the search results.

Checking the data file

Sometimes data files contain errors that can make data analysis difficult and, sometimes, impossible. Q has a QScript for checking for the most common problems. To run this QScript:

Type setup into the Search features and data box at the top of the Q window.
Click on the QScripts and Rules section.
Select How to Check for Errors in Data File Construction.

This script will scan through the data in your project, looking for common errors in the data file setup. It will present a set of tables that highlight the errors so that you can address them, or if they are serious, ask your data provider to fix them and send you a new copy of your data.

Errors that the script tries to identify include:

A variable with the wrong Variable Type (for example, numeric data stored in a text variable).
Incorrect Missing Data settings in binary variables.
Missing labels.

A full list is available in the documentation for the script.
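As a rough illustration of what two of these checks involve, the following Python sketch tests whether a text variable really holds numbers and whether any value labels are blank. It does not reproduce the QScript. The helper names looks_numeric and missing_labels and the sample values are hypothetical.

```python
def looks_numeric(text_values: list[str]) -> bool:
    """Return True if every non-blank entry can be parsed as a number."""
    non_blank = [value.strip() for value in text_values if value.strip()]
    if not non_blank:
        return False  # an empty variable gives no evidence either way
    for value in non_blank:
        try:
            float(value)
        except ValueError:
            return False
    return True


def missing_labels(value_labels: dict[int, str]) -> list[int]:
    """Return the codes whose label is blank."""
    return [code for code, label in value_labels.items() if not label.strip()]


print(looks_numeric(["34", "27.5", "", " 61 "]))   # True
print(looks_numeric(["34", "twenty", "61"]))       # False
print(missing_labels({1: "Yes", 2: "", 3: "No"}))  # [2]
```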

Checking the data

To create a table containing basic summary data:

Type "setup" into the Search features and data box at the top of the Q window.
Click on the QScripts and Rules section.
Select Preliminary Project Setup - Summary Tables.
Review the tables and address any problems. In particular:
- Check that the NET value is sensible. See How to Investigate a Sample Size or NET that is Too Small.
- Look at the base n at the bottom of the table, as it will often highlight data integrity issues. If it shows a range of values (e.g., base n = from 120 to 139), this indicates that cells on the table differ in their sample size. Use Statistics - Cells > Base n, n to explore this in more detail. Where the base n includes a very low number (e.g., base n = 0 to 139), this generally indicates either a problem with the Value Attributes or that the NET or SUM on the table should be hidden (right-click on it and select Hide). See How to Investigate a Sample Size or NET that is Too Small for more information.
Run the QScript called How to Create Tables for Data Checking, which will focus on creating tables that contain results that are automatically identified as requiring attention (e.g., tables with very small cell counts).

Another way to view sample size information by variable is to go to the Variables and Questions tab and press the icon in the toolbar, which will compute the minimum, maximum, mean, and sample size for all the variables in the database.
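The sketch below shows, in plain Python, the kind of per-variable summary described above and how a base n range arises when variables have different numbers of non-missing responses. The variable names and data are hypothetical, and the code only illustrates the arithmetic, not how Q computes it.

```python
from statistics import mean

# Hypothetical data: one list per variable, with None marking a missing response.
data = {
    "Q1": [5, 4, None, 3],
    "Q2": [2, None, None, 1],
    "Q3": [1, 2, 3, 4],
}


def summarize(values):
    """Return sample size, minimum, maximum, and mean of the non-missing values."""
    observed = [v for v in values if v is not None]
    if not observed:
        return {"n": 0, "min": None, "max": None, "mean": None}
    return {
        "n": len(observed),
        "min": min(observed),
        "max": max(observed),
        "mean": mean(observed),
    }


summaries = {name: summarize(values) for name, values in data.items()}
sizes = [s["n"] for s in summaries.values()]

for name, s in summaries.items():
    print(name, s)
print(f"base n = from {min(sizes)} to {max(sizes)}")  # base n = from 2 to 4
```

A variable whose sample size is far below the others (here Q2) is the kind of result worth investigating.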

See Using Scripts to Automate Data Checking and Cleaning for more advanced options.

Hiding irrelevant data

If a table is showing information that you think the user will not want to see (e.g., administrative records):

Press the blue arrow to the right of the Blue Drop-Down menu to select the variable in the Variables and Questions tab, and hide the question there.
Return to the Outputs tab and delete the table.

Alternatively, you can get Q to automate this processing using How to Hide Uninteresting Data.

Checking Question Types

Q automates many analyses using Question Type, so ensuring that the Question Type has been set correctly is an essential step in checking a project. If your data file has been created well (see File Formats Supported by Q), then all or nearly all of the question types will be set up correctly by Q automatically.

The Question Type used by Q can be identified by looking at the Question Type column next to the relevant variables in the Variables and Questions tab (if you are on the Outputs Tab, press the blue arrow to the right of the drop-down menu that contains the question you wish to review). The Question Type can be modified as follows:

If Q has inadvertently grouped together multiple questions, where they should have been kept separate, select the relevant variables, right-click on the selected variables, and select Revert to source. Once this has occurred, create new tables in the Outputs tab for each of the variables.
If Q has failed to group together multiple variables as a question, select the variables in the Variables and Questions tab, right-click, and select Set Question.
If the variables have been correctly grouped, but the Question Type is wrong, change the Question Type.

The online tutorial on Multiple Response Questions and the ensuing tutorials in the Manipulating Data > Questions section of Online Training provide worked examples.

If the Status column in the Variables and Questions tab is showing any yellow cells, this means that you need to review the Value Attributes (in general, you should review the Value Attributes anyway if you are new to Q or have obtained data from a new supplier).

Reviewing the Value Attributes

If you right-click on a row label or column header on a table and select Values, Q will open the Value Attributes dialog box. Within this dialog box, you can:

Set data as missing by checking options in the Missing Data column, which will cause percentages and means to be recomputed.
Recode data by editing the contents of the Value column. In particular, if a category is not checked in the Missing Data column, the Value shown will be used when computing averages, medians, and other non-categorical statistics. Consequently, it is often useful to replace the Value in the data set with some other more useful value. Common things to change are:
- Using a Value that more accurately reflects the label (e.g., a midpoint). See How to Calculate an Average Value from Categorical Data in Q.
- Setting categories that correspond to missing data to 0, where it can be deduced that the data is missing because 0 is the appropriate answer (e.g. if respondents were asked their purchase frequency only if they were aware of the product).
- Setting the value for Don't know responses to NaN. See Recode - Set the Value of Don't Knows to NaN.
Modify the categories used in computing percentages on Pick Any and Pick Any – Grid questions using Count This Value.

There are many other ways of doing each of these operations in Q. For example, if a table shows categories that you wish to specify as missing, right-click on the categories and select Remove. Alternatively, right-click on the category, select Values, and modify the Value Attributes.
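To see why recoding the Value matters, consider this small Python sketch. It assumes a hypothetical age question with three bands and a Don't know category, and compares the average computed from the original codes with the average after recoding to midpoints and NaN. The dictionaries and the average function are illustrative only and are not Q's internal representation.

```python
import math

# Hypothetical Value Attributes for an age question: code -> (label, value).
value_attributes = {
    1: ("18 to 24", 1),
    2: ("25 to 34", 2),
    3: ("35 to 44", 3),
    4: ("Don't know", 4),
}

# Recode: category midpoints for the age bands, NaN for "Don't know".
recoded = {
    1: ("18 to 24", 21.0),
    2: ("25 to 34", 29.5),
    3: ("35 to 44", 39.5),
    4: ("Don't know", math.nan),
}


def average(responses, attributes):
    """Mean of the Values behind each response code, skipping NaN."""
    values = [attributes[code][1] for code in responses]
    usable = [v for v in values if not math.isnan(v)]
    return sum(usable) / len(usable) if usable else math.nan


responses = [1, 1, 2, 3, 4]
print(average(responses, value_attributes))  # 2.2
print(average(responses, recoded))           # 27.75
```

With the original codes the average is a meaningless 2.2. After recoding, it is an interpretable average age, and the Don't know response no longer contributes.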

Tidying Names and Labels

If the names of questions appear to be very short then it may be possible to obtain better question names from the labels in the raw data using the QScript: How to Suggest Better Question Names From Source Labels.

Similarly, if labels appear messy, with information about the question included, then it may be possible to tidy them up using the QScript: How to Remove Truncated Text From Variable Labels.

Checking questionnaire skips

There are four different ways of checking questionnaire skips within Q:

Create filters from questions that were used to determine the skips and apply these to tables in the Outputs Tab.
On the Variables and Questions tab, press the toolbar icon that shows the sample size for each variable.
On the Data tab, sort any variables that are used as skips (which causes the other variables' rows to be aligned with the variable used in the skips).
Use a QScript. In particular, see Checking for Invalid Data.
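Whichever method is used, the underlying logic of a skip check is the same: compare who should have answered a question with who actually did. The Python sketch below illustrates that logic under assumed data. The field names aware and frequency and the function check_skip are hypothetical, and nothing here is specific to Q.

```python
# Hypothetical respondents: "frequency" should be answered only when aware == "Yes".
respondents = [
    {"id": 101, "aware": "Yes", "frequency": 3},
    {"id": 102, "aware": "No", "frequency": None},
    {"id": 103, "aware": "No", "frequency": 2},      # answered but should have skipped
    {"id": 104, "aware": "Yes", "frequency": None},  # skipped but should have answered
]


def check_skip(rows, should_answer, target):
    """Split skip violations into unexpected answers and unexpected blanks."""
    unexpected_answers = []
    unexpected_blanks = []
    for row in rows:
        answered = row[target] is not None
        expected = should_answer(row)
        if answered and not expected:
            unexpected_answers.append(row["id"])
        elif expected and not answered:
            unexpected_blanks.append(row["id"])
    return unexpected_answers, unexpected_blanks


answers, blanks = check_skip(respondents, lambda r: r["aware"] == "Yes", "frequency")
print(answers)  # [103]
print(blanks)   # [104]
```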

Changing data (i.e., changing a respondent's values)

There are a variety of different tools for changing respondents' data, including:

Manually changing the values on the Data tab. Note that you will need to specify the Case IDs on the Data tab (top left corner) before editing the data; otherwise, Q cannot ensure that the changes are remembered if the data file is updated. When manually changing the values, there are a few tricks for doing it quickly:
- You can split the Data tab into two sections (by dragging one of the blocks positioned at the top or left side of the scroll bars), and you can also sort the data according to any column by right-clicking the column header and selecting Sort data by this column.
- To quickly find variables, you can click the blue arrows to the right of the Blue and Brown Drop-down Menus of the Outputs Tab to select variables on the Variables and Questions tab, and then the blue arrows on this tab to select the variables on the Data tab. (If you cannot see your variable, press the down arrow key to find out where you are.) You can also find variables on the Data tab by selecting Edit > Find Variable.
Recoding data by modifying the Value Attributes.
Taking a linked copy of variables or questions and using a formula to conditionally recode the data (see How to Conditionally Recode into a New Variable). Note that this can be done for large numbers of variables by using Search/Replace and Use as Template for Replication.
Using QScripts in the Online Library. In particular:
- How to Create New Variables with Outliers Removed
- How to Identify Questions with Straight-Lining/Flat-Lining
Using a QScript. The basic logic is similar to taking linked copies (i.e., the original variables are copied and modified), and this approach is generally only appropriate if you are trying to manipulate large amounts of data or have already completed much of your project setup and wish to avoid redoing things. See Automatically Recoding Into New Variables and Rebasing Questions Using a Condition for examples of QScripts related to this problem.
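One of the checks named in the list above, straight-lining, can be illustrated with a short Python sketch. This shows only the basic idea of flagging respondents who give the same rating to every item in a grid. It is an assumption about the general technique, not a description of what the Online Library QScript does, and the data and function name are hypothetical.

```python
def is_straight_lined(ratings):
    """True when every answered item in a grid has the same rating."""
    answered = [r for r in ratings if r is not None]
    return len(answered) > 1 and len(set(answered)) == 1


# Hypothetical grid question: case ID -> ratings for five statements.
grid = {
    201: [3, 3, 3, 3, 3],
    202: [4, 2, 5, 3, 4],
    203: [5, 5, None, 5, 5],
}

flagged = [case_id for case_id, ratings in grid.items() if is_straight_lined(ratings)]
print(flagged)  # [201, 203]
```

Flagged cases are candidates for review rather than automatic removal, since identical ratings can also be a genuine response.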

Back-coding 'other specifies'

See How to Back Code Other Specify Responses.

Deleting cases

If you identify that the data contains cases (respondents) that should be deleted when cleaning and checking the data, this can be done in the Data tab by right-clicking on rows and selecting Delete Row. Deleted cases are not deleted from the data file, but they are excluded from any analyses. You can return deleted rows by right-clicking and selecting Revert Deleted Rows.
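The behavior described above (cases are excluded from analysis but not removed from the data file, and can be restored) is essentially a soft delete. The following Python sketch illustrates that idea. The CaseList class and its method names are hypothetical and do not correspond to anything in Q.

```python
class CaseList:
    """Keeps every case but tracks which ones are excluded from analysis."""

    def __init__(self, cases):
        self._cases = dict(cases)  # case ID -> value; never modified by delete
        self._deleted = set()

    def delete(self, case_id):
        """Mark a case as deleted without removing its data."""
        if case_id in self._cases:
            self._deleted.add(case_id)

    def revert_deleted(self):
        """Restore all deleted cases."""
        self._deleted.clear()

    def active_values(self):
        """Values that analyses would use, i.e. those of non-deleted cases."""
        return [v for cid, v in self._cases.items() if cid not in self._deleted]


cases = CaseList({1: 10, 2: 20, 3: 999})
cases.delete(3)
print(cases.active_values())  # [10, 20]
cases.revert_deleted()
print(cases.active_values())  # [10, 20, 999]
```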
