What makes a table interesting?
Two crosstabs are shown below. Which of these is interesting? If we are going to automate the process of sifting through crosstabs, we need to decide what makes a table interesting. The simplest way to do this is to see if there are meaningful differences between the percentages, reading across the rows. But instead of reading them all manually, you can automate the process using tests of statistical significance. In the example below, I’ve used colors to denote significant results. The same thing can be done in other ways, such as using letters to indicate which columns differ from the others. Looking at these tables, only one of them jumps out as “interesting”, and I bet you picked it right away.
Approach 1: Automatically delete tables that have no significant results
This approach is for those who love deleting things (and efficiency!). All you have to do is have your software automatically scan through all your tables and delete any that do not contain significant results, leaving you with only the statistically significant ones. I’ve got 3,050 crosstabs. By deleting three-quarters of these, I’m saving myself heaps of time and energy. While this is a great way to quickly narrow down your results, you’ll need to be proficient at writing code to script it in R or SPSS. If the thought of writing code makes you wince, use Q or Displayr. They both automate this, so it’s only a click away.
The first step is to create lots of crosstabs. In Q, the steps are:
- Click Create > Tables > Crosstabs
- Select all the variables that you wish to have in the rows of the crosstabs and press OK. In the examples in this post, I’ve selected all of them.
- Select all the variables you wish to have as columns and press OK. For this example, I’ve selected various five-point agreement scales: Allows to keep in touch, Technology fascinating, …, Would like to do mobile banking with phone.
- Choose your report type, selecting the option with just one table per page (as shown to the right), and press Create Report.
You will now have many folders, each containing pages cross-tabbing all the variable sets in your project by the key variable sets that you selected. If you are using the phone.sav data set that I am using, you will have almost 2,000 crosstabs!
To see the p-values or z-statistics on any of the tables, right-click on them and select Statistics – Cells; to add multiple statistics at the same time, hold down the Ctrl key on your keyboard.
The next step is to delete all the tables that are not significant. We can do this in Q by:
- Selecting all the folders containing tables.
- Selecting Home > Utilities > Delete tables and plots and choosing one of the options. The smaller the p-value cutoff, the fewer tables will be left.
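For readers who would rather script the idea than click through it, here is a minimal Python sketch of the same logic. It is not what Q or Displayr does internally. It assumes each crosstab is a grid of raw counts, tests each cell with a simple pooled two-proportion z-test (each row against all other rows), and applies no multiple-comparison correction. The table names, the counts, and the helper names `cell_z`, `min_p_value`, and `keep_significant` are all made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def cell_z(table, i, j):
    """z-statistic comparing column j's share in row i with its share in all other rows."""
    row_total = sum(table[i])
    other_total = sum(sum(r) for k, r in enumerate(table) if k != i)
    if row_total == 0 or other_total == 0:
        return 0.0
    hits = table[i][j]
    other_hits = sum(r[j] for k, r in enumerate(table) if k != i)
    pooled = (hits + other_hits) / (row_total + other_total)
    se = sqrt(pooled * (1 - pooled) * (1 / row_total + 1 / other_total))
    if se == 0:
        return 0.0
    return (hits / row_total - other_hits / other_total) / se


def min_p_value(table):
    """Smallest two-sided p-value across all cells of one crosstab."""
    z_max = max(
        abs(cell_z(table, i, j))
        for i in range(len(table))
        for j in range(len(table[0]))
    )
    return 2 * (1 - NormalDist().cdf(z_max))


def keep_significant(tables, alpha=0.05):
    """Drop every crosstab that has no cell significant at the given cutoff."""
    return {name: t for name, t in tables.items() if min_p_value(t) < alpha}


# Hypothetical counts: rows are categories of the row variable,
# columns are (agree, disagree).
tables = {
    "Age by Allows to keep in touch": [[40, 10], [15, 35]],
    "Gender by Allows to keep in touch": [[25, 25], [26, 24]],
}

kept = keep_significant(tables)
print(list(kept))
```

With these invented counts only the Age table survives the cutoff. Lowering `alpha` leaves fewer tables, just as choosing a smaller p-value does in the menu option.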
Approach 2: Use a heatmap to summarize thousands of crosstabs
While the first approach wins points for effectiveness and ease, it loses points for being binary. In this approach, tables are either black or white, significant or insignificant. There’s no allowance for shades of grey. Luckily, we can enter a technicolor world with an even more powerful approach: using a heatmap to summarize thousands of crosstabs!
The heatmap below summarizes 3,050 crosstabs. Each colored box shows the degree of statistical significance, where the degree in this case is something called a z-Statistic. Below, I describe a bit more about what the z-Statistic is, but for now, all we need to know is that the darker the box is shaded, the more significant the underlying table is. What can we glean from this? For example:
- The first column shows how other questions in the study were related to agreeing with Allows to keep in touch. Reading down this column we can quickly see that agreement with this attitude can be predicted by Work status, Occupation, and Age. Reading across the rows we can see that there is no other attitude that is related to these three.
- If we scroll down further, we see a white diagonal line of boxes. This shows the crosstabs of each attitude with each other. Putting aside the white, note how there are a lot more dark cells here. This tells us that the attitudes are highly correlated with each other. Note, however, that there is a lot of variation. Some of the cells are much darker than others, telling us that we can likely group together similar attitudes (e.g., using PCA or cluster analysis).
- Scrolling even further down, you will see that the blue gets very, very, pale for most, but not all, of the variables relating to behavior. We can see two things here. First, the attitudes and behavior in these examples are not closely related. Second, there are a small number of stronger relationships meriting more exploration (i.e., the very dark cells).
Why does the heatmap use z-Statistics?
There are a few issues with the traditional approach of deleting tables whose p-values exceed the cutoff for significance (0.05). One of the main issues is that, as p-values get smaller and closer to 0, it becomes difficult to compare them without having to squint at a lot of decimal places. In the table below, you can see the p-values in the third row of numbers. If you zoom in on the Strongly disagree column, the p-values could be anything from 0.003 to 0.000000001; it’s simply impossible to tell.
But wait, there’s a solution! The z-Statistics contain the same information as the p-values, except re-scaled to make comparison easier. Check the table below for a handy guide to converting p-values to z-Statistics. The key point here is that the difference between p-values of 0.0001 and 0.0000001 is much bigger when viewed as a z-Statistic, making it much easier to understand practical differences between the two.
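The conversion itself is short enough to sketch in Python. This assumes a two-sided test against the standard normal distribution, which is the usual convention but is my assumption here. The function name `p_to_z` is made up for illustration.

```python
from statistics import NormalDist


def p_to_z(p):
    """Convert a two-sided p-value to the absolute z-statistic (assumes a standard normal)."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    # Half of the p-value sits in each tail; flip the sign of the lower-tail quantile.
    return -NormalDist().inv_cdf(p / 2)


for p in (0.05, 0.0001, 0.0000001):
    print(f"p = {p:.7f}  ->  z = {p_to_z(p):.2f}")
```

A p-value of 0.05 maps to the familiar 1.96, while 0.0001 and 0.0000001 land at roughly 3.9 and 5.3: two values that are easy to tell apart at a glance, unlike the strings of zeros they came from.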
The heatmap above (it may take a while to load) shows the z-statistics for all 3,050 tables, with darker blue for higher z-scores, and the z-scores capped at 5 (i.e., any value greater than 5 is changed to 5, as beyond 5 the differences are immaterial). The heatmap was created by:
- Selecting all the folders containing all of the tables
- Automate > Browse Online Library > Significance Testing > Identify Interesting Tables. This creates a table called most.significant.results.
- Create > Charts > Visualization > Heatmap
- Set Inputs > DATA SOURCE > Output in ‘Pages’ to most.significant.results (it will be at the very bottom)
- Press CALCULATE. (Note that the table of interesting numbers does not automatically update if the input tables are changed; this is the exception that proves the rule that everything in Q automatically updates.)
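The capping of z-scores at 5 mentioned earlier is easy to picture in code. Below is a minimal Python sketch, not Q’s implementation. It assumes the sign of the z-statistic is ignored when shading, and it maps the capped value onto a 0 to 1 intensity, where 0 is white and 1 is the darkest blue. The z-scores and the names `cap_z` and `shade` are invented for illustration.

```python
def cap_z(z, cap=5.0):
    """Ignore the sign and change anything above the cap to the cap itself."""
    return min(abs(z), cap)


def shade(z, cap=5.0):
    """Map a z-statistic to an intensity from 0 (white) to 1 (darkest)."""
    return cap_z(z, cap) / cap


# Hypothetical z-scores keyed by (row variable, column variable).
z_scores = {
    ("Age", "Allows to keep in touch"): 7.2,
    ("Work status", "Allows to keep in touch"): 4.0,
    ("Gender", "Allows to keep in touch"): -1.1,
}

shades = {pair: shade(z) for pair, z in z_scores.items()}
for pair, value in shades.items():
    print(pair, round(value, 2))
```

Without the cap, a single enormous z-score would stretch the color scale and wash out every other cell.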
Approach 3: Smart Tables
The third approach combines the two approaches above to get the best of both. This bit of magic works as follows:
- First, you identify a specific question of interest. For example, if you want to profile a segmentation, you select the variable that indicates which person is in which segment.
- Select any questions that may be of interest as crosstabs with the question of interest. If you aren’t sure, just select all the variables.
- Compute statistical significance for each of the crosstabs.
- Delete all the tables that aren’t statistically significant.
- Rank the tables according to statistical significance.
An example of such an output, from Q, is shown below. Like with the heatmap, we end up with an output that allows us to quickly identify the crosstabs of interest.
Smart Tables is run in Q as follows:
- Create > Tables > Smart Tables
- Select a single question of interest as the Dependent question. For example, if you want to profile a segmentation, you select the variable that indicates which person is in which segment.
- Select any other questions that may be of interest as the Independent questions and press OK.
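The filter-then-rank logic described for this approach can be sketched in a few lines of Python. This is only an illustration of the idea, not how Q’s Smart Tables is implemented. It assumes a z-statistic has already been computed for each crosstab against the question of interest, and it uses 1.96 (two-sided p of 0.05) as the cutoff. The question names, z-values, and the function name `smart_tables` are made up.

```python
def smart_tables(z_by_table, z_critical=1.96):
    """Keep the significant crosstabs and rank them from most to least significant."""
    significant = {
        name: z for name, z in z_by_table.items() if abs(z) >= z_critical
    }
    return sorted(significant.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical z-statistics for crosstabs of each question by a segment variable.
z_by_table = {
    "Age": 4.8,
    "Gender": 0.6,
    "Occupation": -3.1,
    "Work status": 2.2,
}

ranked = smart_tables(z_by_table)
for name, z in ranked:
    print(f"{name}: z = {z:+.1f}")
```

Ranking by the absolute z-statistic gives the same order as ranking by p-value, but it keeps apart tables whose p-values would all display as 0.000.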