Introduction
This article describes how to calculate Jaccard coefficients in Q using R.
One way of measuring the overlap or similarity between the data in two binary variables is to use a Jaccard coefficient. The coefficient ranges between 0 and 1, with 1 indicating that the two variables overlap completely, and 0 indicating that there are no selections in common. In this post I show you how to do the calculation in Q using R by looking at how people’s preferences for confectionery flavors overlap according to their responses to a survey.
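For reference, the coefficient can be written as a formula. The notation below is my own shorthand, chosen to match the names M.11, M.10 and M.01 used in the code later in this article: M_11 is the number of cases where both variables equal 1, M_10 the number where only the first equals 1, and M_01 the number where only the second equals 1. Cases where both variables equal 0 do not enter the calculation.

```latex
J(x, y) = \frac{M_{11}}{M_{11} + M_{10} + M_{01}}
```

For example, if 2 respondents selected both flavors, 1 selected only the first, and 1 selected only the second, the coefficient is 2 / (2 + 1 + 1) = 0.5.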
Requirements
The variables for the Jaccard calculation must be binary, with values of 0 or 1. They may also contain missing values; for each pair of variables, any case with a missing value in either variable is excluded from the coefficient for that pair.
In Q, this means that your variables must come from a Number, Number – Multi, or Pick Any question. You can check and change the Question Type by looking in the Question Type column of the Variables and Questions tab.
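If you want to confirm the requirement from within R, a check along these lines may help. This is a minimal sketch in base R: is.binary is a hypothetical helper of my own, not a Q or Displayr function, and the two vectors are made-up data.

```r
# Hypothetical helper (not part of Q or Displayr): TRUE if every
# non-missing value of v is 0 or 1
is.binary = function(v) {
    all(v[!is.na(v)] %in% c(0, 1))
}

# Toy data standing in for survey variables
ok = c(1, 0, NA, 1)
not.ok = c(1, 2, 0, NA)

is.binary(ok)      # TRUE: only 0, 1 and a missing value
is.binary(not.ok)  # FALSE: contains a 2
```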
Method
To calculate Jaccard coefficients for a set of binary variables, you can use the following:
- Select Create > R Output.
- Paste the code below into the R CODE section on the right.
- Change the line of the code that defines input.variables so that it lists the Name of each variable you want to include. The variable Name is the entry in the Name column of the Variables and Questions tab.
- Click Automatic.
Method
To calculate Jaccard coefficients for a set of binary variables in Displayr, you can use the following:
- Select Calculation > Custom Code.
- Place the custom calculation on the page.
- Paste the code below into the R Code editor.
- Change the line of the code that defines input.variables so that it lists the Name of each variable you want to include. The variable Name can be found by hovering over the variable in the Data Sources tree, or by selecting the variable and looking under General > Name.
The code for the Jaccard coefficients is:
# Create a function to calculate the Jaccard coefficient between two binary variables.
# For each pair of variables, any case with a missing value in either variable
# is excluded from the coefficient for that pair; other pairs are unaffected.
Jaccard = function (x, y) {
    # Keep only the cases where both x and y are non-missing
    complete = complete.cases(x, y)
    x = x[complete]
    y = y[complete]
    M.11 = Sum(x == 1 & y == 1)  # selected in both variables
    M.10 = Sum(x == 1 & y == 0)  # selected in x only
    M.01 = Sum(x == 0 & y == 1)  # selected in y only
    return (M.11 / (M.11 + M.10 + M.01))
}

# CHANGE the variables inside the data.frame() to those you want to include by using the variable name
input.variables = data.frame(Q6_01, Q6_02, Q6_03, Q6_04, Q6_05, Q6_06, Q6_07, Q6_08, Q6_09)

# Create an empty matrix of missing values to write results to
m = matrix(data = NA, nrow = length(input.variables), ncol = length(input.variables))

# Fill the diagonal with 1 and the top half with the coefficient for each pair
for (r in 1:length(input.variables)) {
    for (c in 1:length(input.variables)) {
        if (c == r) {
            m[r,c] = 1
        } else if (c > r) {
            m[r,c] = Jaccard(input.variables[,r], input.variables[,c])
        }
    }
}

# Pull off the variable labels to use as the row and column headers
variable.names = sapply(input.variables, attr, "label")
colnames(m) = variable.names
rownames(m) = variable.names

# Return the final table and name it jaccards
jaccards = m
In this code:
- I have defined a function called Jaccard. The function takes any two variables and calculates the Jaccard coefficient for those two variables. A function is a set of instructions that can be used elsewhere in the code. Particularly for more complicated blocks of code, writing a function like this can make your code more efficient and easier to read and check for mistakes.
- If there are missing values, any case with a missing value for a pair is excluded from the coefficient for that pair. The complete.cases call drops those cases, and the Sum function also ignores missing values.
- input.variables contains a data frame which has each of the variables you want to analyze as its columns. Use the reference Name of each variable for this. To reference variable sets instead, see Code edits for variable sets below.
- I have used two for loops to go through and calculate the Jaccard coefficients and fill up the top half of the matrix.
- The bottom half of the matrix is left empty. In Displayr, missing values are displayed as empty cells. As the bottom half of the matrix would be identical to the top half, empty cells help us to read the results more easily.
- I have used the sapply function to obtain the labels for each variable so that they may be displayed in the row labels (rownames) and column labels (colnames) of the table. In this case, sapply is using the attr function to obtain the label attribute of each variable. As R does not recognize the same set of metadata for each variable, Displayr adds the metadata to the attributes of the variables so that it may be returned later if necessary.
The result is a table that contains all of the Jaccard coefficients for each pair of variables.
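To see what the table looks like, the same logic can be run on a small made-up data set outside Q or Displayr. This is a sketch under assumptions: base R sum() stands in for Sum, which I am assuming is only available inside Q and Displayr, and the flavor names and responses are invented for illustration.

```r
# Base R stand-in for the article's function. sum() replaces Sum, which is
# assumed here to be available only inside Q and Displayr.
Jaccard = function(x, y) {
    complete = complete.cases(x, y)
    x = x[complete]
    y = y[complete]
    M.11 = sum(x == 1 & y == 1)
    M.10 = sum(x == 1 & y == 0)
    M.01 = sum(x == 0 & y == 1)
    M.11 / (M.11 + M.10 + M.01)
}

# Made-up responses from six people; NA marks a missing answer
input.variables = data.frame(mint    = c(1, 1, 0, 0, 1, NA),
                             caramel = c(1, 0, 0, 1, 1, 1),
                             berry   = c(0, 1, 1, 0, NA, 0))

n = length(input.variables)
m = matrix(NA, nrow = n, ncol = n,
           dimnames = list(names(input.variables), names(input.variables)))

# Diagonal is 1, top half holds the coefficients, bottom half stays NA
for (i in seq_len(n)) {
    for (j in seq_len(n)) {
        if (j == i) {
            m[i, j] = 1
        } else if (j > i) {
            m[i, j] = Jaccard(input.variables[, i], input.variables[, j])
        }
    }
}
round(m, 2)
```

With this toy data, mint and caramel share five complete cases and get 2 / (2 + 1 + 1) = 0.5, mint and berry share four complete cases and get 1 / 3, and caramel and berry never overlap and get 0. The sixth respondent, who has no answer for mint, still counts towards the caramel and berry pair.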
Code edits for variable sets
If you are working with a lot of variables and would rather reference them using the variable set name, please use the code below.
# Create a function to calculate the Jaccard coefficient between two binary variables.
# For each pair of variables, any case with a missing value for that pair
# is excluded from the coefficient for that pair.
Jaccard = function (x, y) {
    M.11 = Sum(x == 1 & y == 1)  # selected in both variables
    M.10 = Sum(x == 1 & y == 0)  # selected in x only
    M.01 = Sum(x == 0 & y == 1)  # selected in y only
    return (M.11 / (M.11 + M.10 + M.01))
}

# CHANGE the variable sets inside the data.frame() to those you want to include by using the variable set name
# check.names = FALSE keeps the column names exactly as they are
input.variables = data.frame(`Percieved proportion of time`, `Unaided Awareness`, check.names = FALSE)

# Drop any column named NET so that it is not compared with the other variables
input.variables = input.variables[, names(input.variables) != "NET"]

# Create an empty matrix of missing values to write results to
m = matrix(data = NA, nrow = length(input.variables), ncol = length(input.variables))

# Fill the diagonal with 1 and the top half with the coefficient for each pair
for (r in 1:length(input.variables)) {
    for (c in 1:length(input.variables)) {
        if (c == r) {
            m[r,c] = 1
        } else if (c > r) {
            m[r,c] = Jaccard(input.variables[,r], input.variables[,c])
        }
    }
}

# Pull off the column names to use as the row and column headers
variable.names = names(input.variables)
colnames(m) = variable.names
rownames(m) = variable.names

# Return the final table and name it jaccards
jaccards = m
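The two data-preparation lines in the variable set version can be tried on toy data in plain R. The sketch below uses made-up data frames in place of real variable sets, and I am assuming that NET is a summary column that Q or Displayr includes in a variable set; the column labels are invented.

```r
# Toy stand-ins for two variable sets. The labels, including "NET", are made
# up; inside Q or Displayr the sets would come from your data file.
set.one = data.frame(`Coke Zero` = c(1, 0, 1), NET = c(1, 0, 1),
                     check.names = FALSE)
set.two = data.frame(`Diet Pepsi` = c(0, 0, 1), NET = c(0, 0, 1),
                     check.names = FALSE)

# check.names = FALSE keeps the labels as they are, spaces included
input.variables = data.frame(set.one, set.two, check.names = FALSE)
names(input.variables)

# Drop every column called "NET", leaving only the individual variables
input.variables = input.variables[, names(input.variables) != "NET"]
names(input.variables)
```

Without check.names = FALSE, R would convert the labels to syntactic names (for example, replacing the space in a label with a dot), and the row and column headers of the final table would be harder to read.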
Visualize the results
A heatmap is an ideal way to visualize tables of coefficients like this. To create a heatmap for this data in Displayr:
- From the toolbar, click Visualization > Heatmaps > Heatmap.
- Click onto the page to add the visualization.
- From the object inspector go to Data > Data Source > Data, and select the output for the Jaccard coefficients that was created above.
- Click Calculate.
You'll get a result that looks like the following. With the blue default color palette, the largest Jaccard coefficients will be the darkest blue. Looking for dark patches off the diagonal of the table allows you to locate the pairs of products that have the biggest overlap according to the Jaccard index. In this case, we see strong overlaps between iPhone, iPod, and iPad owners in the top left, and between Samsung owners and people who own non-Mac computers over to the right.
In this code:
- I have defined a function called Jaccard. The function takes any two variables and calculates the Jaccard coefficient for those two variables. A function is a set of instructions that can be used elsewhere in the code. Particularly for more complicated blocks of code, writing a function like this can make your code more efficient and easier to read and check for mistakes.
- input.variables contains a data frame that has each of the variables you want to analyze as the columns.
- Initially, I have created a matrix full of missing values as a place to store my calculations.
- I have used two for loops to go through and calculate the Jaccard coefficients and fill up the top half of the matrix.
- The bottom half of the matrix is left empty. In Q, missing values are displayed as empty cells. As the bottom half of the matrix would be identical to the top half, empty cells help us to read the results more easily.
- I have used the sapply function to obtain the labels for each variable so that they may be displayed in the row labels (rownames) and column labels (colnames) of the table. In this case, sapply is using the attr function to obtain the label attribute of each variable. As R does not recognize the same set of metadata for each variable, Q adds the metadata to the attributes of the variables so that it can be retrieved later if necessary.
The result is a table that contains all of the Jaccard coefficients for each pair of variables.
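The label lookup in the last bullet can be illustrated in plain R. In this sketch the label attribute is attached by hand, because outside Q and Displayr nothing adds it automatically; the variable names follow the article's Q6_01 pattern, and the label text is made up.

```r
# Two toy variables named in the style used by the article
input.variables = data.frame(Q6_01 = c(1, 0, 1), Q6_02 = c(0, 0, 1))

# Attach a "label" attribute by hand (assumed to be added for you in Q/Displayr)
attr(input.variables$Q6_01, "label") = "Mint"
attr(input.variables$Q6_02, "label") = "Caramel"

# sapply calls attr(column, "label") on each column in turn
variable.names = sapply(input.variables, attr, "label")
variable.names

# Use the labels as row and column headers of a results matrix
m = matrix(NA, nrow = 2, ncol = 2)
colnames(m) = variable.names
rownames(m) = variable.names
m
```

If a variable has no label attribute, attr returns NULL for it, so it is worth checking that every variable is labeled before using the result as headers.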
Visualize the results
A heatmap is an ideal way to visualize tables of coefficients like this. To create a heatmap for this data in Q:
- Select Create > Charts > Visualization > Heatmap.
- Under Inputs > DATA SOURCE, click into Output and select the output for the Jaccard coefficients that was created above.
- Tick Automatic.
I’ve shown an example of the resulting heatmap below. With the blue default color palette, the largest Jaccard coefficients will be the darkest blue. Looking for dark patches off the diagonal allows you to identify the pairs of variables with the biggest overlap.
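Alongside the heatmap, the pairs with the biggest overlap can also be listed numerically. The following is a sketch in base R rather than a built-in Q feature; the matrix is a made-up stand-in with the same shape as the jaccards table (diagonal of 1, top half filled, bottom half missing), and ranked is a name I have introduced for illustration.

```r
# Toy upper-triangular matrix of coefficients, shaped like the jaccards
# table (labels and values are made up)
labels = c("Mint", "Caramel", "Berry")
jaccards = matrix(NA, nrow = 3, ncol = 3, dimnames = list(labels, labels))
diag(jaccards) = 1
jaccards["Mint", "Caramel"] = 0.50
jaccards["Mint", "Berry"] = 0.10
jaccards["Caramel", "Berry"] = 0.35

# Row and column positions of the cells above the diagonal
idx = which(upper.tri(jaccards), arr.ind = TRUE)

# One row per pair, sorted from largest to smallest coefficient
ranked = data.frame(first = rownames(jaccards)[idx[, "row"]],
                    second = colnames(jaccards)[idx[, "col"]],
                    coefficient = jaccards[idx])
ranked = ranked[order(ranked$coefficient, decreasing = TRUE), ]
ranked
```

Here the first row of ranked is the Mint and Caramel pair, which corresponds to the darkest off-diagonal cell in a heatmap of the same table.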
See Also
How R Works Differently in Q Compared to Other Programs
How to Use Different Types of Data in R
How to Reference Different Items in Your Project in R
How to Work with Conditional R Formulas
How to Add a Custom R Output to your Report
How to Create a Custom R Variable