How to Work with Conditional R Formulas

Requirements

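An R variable, R output or R data set.
The main condition operators are == (equal to), != (not equal to), >, >=, < and <= (comparisons), & (and), | (or), ! (not) and %in% (matches any value in a set).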

Method

1. Boolean expression

A Boolean expression is an expression which evaluates to a logical value of true or false. You can test single values or multiple values (value by value) against one another. Results that are true are returned as values of 1 and false as 0 (i.e., this is a way to construct a binary variable).

For example, we have two numeric variables, v1 and v2:

v1 != v2 returns a 1 for observations where v1 and v2 differ and a value of 0 when they are the same.
is.na(v1) returns a 1 for observations in v1 that have the value of NA (missing), otherwise it returns a 0.
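rowSums(cbind(v1, v2)) > 0 returns a 1 if the sum of v1 and v2 is greater than 0, otherwise it returns a 0. The variables are combined with cbind first because rowSums expects a table of columns rather than separate variables.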
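
As a rough sketch, the tests above can be tried on two made-up variables (the values of v1 and v2 below are illustrative assumptions, not data from a Q project). In plain R the comparisons return TRUE and FALSE, so as.numeric is used here only to show the 1/0 form:

```r
# Hypothetical example data; in Q, v1 and v2 would be existing numeric variables.
v1 = c(1, 0, NA, 2)
v2 = c(1, 3, 0, -5)

differ = v1 != v2                            # NA where either value is missing
missing.v1 = is.na(v1)                       # TRUE only for the missing observation
positive.sum = rowSums(cbind(v1, v2)) > 0    # NA where either value is missing

# Convert the logical results to 1/0 to see the binary-variable form.
as.numeric(differ)       # 0 1 NA 1
as.numeric(missing.v1)   # 0 0 1 0
```

Note that a comparison involving a missing value is itself missing, which is why is.na is needed to test for NA directly.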

Two other useful functions to use alongside conditional formulas are any() and all(). For example:

all(v1 == v2) returns a single TRUE value if all the values in the corresponding rows of both variables are equal to each other.
any(v1 == v2) returns a single TRUE if any of the values in the corresponding rows of both variables are equal to each other.
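
A small sketch of any() and all() on made-up data follows. The vectors are hypothetical, and the na.rm argument is shown because missing values can otherwise make the single result NA:

```r
# Hypothetical example data.
v1 = c(1, 2, 3)
v2 = c(1, 5, 3)

all(v1 == v2)   # FALSE: the second pair differs
any(v1 == v2)   # TRUE: the first and third pairs match

# A missing value leaves the overall answer undecided unless it is removed.
v3 = c(1, NA, 3)
all(v1 == v3)                 # NA
all(v1 == v3, na.rm = TRUE)   # TRUE
```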

2. The ifelse method

There is a shortcut method called ifelse that lets you write a condition in a single line. This is useful when creating R filter variables as it will perform the logic on each response in the variable used in the condition.

For example, you may want to mark an Other open end as missing unless the respondent selected the Other option in the corresponding Pick Any:

ifelse(Q2_Other == 1, Q2_Other_TEXT, NA)

The formula below will return a Yes if x is greater than 1, otherwise a No:

ifelse(x>1,"Yes","No")

Note, this returns a value for each record in your x object. You can also nest this to additionally return Maybe if y is greater than 1:

ifelse(x>1,"Yes", ifelse(y>1,"Maybe","No"))
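
To see the record-by-record behavior, here is a minimal sketch of the nested formula on made-up values of x and y (both vectors are assumptions for illustration):

```r
# Hypothetical example data, one value per record.
x = c(3, 0, 1, NA)
y = c(0, 2, 1, 5)

result = ifelse(x > 1, "Yes", ifelse(y > 1, "Maybe", "No"))
result   # "Yes" "Maybe" "No" NA
```

The last record stays NA because its condition on x cannot be evaluated, even though y is greater than 1. If missing values should be treated differently, test for them first with is.na.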

3. The if...else method

Another option is the if...else structure, which controls which block of code is run. Unlike ifelse, if expects a single TRUE or FALSE condition rather than one value per record: older versions of R used only the first value of a longer condition (with a warning), and R 4.2.0 and later stop with an error. This differs from if-else structures in JavaScript code, where each response in the data set is checked in order to recode a variable. To recode variables using R, please see How to Create a New Variable Based on Other Variables using R and How to Recode Data Based on a Lookup Using R.

For example, if the applied filter (QFilter) selects too few records, show an error message; otherwise, show the intended output:

if(sum(QFilter) < 20) "Too little data" else
sort(summary(Age[QFilter]),decreasing=TRUE)

This is the same set of conditions using optional curly brackets and spacing:

if(sum(QFilter) < 20){
 "Too little data"
} else {
 sort(summary(Age[QFilter]),decreasing=TRUE)
}
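
The following sketch wraps the same check in a function so it can be tried on made-up data. The function name summarise.if.enough, the min.n parameter and the sample values are all hypothetical; the example above hard-codes 20:

```r
# Hypothetical data: Age is a categorical variable, QFilter a TRUE/FALSE filter.
Age = factor(c("18 to 24", "25 to 29", "18 to 24", "30 to 34", "18 to 24"))
QFilter = c(TRUE, TRUE, FALSE, TRUE, TRUE)

# min.n is an illustrative parameter for the minimum filtered sample size.
summarise.if.enough = function(values, filter, min.n = 20) {
    if (sum(filter) < min.n) {
        "Too little data"
    } else {
        sort(summary(values[filter]), decreasing = TRUE)
    }
}

too.small = summarise.if.enough(Age, QFilter)           # only 4 records pass the filter
counts = summarise.if.enough(Age, QFilter, min.n = 3)   # counts per category, largest first
```

The condition sum(QFilter) < min.n is a single value, which is what if requires.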

4. The subscripting method

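A further conditional method, which is useful for banding variables, is to apply filter conditions via subscripting. Using the same x as before, we can write the following. The conditions test the original x while the new codes are written to a copy, y, so that earlier assignments cannot affect later conditions:

y = x
y[x>=4] = 1
y[x==3] = 2
y[x<=2] = 3
y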

Note, this returns a value for each record in your x object.

Alternatively, you can assign labels to a new text variable so it returns a text output instead:

y = rep(NA_character_, length(x))
y[x>=4] = "Yes"
y[x==3] = "Maybe"
y[x<=2] = "No"
y

Note, Q will either create Number or Text variables, but you can subsequently change the Question Type to a categorical Pick One or whatever you like after you save the variable for the first time.
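
One thing to watch with subscripting is the order of assignment when the new codes overlap with the old values. The sketch below uses a made-up set of 1 to 5 ratings (an assumption for illustration) to compare writing to a copy with overwriting the variable being tested:

```r
x = c(5, 4, 3, 2, 1)   # hypothetical ratings on a 1 to 5 scale

# Test the original values in x, but write the new codes to a copy.
banded = x
banded[x >= 4] = 1
banded[x == 3] = 2
banded[x <= 2] = 3
banded        # 1 1 2 3 3

# Overwriting the tested variable changes what the later conditions see:
# after the first two assignments every value is 2 or lower,
# so the last assignment sets all of them to 3.
overwritten = x
overwritten[overwritten >= 4] = 1
overwritten[overwritten == 3] = 2
overwritten[overwritten <= 2] = 3
overwritten   # 3 3 3 3 3
```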

5. The switch method

An alternative to if...else is the switch function. We can write the following to return a different code for x:

switch(x,3,3,2,1,1)

In this code, the value of x represents an index which tells it which subsequent value to return. So if x equals 4, it will return 1 as this is the fourth of the five recode values.

Note, switch takes and returns a single value only, so x must be a single number rather than a whole variable.
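
As a sketch of how this behaves, the example below first calls switch on a single value and then applies it to each value of a variable with sapply. The helper recode.one and the ratings vector are hypothetical names introduced for illustration:

```r
x = 4
single.result = switch(x, 3, 3, 2, 1, 1)   # the fourth recode value, 1

# switch handles one value at a time, so apply it to each record in turn.
recode.one = function(value) switch(value, 3, 3, 2, 1, 1)
ratings = c(1, 3, 5)                # hypothetical ratings, all between 1 and 5
recoded = sapply(ratings, recode.one)
recoded                             # 3 2 1
```

Values outside the range 1 to 5 have no matching recode value, so this approach assumes every record falls within that range.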

6. The case_when method

The dplyr R package offers the case_when function which is particularly useful for working with categorical data. Below is an example of how to recode an Age variable into groups:

dplyr::case_when(
Age == "18 to 24" ~ 1,
Age == "25 to 29" ~ 2,
Age %in% c("40 to 44", "45 to 49") ~ 3,
Age %in% c("50 to 54", "55 to 64", "65 or more") ~ 4,
TRUE ~ 0
)

Looking at the code above, note that:

For a single category, we use the == operator.
For multiple categories, we list them surrounded by c() and use the %in% operator.
The values are assigned at the end of the line, after a ~.
The TRUE ~ 0 is optional and R reads this as assign 0 to "everybody else". If records don't fall into any of these conditions and this line is omitted, the result will return NA.
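
The sketch below runs the same recode on a handful of made-up Age values so the effect of each line can be checked. It assumes the dplyr package is installed, and the sample values are illustrative only:

```r
# Hypothetical category labels, including one missing value.
Age = c("18 to 24", "25 to 29", "45 to 49", "65 or more", "30 to 34", NA)

age.group = dplyr::case_when(
    Age == "18 to 24" ~ 1,
    Age == "25 to 29" ~ 2,
    Age %in% c("40 to 44", "45 to 49") ~ 3,
    Age %in% c("50 to 54", "55 to 64", "65 or more") ~ 4,
    TRUE ~ 0
)
age.group   # 1 2 3 4 0 0
```

Both "30 to 34" and the missing value fall through to TRUE ~ 0 here, so if missing data should stay missing, add a line such as is.na(Age) ~ NA_real_ before the final condition.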

Let's now look at a more complex example that references multiple questions, Age and d4 (living arrangements). Here, we wish to create a household structure variable by using the & operator:

dplyr::case_when(
# Young singles
Age %in% c("18 to 24", "25 to 29", "30 to 34") & 
d4 %in% c("Living alone", "Sharing accommodation") ~ 1,
# Older singles
!Age %in% c("18 to 24", "25 to 29", "30 to 34") & 
d4 %in% c("Living alone", "Sharing accommodation") ~ 2,
# Young couples
Age %in% c("18 to 24", "25 to 29", "30 to 34") & 
d4 == "Living with partner only" ~ 3,
# Older couples
!Age %in% c("18 to 24", "25 to 29", "30 to 34") & 
d4 == "Living with partner only" ~ 4,
# Young families
Age %in% c("18 to 24", "25 to 29", "30 to 34") & 
d4 %in% c("Living with partner and children", "Living with children only") ~ 5,
# Older families
!Age %in% c("18 to 24", "25 to 29", "30 to 34") & 
d4 %in% c("Living with partner and children", "Living with children only") ~ 6,
# Everyone else
TRUE ~ 7
)

A much nicer way of computing a household structure variable is shown in the code below:

young = Age %in% c("18 to 24", "25 to 29", "30 to 34")
single = d4 %in% c("Living alone", "Sharing accommodation")
partner.only = d4 == "Living with partner only"
children = d4 %in% c("Living with partner and children", "Living with children only")

dplyr::case_when(
young & single ~ 1,
!young & single ~ 2,
young & partner.only ~ 3,
!young & partner.only ~ 4,
young & children ~ 5,
!young & children ~ 6,
!children & !partner.only & !single ~ 7
)

This approach first creates four variables as inputs to the main variable of interest. These are so-called scratch variables: they're only accessible to this specific code, and not from any other object or code in Q. They exist for the sole purpose of computing household structure. The first four lines each compute a variable with TRUE or FALSE for each row of data, and case_when then evaluates these using standard Boolean logic for each row of data.
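
To make the row-by-row logic visible, here is a sketch that builds the scratch variables from a few made-up records and prints them next to the result. It assumes dplyr is installed; the sample values, including the "Living with parents" category, are hypothetical, and TRUE ~ 7 is used as the catch-all line:

```r
# Hypothetical records, one per respondent.
Age = c("18 to 24", "45 to 49", "30 to 34", "65 or more", "25 to 29")
d4 = c("Living alone", "Living with partner only", "Living with children only",
       "Sharing accommodation", "Living with parents")

young = Age %in% c("18 to 24", "25 to 29", "30 to 34")
single = d4 %in% c("Living alone", "Sharing accommodation")
partner.only = d4 == "Living with partner only"
children = d4 %in% c("Living with partner and children", "Living with children only")

household = dplyr::case_when(
    young & single ~ 1,
    !young & single ~ 2,
    young & partner.only ~ 3,
    !young & partner.only ~ 4,
    young & children ~ 5,
    !young & children ~ 6,
    TRUE ~ 7
)

# Inspect the TRUE/FALSE inputs and the resulting code side by side.
data.frame(young, single, partner.only, children, household)
```

The last record matches none of the six named conditions, so it receives the catch-all code of 7.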

Note, be careful about using as.numeric to convert categorical data into numeric data as a way to avoid referencing value labels in your code. The values it assigns will not necessarily match the values that have been set in the raw data file. For example, if the data file contains values of 1 for Male and 2 for Female, but no respondent selected Male, then the value of 1 would be assigned to Female.

In these cases, it's better to create a numeric copy of your variable to reference instead. You can do this by right-clicking on your variable on the Variables and Questions tab and selecting Copy and Paste Question(s) > Exact Copy, then changing the Question Type to Number.
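
The as.numeric behavior described above can be reproduced in plain R with a small factor. This is a sketch on made-up data, assuming the categorical variable reaches R as a factor whose levels are only the categories actually present:

```r
# Data file codes are 1 = Male and 2 = Female, but no respondent selected Male.
gender = factor(c("Female", "Female", "Female"))

levels(gender)       # "Female" is the first and only level
as.numeric(gender)   # 1 1 1, not the 2 stored in the data file

# Referencing the label avoids depending on the assigned codes.
is.female = gender == "Female"
```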

See Also

How to Learn R

How to Use R in Q

How R Works Differently in Q Compared to Other Programs

How to Use Different Types of Data in R

How to Work with Data in R

How to Work with R Loops

How to Reference Different Items in Your Project in R

How to Use R Code to Create a Filter Based on Single-Response Questions

How to Create a New Variable Based on Other Variables using R

How to Filter Raw Data Using R

How to Recode Data Based on a Lookup Using R