Introduction
Text data is one of the great pains of survey analysis. Open-ended questions allow us to obtain data that is less biased by our preconceptions, but it takes so long to read and interpret! In this post, I describe two strategies for analyzing such data in Q: a quick-and-dirty approach and a quite-fast-and-awesome approach.
This article tells you how to automatically code unstructured text data. It will take you from unstructured verbatim (raw) text responses:
To a state where the verbatims are automatically categorized:
Requirements
You will need a Text question in order to perform automatic coding. Text questions are identified by their variable type in the Variables and Questions tab.
Method 1: Fully automated text analysis (quick and dirty)
- Go to Create > Text Analysis > Automatic Categorization > Unstructured Text.
- In the object inspector (the section that opens on the right of the screen), under Inputs > Text variable select the variable that holds the text you want to analyze.
- Change the Inputs > Number of categories to the number of categories you would like to classify the data into. I've chosen 15 for this example.
- The output will calculate automatically, and looks like this:
The category names on the left are something of a pot luck. In this output you can see:
- the automatically generated categories,
- the proportions (center column),
- the counts of the number of cases,
- examples of the types of responses that have been allocated to each category.
We haven’t yet cracked an algorithm that reliably gives human-understandable names. So, the secret is very much to look at the examples, expanding them out (by clicking the grey triangle) to see all the data.
From our experience and reports from our clients, fully-automatic text analysis can often give good insight. But it is not as good as doing manual coding. This is where the next approach comes in handy.
Method 2: Automatic updating of text analyses (quite fast and awesome)
The gold-plated approach to efficiently doing automated text analyses is as follows:
- Manually or semi-automatically perform an analysis of, say, 300 text responses.
- Go to Create > Text Analysis > Automatic Categorization > Unstructured Text.
- In the Object Inspector, set Existing categorization to the variable set that contains the manual or semi-automatic categorization.
Q will then train a machine learning model using the existing categorization and predict the categories for the remaining text, often with extremely high accuracy.
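The idea behind this step can be sketched with a toy classifier. This is purely illustrative and is not Q's actual algorithm; the data, category names, and functions below are invented for the example. A small set of manually coded responses trains a bag-of-words model, which then predicts categories for the uncoded remainder:

```python
# Illustrative sketch only (not Q's implementation): train a simple
# bag-of-words model on manually coded responses, then predict
# categories for the remaining uncoded text.
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

def train(coded):
    """coded: list of (response_text, category) pairs."""
    word_counts = defaultdict(Counter)
    for text, category in coded:
        word_counts[category].update(tokenize(text))
    return word_counts

def predict(word_counts, text):
    """Score each category by word overlap with its training text."""
    tokens = tokenize(text)
    scores = {cat: sum(counts[t] for t in tokens)
              for cat, counts in word_counts.items()}
    return max(scores, key=scores.get)

# A handful of coded responses stand in for the ~300 manually
# categorized cases mentioned above.
coded = [
    ("great taste and flavour", "Taste"),
    ("loved the flavour", "Taste"),
    ("too expensive for me", "Price"),
    ("the price is too high", "Price"),
]
model = train(coded)
print(predict(model, "amazing flavour"))  # prints Taste
```

Q's model is far more sophisticated than this word-overlap toy, but the workflow is the same: learn from the existing categorization, then apply the learned patterns to unseen responses.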
Fully automated categorization (the first part of this article) only forms mutually exclusive categorizations. However, when you use it for automatic updating, as per this section, it also works for overlapping (multiple response) categorizations.
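The difference between mutually exclusive and overlapping categorizations can be shown with a small, hypothetical keyword-rule sketch (again, not Q's method; the rules and names are invented). In an overlapping (multiple response) categorization, one response can fall into several categories at once:

```python
# Illustrative sketch only: overlapping (multiple response)
# categorization, where a single response can match several
# categories rather than exactly one.
rules = {
    "Taste": {"taste", "flavour"},
    "Price": {"price", "expensive", "cheap"},
}

def categories_for(response):
    """Return every category whose keywords appear in the response."""
    tokens = set(response.lower().split())
    return sorted(cat for cat, kws in rules.items() if tokens & kws)

print(categories_for("great flavour but too expensive"))
# prints ['Price', 'Taste'] - one response, two categories
```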
How to Save the Categories to your Data Set
You can easily save the categories assigned to your data so that you can use them in other analyses. Make sure that the output above is selected, then go to Create > Text Analysis > Advanced > Save Variable(s) > Categories. A new variable appears in your Variables and Questions tab called “Categories from…”.
To create a simple example of a table that uses the categorized data and another variable, use the blue and brown dropdowns. In the results below, I crossed the automatically generated categories with the education level of the respondents (by placing it in the columns of the table).
Additional details when using an existing categorization (Method 2)
There's no hard and fast rule on how much text data should be categorized manually or semi-automatically first. It comes down to the specifics of the text that's being categorized and how much the responses vary. For example, a brand list will be much simpler to automatically categorize into a list of brands as the text responses vary less and the categories are more straightforward. Full, unstructured text responses will require more cases, particularly if the responses are detailed or vary more between cases.
In general, you can use these rules of thumb:
- Every category that should be present in the final automatic categorization must be included in the existing manual/semi-automatic categorization.
- Each category should have a representative amount of data categorized into it. What's "representative" can be tough to pin down precisely, but the responses included in each category should represent the breadth of responses in the original text data for that category.
- Very roughly, this means about 20-30% of the text responses should be categorized, though this will vary from categorization to categorization.
Using an existing categorization to train the automatic categorization tool is often an iterative process. Tuning the tool will likely take a few rounds of manually or semi-automatically categorizing, running the automatic process, and checking the results. You are essentially training the model on text data the tool has never seen before, so expect a bit of back and forth.
Not all of the responses may fit into one of the existing categories according to the machine learning model; such responses are left uncategorized. For the algorithm to train itself on the existing categorization, it has to compare text that has and has not been coded into a category, so each category needs at least one, but not all, of the responses coded into it.
To capture those responses, first examine which responses haven't been categorized. Then refine the existing categorization by adding new categories with a representative sample of the text responses categorized into them, by categorizing more text responses into the existing categories to train the automatic categorization on a more diverse data set, or by a combination of both. There's no blanket threshold, but the more responses you categorize into a category, the better the model will be able to fit the data and "predict" which uncategorized text belongs in it.
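Why some responses end up uncategorized can be sketched with another toy example (illustrative only, not Q's implementation; the vocabulary-overlap rule and names are invented). A response that shares no evidence with any coded category gets no category, which is exactly the gap that refining the existing categorization is meant to close:

```python
# Illustrative sketch only: a response with no overlap with any
# coded category is left uncategorized, mirroring the behavior
# described above.
coded = {
    "Taste": ["great taste", "loved the flavour"],
    "Price": ["too expensive", "price is too high"],
}
# Vocabulary seen in each category's coded responses.
vocab = {cat: {w for r in texts for w in r.lower().split()}
         for cat, texts in coded.items()}

def categorize(response):
    """Best-matching category, or None when nothing matches."""
    tokens = set(response.lower().split())
    scores = {cat: len(tokens & words) for cat, words in vocab.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None  # None = uncategorized

uncoded = ["nice flavour", "hard to find in shops"]
uncategorized = [r for r in uncoded if categorize(r) is None]
print(uncategorized)  # prints ['hard to find in shops']
```

Here, adding a new category (say, availability) with some representative coded responses, or coding more responses into the existing categories, would give the model the evidence it needs to place "hard to find in shops".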
Refining the existing categorization with a wider range of text responses and re-running the automatic categorization can often be an iterative process as you train the model to categorize all of the responses correctly.
Next
The approach described above uses what we call unstructured text. Sometimes text data can have much more structure. We have two other tools designed for automatically categorizing such data:
How to Do Automatic List Categorization of Text Data with Q
How to Automatically Extract Entities and Sentiment from Text