This article describes how to do simple web scraping in Q using R.
Web scraping (also referred to as web data extraction or web harvesting), is the process of using software to fetch the contents of a web page and extract information from it for use in some analysis. In Q, you may want to add a visualization or analysis of some web-based data to your report to supplement the results from your survey. In this article, I show you how to use the rvest package in R to bring in some data from a web page, and then connect that data to a visualization.
In this article, I will extract some data stored on a Wikipedia page. The data shows the top companies by revenue, and it is contained in a table on the Wiki page.
Obtaining the data
In this example we will add the table of data as an R Output:
- Select Create > R Output.
- Paste the below code into the R CODE field.
- Click Calculate.
# Reading in the table from wikipedia
page = read_html("https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue")
# Obtain the peice of the web page that corresponds to the "wikitable" node
my.table = html_node(page, ".wikitable")
# Convert the html table element into a tibble
my.table = html_table(my.table, fill = TRUE)
# Move the company names into the rows of the table
# Remove rows not needed
my.table = my.table[-c(1),]
# Remove columns not needed
my.table = my.table[, -c(6,7)]
# Convert employees into numeric values
my.table[, 5] = parse_number(my.table[, 5])
This code produces a table containing the data from Wikipedia.
In this code I have:
- Loaded the rvest package, which has functions for dealing with web pages (and also introduces functions from the package xml2 that are also handy for processing html).
- Used the function read_html to obtain the html for the web page.
- Used the function html_node to obtain the part of the web page that corresponds to the table element (called .wikitable)
- Used the function html_table to convert the html table into a data frame.
- Used column_to_rownames from the tibble package to move the company names from the body of the table into the row names. This makes for easier labeling in visualizations later on.
- Removed columns for the company names, headquarters, and reference (“ref”).
- Used the parse_number function from the readr package to convert the comma-formatted numbers into true numeric values.
We can visualize parts of this table really easily with the options in Create > Charts > Visualization. I’ll use a bar chart, and I will use the options to select, sort, and show the parts of the table that I am most interested in. The steps are as follows:
- Select Create > Charts > Visualization > Bar Chart.
- Click on the Outputs box under Inputs > DATA SOURCE in the Object Inspector on the right, and select the table created above (my.table).
- Click into Inputs > COLUMN MANIPULATIONS > Columns to show and type Employees. This means my bar chart will show the number of employees for each company, and it won’t try to plot the other columns from the table.
- Tick Inputs > ROW MANIPULATIONS > Sort rows and Sort in decreasing order. This automatically sorts the companies according to the number of employees.
- Change Inputs > ROW MANIPULATIONS > Number of rows from top to show to 20. This lets me present the top 20 companies according to the number of employees.
- Tick Automatic. This ensures any changes in the data or the chart will be reflected automatically.
I’ve also changed fonts, labels, and axis properties in the Chart section of the Object Inspector. My chart looks like this: