This article describes how to create a scatter plot visualization that displays the values of two different variables as points. The data for each point is represented by its horizontal (x) and vertical (y) positions on the visualization.
A scatter plot displays data for a set of variables (columns in a table), where each row of the table is represented by a point in the scatter plot. The variables can be both categorical and numeric.
Method 1 - Default scatter plot
- From the toolbar, go to Create > Charts > Visualization > Scatter > Scatter.
- From the object inspector, go to Inputs > DATA SOURCE and select either an existing output (such as a table or R output), or select the variables to represent the X coordinates and Y coordinates.
- Click Calculate.
OPTIONAL: You can customize the sizes and colors. In the example above, I have selected Inputs > DATA SOURCE > Labels and selected an additional variable to label the points.
Method 2 - Labeled scatter plot
- From the toolbar, go to Create > Charts > Visualization > Scatter > Labeled Scatter.
- From the object inspector, go to Inputs > DATA SOURCE and select either an existing output (such as a table or R output) or select the variables to represent the X coordinates and Y coordinates.
- Click Calculate.
The following is an explanation of the options available in the object inspector for this specific visualization.
Inputs > DATA SOURCE
Scatter plots accept tables supplied using Paste or type data. Each row of the input table is shown as a separate point. The first two columns control the x and y coordinates, respectively. If provided, the third column controls the sizes of the points and the fourth column controls their colors. Additional columns in the table can be referenced in annotations. When the input table contains row names, these are used as the data labels. If multiple tables are selected, each one must be in the format described above, and the row names and column names must be the same across all tables. Note that the default format of the input data for scatter plots differs from that of other visualizations, so Row/Column manipulations may not behave as expected. In these cases, you may want to select Input data contains y-values in multiple columns.
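To make the column roles concrete, here is a minimal Python sketch of the mapping described above. It illustrates the layout only and is not how the visualization is implemented internally; the names `Point` and `table_to_points` and the sample data are hypothetical.

```python
from typing import List, NamedTuple, Optional, Sequence


class Point(NamedTuple):
    """One marker on the scatter plot (hypothetical structure)."""
    x: float
    y: float
    size: Optional[float]
    color: Optional[str]
    label: Optional[str]


def table_to_points(rows: Sequence[Sequence],
                    row_names: Optional[Sequence[str]] = None) -> List[Point]:
    """Turn each table row into one point using positional column roles.

    Column 1 -> x, column 2 -> y, column 3 -> size (optional),
    column 4 -> color (optional). Row names, if present, become labels.
    """
    if row_names is not None and len(row_names) != len(rows):
        raise ValueError("row_names must have one entry per row")
    points = []
    for i, row in enumerate(rows):
        if len(row) < 2:
            raise ValueError("each row needs at least x and y columns")
        points.append(Point(
            x=row[0],
            y=row[1],
            size=row[2] if len(row) > 2 else None,
            color=row[3] if len(row) > 3 else None,
            label=row_names[i] if row_names is not None else None,
        ))
    return points


# Illustrative data: two rows, four columns (x, y, size, color).
table = [
    [1.0, 2.5, 10, "Group A"],
    [2.0, 3.1, 20, "Group B"],
]
points = table_to_points(table, row_names=["Brand 1", "Brand 2"])
```

A table with only two columns still works in this sketch: the size and color of each point are simply left unset.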
Alternatively, you can assign X coordinates, Y coordinates, Sizes and Colors to be variables or outputs. This option is more flexible because each of these four components can be assigned separately instead of being extracted from the same table. However, it is also more complicated because the behavior may change slightly depending on the inputs chosen.
- Inputs are variables. This is the simplest use case; a marker is shown for each entry in the variables (i.e., the variables are expected to be the same length).
- Inputs are tables. Simple one-column tables behave exactly like variables. However, where the tables have additional attributes, the chart will attempt to use these as well. If the tables have row labels, these will be used as the labels for the data points. It is also possible to explicitly use the row labels as X coordinates by selecting Use category labels instead of values. Where this is selected and a banner is used, the span labels are used instead of the row labels. If the Y coordinates input is a 2-dimensional table, then the columns will be treated as separate data series (i.e., shown in different colors). If the Y coordinates table contains multiple statistics, then these may be used in the annotations.
- X or Y coordinates are a Standard R Regression model output. In this case, either the regression coefficients or the importance scores are used as the data input. This is particularly useful for creating scatter plots from a Driver Analysis output.
Input data contains y-values in multiple columns. When this is selected, each cell in the input table is shown as a separate point. The values in the table are used as the y-coordinates, whereas the x-coordinates are taken from the row labels. Each column is shown as a separate group, with the colors of the groups controlled by the color palette (under Data series). All points are shown with the same size. If the table contains multiple statistics, these can be used to add annotations to the chart.
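The reshaping implied by this option can be sketched in a few lines of Python. This is an illustration under assumed names (`wide_table_to_points` and the sample table are hypothetical), not the visualization's actual code: every cell becomes one point, the row label supplies x, the cell value supplies y, and the column name identifies the group.

```python
from typing import List, Sequence, Tuple


def wide_table_to_points(row_labels: Sequence[str],
                         column_names: Sequence[str],
                         values: Sequence[Sequence[float]]
                         ) -> List[Tuple[str, float, str]]:
    """Return one (x, y, group) tuple per table cell.

    x is the row label, y is the cell value, and group is the column
    name, so each column forms its own data series.
    """
    if len(values) != len(row_labels):
        raise ValueError("values must have one row per row label")
    points = []
    for col_index, group in enumerate(column_names):
        for row_index, x in enumerate(row_labels):
            row = values[row_index]
            if len(row) != len(column_names):
                raise ValueError("each row must have one value per column")
            points.append((x, row[col_index], group))
    return points


# Illustrative table: three rows (x categories) by two columns (groups).
row_labels = ["Q1", "Q2", "Q3"]
column_names = ["North", "South"]
values = [
    [10, 7],
    [12, 9],
    [15, 8],
]
long_points = wide_table_to_points(row_labels, column_names, values)
```

A table with three rows and two columns therefore yields six points in two groups, and no size information is carried, which matches the note that all points are shown with the same size.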