How to Automatically Remove Outliers from Regressions and GLMs

Sometimes your data may have extreme values or respondents who are outliers. These data points can make the fit of a regression model less accurate. You may already know that there are extreme values in your data that need to be removed through the data cleaning process, or you may have received a warning message about unusual observations after running your model.

You can remove outliers from your data before the final model is fitted to make the model's results more robust. Sometimes removing outliers will make the warning message go away, but confounding data or other data issues may also be causing that warning to appear. This article describes how to go from a regression model that includes all respondents, including those who might be outliers, to one where a specified percentage of outliers (for example, 25%) is automatically removed from the model, which can affect the key conclusions.

Requirements

A linear regression, binary logit, ordered logit, or another generalized linear model (GLM)

Method

Select your model output on the Page.
From the object inspector, go to Inputs and enter a number between 0 and 50 in the Automated outlier removal percentage field. This specifies the percentage of the data that is removed from the analysis as outliers. See below for more detail on how outliers are identified and what is included in the final results.
Click Calculate if the Automatic box is not already checked.

You will need to make judgments, trading off the following:

The more observations you remove, the less the model represents the entire data set. So start by removing a small percentage (e.g., 1%).
Does the warning disappear? If you can remove, say, 10% of the observations and the warning disappears, that may be a good thing. However, it is possible that you will always get warnings. The warnings are designed to alert you to situations where rogue observations are potentially causing a detectable change in conclusions, but often this change is so small as to be trivial.
How much do the key conclusions change? If they change a lot, you need to consider inspecting the raw data and working out why the observations are rogue (i.e., is there a data integrity issue?).

Technical Details

All regression types except for Multinomial Logit support this feature. If zero is entered for this input control, no outlier removal is performed and a standard regression output for the entire (possibly filtered) dataset is returned. If a non-zero value is entered, the regression model is fitted twice. The first regression model uses the entire dataset (after filters have been applied) and identifies the observations that generate the largest residuals. The user-specified percentage of cases with the largest residuals is then removed. The regression model is refitted on this reduced dataset and that output is returned. The specific residual used varies depending on the regression Type.
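The two-pass procedure described above can be sketched in plain Python for the simplest case, an unweighted linear regression with a single predictor. This is only an illustration under stated assumptions, not the software's actual implementation: the function names (`fit_line`, `studentized_residuals`, `fit_with_outlier_removal`) are hypothetical, and the rule for rounding the number of removed cases is an assumption, since the article does not specify it.

```python
import math


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def studentized_residuals(x, y):
    """Externally studentized residuals for the one-predictor linear model."""
    n, p = len(x), 2  # p = number of coefficients (intercept + slope)
    a, b = fit_line(x, y)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sse = sum(e * e for e in resid)
    out = []
    for xi, e in zip(x, resid):
        h = 1.0 / n + (xi - x_bar) ** 2 / sxx  # leverage of this case
        # Residual variance re-estimated with this case left out.
        s2_loo = (sse - e * e / (1.0 - h)) / (n - p - 1)
        out.append(e / math.sqrt(s2_loo * (1.0 - h)))
    return out


def fit_with_outlier_removal(x, y, percent):
    """Fit once, drop the `percent`% of cases with the largest
    absolute studentized residuals, then refit on the rest.
    Returns ((a, b), indices_of_kept_cases)."""
    if percent == 0:
        return fit_line(x, y), list(range(len(x)))
    t = studentized_residuals(x, y)
    # Assumption: the number of removed cases is rounded down.
    n_remove = int(len(x) * percent / 100)
    order = sorted(range(len(x)), key=lambda i: abs(t[i]))
    kept = sorted(order[: len(x) - n_remove])
    coef = fit_line([x[i] for i in kept], [y[i] for i in kept])
    return coef, kept


# Made-up data: roughly y = 1 + 2x, with one rogue case at index 4.
noise = [0.1, -0.2, 0.15, -0.1, 0.05, 0.0, -0.15, 0.2, -0.05, 0.1]
x = list(range(10))
y = [1 + 2 * xi + ni for xi, ni in zip(x, noise)]
y[4] += 30

(a_all, b_all), _ = fit_with_outlier_removal(x, y, 0)
(a_trim, b_trim), kept = fit_with_outlier_removal(x, y, 10)
print("slope with all cases:", round(b_all, 3))
print("slope after removing 10%:", round(b_trim, 3))
print("cases kept:", kept)
```

Comparing the two slopes is a quick way to judge how much the key conclusions change when cases are removed.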

- Linear: The studentized residual in an unweighted regression and the Pearson residual in a weighted regression. The Pearson residual in the weighted case adjusts appropriately for the provided survey weights.
- Binary Logit and Ordered Logit: A type of surrogate residual from the sure R package (see Greenwell, McCarthy, Boehmke and Liu (2018) for more details). In Binary Logit it uses the resids function with the jitter parametrization. In Ordered Logit it uses the resids function with the latent parametrization to exploit the ordered logit structure.
- NBD Regression, Poisson Regression: A studentized deviance residual in an unweighted regression and the Pearson residual in a weighted regression.
- Quasi-Poisson Regression: A type of quasi-deviance residual via the rstudent function in an unweighted regression and the Pearson residual in a weighted regression.

When using Automated outlier removal percentage, the studentized residual computes the distance between the observed and fitted value for each point and standardizes (adjusts) it based on the influence of the point and an externally adjusted variance calculation. The studentized deviance residual computes the contribution the fitted point makes to the likelihood and standardizes (adjusts) it in the same way. The Pearson residual in the weighted case computes the distance between the observed and fitted value and adjusts appropriately for the provided survey weights. See the rstudent function in R and Davison and Snell (1991) for the specifics of the calculations.
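For reference, the standard textbook forms of two of these residuals are written out below. The notation is introduced here for illustration and does not come from the article: e_i is the ordinary residual, h_ii is the leverage of case i, n is the number of cases, p is the number of estimated coefficients, w_i is the case weight, and V is the variance function of the model. The Pearson form shown is the usual weighted GLM definition; the exact adjustment the software applies for survey weights is not stated in the article and may differ.

```latex
% Externally studentized residual for a linear model
\[
t_i = \frac{e_i}{s_{(i)}\sqrt{1 - h_{ii}}},
\qquad
s_{(i)}^2 = \frac{\sum_{j} e_j^2 - e_i^2/(1 - h_{ii})}{n - p - 1}
\]
% Weighted Pearson residual for a GLM
\[
r_i^{P} = \frac{\sqrt{w_i}\,(y_i - \hat{\mu}_i)}{\sqrt{V(\hat{\mu}_i)}}
\]
```

Here s_(i)^2 is the residual variance re-estimated with case i left out, which is what "externally adjusted variance" refers to.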
