
How to check the condition of normal distribution of residuals for the linear regression model in R and SPSS?
Definition
In a linear regression, an attempt is made to model a linear dependence between a dependent variable y and one or more independent variables x. An important assumption here is that the residuals (the difference between the observed values of y and the predicted values of y) are normally distributed.
A normal distribution of the residuals is important because it allows statistical estimates and tests to be performed on the model parameters. If the residuals are normally distributed, one can assume that the estimates of the regression parameters (such as the slope and the y-intercept) are normally distributed.
If the residuals are not normally distributed, the statistical significance of the regression parameters can be difficult to interpret, because the standard errors and t-values used to calculate significance rely on the assumption of normally distributed residuals.
Therefore, one must ensure that the residuals are normally distributed before interpreting the statistical estimates.
The term prerequisite can be a bit confusing here, since most of these conditions can only be checked after the model has been estimated. One can therefore think of them as conditions for correctly interpreting the coefficients rather than for performing the regression itself.
Methods
There are several methods to check the normal distribution of the residuals, such as examining histograms or Q-Q plots, calculating skewness and kurtosis, or performing normality tests such as the Shapiro-Wilk test. The following sections explain the procedure in more detail in R and SPSS.
Example in R
For this example, we again use the "swiss" dataset that ships with R. We estimate a regression model by regressing birth rate (Fertility) on education (Education) and store the result in the variable "fit".
fit <- lm(Fertility ~ Education, swiss)
We can display the result using the summary function:
summary(fit)

We observe a negative relationship (-0.8624) between education and birth rate.
We get the residuals of a model via the component $residuals.
residuen <- fit$residuals
To check whether the residuals are normally distributed, we can first plot the residuals in a histogram.
hist(residuen)

At first glance, the residuals do not look normally distributed. However, since we have a relatively small sample, we cannot draw a conclusion from the plot alone. We therefore also perform a statistical test, in this case the Shapiro-Wilk test.
In R this is easily done with:
shapiro.test(residuen)

The null hypothesis of the Shapiro-Wilk test is that the residuals are normally distributed. Our test yielded a p-value of 0.0592, which is above the usual significance level of 5%. Thus, we cannot reject the null hypothesis that the residuals are normally distributed, and we treat the condition as fulfilled.
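The Q-Q plot and the skewness and kurtosis mentioned under Methods can also be obtained in base R. The following is a minimal sketch, not part of the original example: it refits the same model so that it runs on its own, and the names "res", "z", "skewness" and "excess_kurtosis" are chosen here for illustration. The moment-based formulas are one common definition; packages may use slightly different corrections.

```r
# Refit the model from the example so this sketch is self-contained
fit <- lm(Fertility ~ Education, data = swiss)
res <- residuals(fit)

# Q-Q plot: points close to the reference line suggest approximate normality
qqnorm(res, main = "Normal Q-Q plot of residuals")
qqline(res)

# Moment-based sample skewness and excess kurtosis (base R only)
z <- (res - mean(res)) / sqrt(mean((res - mean(res))^2))
skewness <- mean(z^3)             # near 0 for a symmetric distribution
excess_kurtosis <- mean(z^4) - 3  # near 0 for a normal distribution

c(skewness = skewness, excess_kurtosis = excess_kurtosis)
```

Values of skewness and excess kurtosis close to zero, together with points that follow the line in the Q-Q plot, are consistent with normally distributed residuals.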
Example in SPSS
The procedure in SPSS is analogous to that in R.
Estimate a regression by clicking on: "Analyze" -> "Regression" -> "Linear ..."
Before you click on "OK" or "Paste", click on the button "Plots" and select "Histogram" under the plots of standardized residuals. Confirm with "Continue" and "OK" and view the results.

To run the Shapiro-Wilk test with SPSS, we must first save the residuals of the regression. The regression window also has a button called "Save". Click on it and select "Unstandardized" for the residuals. Then click on "Continue" and then on "OK".

A new variable has now been created in your data set with the name "RES_1". We can now perform a Shapiro-Wilk test for this variable.

- Click Analyze -> Descriptive Statistics -> Explore in the menu.
- Select the newly created variable "Unstandardized Residual [RES_1]".
- Click on the button "Plots" and check the box "Normality plots with tests".
- Click on "Continue" and then on "OK".
The output now contains the results of the tests for normal distribution.

As in R, the Shapiro-Wilk test is narrowly non-significant with a p-value of 0.059. We cannot reject the null hypothesis and therefore treat the residuals as normally distributed. The condition is therefore fulfilled.
What to do if the residuals are not normally distributed?
In practice, however, it is not always possible to fulfill the requirement of normally distributed residuals. In such cases, one should first check whether there is a linear relationship between the two variables at all. If this is not the case, the relationship between the variables may be better modeled with another, non-linear function. If a linear relationship is assumed, bootstrapping can be used with smaller samples to simulate the standard errors instead of deriving them analytically.
With larger samples, a violation of the normality assumption is less of a concern, because by the central limit theorem the coefficient estimates are approximately normally distributed anyway. Different papers have set different thresholds, such as N > 15 or N > 50. The more symmetrically distributed the residuals are, the fewer observations are needed for the central limit theorem to apply. In case of doubt, bootstrapping can also be performed on somewhat larger samples to avoid any doubts or objections to the results.
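The bootstrap mentioned above can be sketched in base R as follows. This is an illustrative example under stated assumptions, not a prescribed procedure: it resamples whole observations (a case bootstrap) from the "swiss" data used earlier, and the seed, the number of replicates "n_boot" and the percentile interval are choices made here for illustration.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible

fit <- lm(Fertility ~ Education, data = swiss)
n_boot <- 2000  # number of bootstrap replicates (assumed, illustrative value)

# Resample rows with replacement, refit, and keep the Education slope
boot_slopes <- replicate(n_boot, {
  idx <- sample(nrow(swiss), replace = TRUE)
  coef(lm(Fertility ~ Education, data = swiss[idx, ]))[["Education"]]
})

boot_se <- sd(boot_slopes)                         # bootstrap standard error
boot_ci <- quantile(boot_slopes, c(0.025, 0.975))  # percentile 95% interval

boot_se
boot_ci
```

The bootstrap standard error can be compared with the standard error reported by summary(fit). If the two are similar, the conclusions do not depend strongly on the normality assumption.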



