Why do so many statisticians think a normality assumption is required in linear regression?

03 Jan 2024

In my time working in health science, I have been troubled by the number of times I have encountered statisticians and practitioners of statistics who are absolutely sure that either the variables or the residuals in a linear regression must be approximately normally distributed, and that the model is invalid otherwise.

This idea is completely false. In this post I want to explore why it is nonetheless so widely believed by professionals in the field.

Linear regression

When statisticians say a model is ‘valid’, they usually mean that the model plausibly satisfies some conditions that guarantee ‘nice’ results. By ‘nice’, it is usually meant that the parameter estimates converge to the true values of the data-generating process, and that the distribution of the parameter estimates is known (usually normal), as the sample size becomes large. There are several slightly different notions of convergence for random variables, but the most commonly used is convergence in probability. An estimator \( \hat{\beta} \) that converges in probability to the true value \( \beta \) is called consistent.
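For readers less familiar with the jargon, writing \( \hat{\beta}_n \) for the estimate based on \( n \) observations, convergence in probability means that the estimator is eventually within any fixed tolerance of the truth with probability approaching one:

\[
\hat{\beta}_n \xrightarrow{p} \beta
\quad \iff \quad
\lim_{n \to \infty} \Pr\left( \lVert \hat{\beta}_n - \beta \rVert > \varepsilon \right) = 0
\quad \text{for every } \varepsilon > 0.
\]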

In this post I will use the following notation: \( y \) is an \( n \)-dimensional vector whose elements are \( y_i \), \( X \) is an \( n \times k \) matrix whose \( i \)-th row is \( x_i^T \), \( \beta \) is a \( k \)-dimensional vector of parameters to be estimated, and \( \epsilon \) is an \( n \)-dimensional vector of residuals whose elements are \( \epsilon_i \).

In the linear regression setting, there are standard sets of assumptions that are sufficient to guarantee consistency. Note the plural - there are several. The first version you often learn while studying statistics is the following:

1.1. Linearity. The data-generating process is of the form \( y = X \beta + \epsilon \).

1.2. Strict exogeneity. \( \mathbb{E}[\epsilon \; | \; X] = 0 \).

1.3. No multicollinearity. The rank of \( X \) is \( k \) with probability \( 1 \).

1.4. Conditional normality of residuals. \( \epsilon \mid X \sim N(0, \sigma^2 I) \), where \( I \) is the \( n \times n \) identity matrix.

Note that an explicit normality assumption appears in 1.4.
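To make the notation concrete, here is a minimal simulation sketch (the use of NumPy and all the numbers are my own choices for illustration) of a data-generating process satisfying 1.1-1.4, with \( \beta \) estimated by ordinary least squares:

```python
# Minimal sketch: simulate data satisfying assumptions 1.1-1.4 and fit OLS.
# All numbers here are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
n, k = 500, 3
beta = np.array([2.0, -1.0, 0.5])               # true parameter vector

X = rng.normal(size=(n, k))                     # full column rank w.p. 1   (1.3)
eps = rng.normal(loc=0.0, scale=1.0, size=n)    # eps | X ~ N(0, sigma^2 I) (1.2, 1.4)
y = X @ beta + eps                              # linearity                 (1.1)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)                                 # close to [2.0, -1.0, 0.5]
```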

However, several of these conditions can be weakened. For example, the following set of conditions is also sufficient:

2.1. Linearity. The data-generating process is of the form \( y = X \beta + \epsilon \).

2.2. Ergodic stationarity. The stochastic process \( (y_i, x_i) \) is stationary and ergodic.

2.3. Weak exogeneity. \( \mathbb{E}[\epsilon_i x_i] = 0 \; \forall i \).

2.4. Invertibility. \( \mathbb{E}[x_i x_i^T ] \) is invertible.

2.5. Central limit theorem (CLT). \( \sqrt{n} \left( \frac{1}{n} \sum_{i} x_i \epsilon_i\right) \xrightarrow{d} N(0, \Sigma) \), where \( \xrightarrow{d} \) denotes convergence in distribution.

Note here that the explicit normality assumption 1.4 is replaced by 2.5, which is much weaker and only requires \( \frac{1}{n} \sum_{i} x_i \epsilon_i \) to obey a central limit theorem. The most basic version of the CLT applies to independent and identically distributed random variables, but there are many other versions with weaker conditions.
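To see this in action, here is a small simulation sketch (my own construction, not from any particular study): the residuals are drawn from a centred exponential distribution, which is heavily skewed and nothing like a normal, yet the OLS estimates still centre on the true \( \beta \) and their sampling distribution is approximately normal, just as 2.1-2.5 imply.

```python
# Sketch: OLS with heavily skewed (centred exponential) residuals.
# Conditions 2.1-2.5 hold, so the estimator is still consistent and
# asymptotically normal, even though 1.4 fails badly.
import numpy as np

rng = np.random.default_rng(1)
n, k, reps = 1000, 2, 2000
beta = np.array([1.0, -0.5])

estimates = np.empty((reps, k))
for r in range(reps):
    X = rng.normal(size=(n, k))
    eps = rng.exponential(scale=1.0, size=n) - 1.0   # mean zero, skewness 2
    y = X @ beta + eps
    estimates[r], *_ = np.linalg.lstsq(X, y, rcond=None)

print(estimates.mean(axis=0))        # close to [1.0, -0.5]: consistency

z = np.sqrt(n) * (estimates - beta)  # scaled estimation error
skew = (z ** 3).mean(axis=0) / z.std(axis=0) ** 3
print(skew)                          # near 0: the scaled estimates are roughly
                                     # symmetric (normal-like), despite the
                                     # strongly skewed residuals
```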

In other words, there is no need to assume that the dependent variable, the independent variables or the residuals are normally distributed in linear regression. Why, then, do so many statisticians believe it to be so?

The first reason I can see is that it’s just easier. Conditions 1.1-1.4 are simpler to understand and require less mathematical knowledge than 2.1-2.5. In particular, you needn’t know anything about the central limit theorem, ergodic and stationary processes, or the different notions of convergence for random variables.

The other reason I can see is that if you assume the residuals are normally distributed, then linear regression falls into the generalised linear class of models. This is an extremely useful set of statistical models that includes workhorses such as logistic and Poisson regression. Non-linear models in this family are fitted by maximum likelihood - usually via the Newton-Raphson method, which results in an iteratively re-weighted least squares (IRLS) algorithm. It is conceptually and pedagogically neat to include all of these models under the same banner, because they have a great deal in common. However, because these models are fitted by maximum likelihood, explicit distributional assumptions must be made - normality in the case of linear regression. Bundling linear regression into the generalised linear class obscures the fact that there are alternative, weaker assumptions that are sufficient for the model to be valid.
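As a quick illustration of that bundling (the use of statsmodels here is my own choice, not something the post specifies), fitting linear regression as a Gaussian-family GLM via IRLS recovers the same coefficients as ordinary least squares:

```python
# Sketch: a Gaussian-family GLM with identity link, fitted by IRLS,
# gives the same coefficients as plain OLS. Data are made up.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.normal(size=(200, 2))
X = sm.add_constant(x)
y = X @ np.array([0.5, 1.0, -2.0]) + rng.normal(size=200)

ols = sm.OLS(y, X).fit()
glm = sm.GLM(y, X, family=sm.families.Gaussian()).fit()   # IRLS under the hood
print(np.allclose(ols.params, glm.params))                 # True
```

The coefficients agree because, for the Gaussian family with the identity link, maximising the likelihood is exactly minimising the sum of squared residuals. The normality assumption that the GLM framing makes explicit buys you the likelihood machinery, but - as the conditions above show - it is not what makes the estimates valid.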