Generalities
Ordinary Least Squares (OLS) is a regression algorithm for estimating the best linear predictor (BLP) \(f^*(X) = X\beta\) of a response \(Y \in \mathbb R\) in terms of a vector of regressors \(X\in \mathbb R ^p\), which we will frequently identify with a \(1\times p\) row matrix. Here, “best” is understood in terms of the \(L_2\) error:
\[ \beta = \arg \min _{\beta '} \mathbb E[(Y - X\beta ^\prime)^2]=\mathbb E (X^TX)^{-1}\mathbb E(X^T Y), \tag{1}\]
where the first equality is the definition, while the second follows from elementary calculus.
Given i.i.d. data \(\{(X_i,\,Y_i)\}_{i=1,\,2,\,\dots,\,N}\), and denoting by \(\mathbf Y\) and \(\mathbf X\) the \(N\times 1\) and \(N\times p\) matrices obtained by vertically stacking the observations of \(Y\) and \(X\), respectively, the OLS estimate of the coefficient in Equation 1 is defined by:
\[ \hat \beta = \arg \min _{\beta '} \frac{1}{N}\sum _{i=1} ^N (Y_i - X_i\beta ^\prime)^2 = (\mathbf X ^T \mathbf X) ^{-1} \mathbf X ^T \mathbf Y, \tag{2}\]
which is readily recognized to be the plugin estimate of \(\beta\). Correspondingly, we define the OLS predictor:
\[ \hat Y (x) = x \hat \beta. \tag{3}\]
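In code, Equation 2 and Equation 3 amount to a single least-squares solve. Here is a minimal NumPy sketch (simulated data; the helper names `ols_fit` and `ols_predict` are illustrative, not from any particular library):

```python
import numpy as np

def ols_fit(X, Y):
    """OLS coefficients of Equation 2; lstsq is numerically preferable
    to forming (X'X)^{-1} explicitly."""
    beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return beta_hat

def ols_predict(x, beta_hat):
    """OLS predictor of Equation 3: Y_hat(x) = x beta_hat."""
    return x @ beta_hat

# Simulated example: X stacks N rows of (1, Z) with a truly linear signal.
rng = np.random.default_rng(0)
N = 1000
Z = rng.normal(size=(N, 2))
X = np.column_stack([np.ones(N), Z])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=N)

beta_hat = ols_fit(X, Y)
Y_hat = ols_predict(X, beta_hat)
```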
What motivates the use of an \(L_2\) criterion in Equation 1?
Mathematical tractability. The fact that Equation 1 admits a closed-form solution, which is furthermore linear in the response variable \(Y\), greatly simplifies the analysis of the properties of \(\beta\) and of its estimators, such as the OLS estimator of Equation 2, making it a perfect case study.
Numerical tractability. A consequence of the previous point, but worth a separate mention. Computing the plugin estimate in Equation 2 is just a matter of basic linear algebra manipulations, which, with modern software libraries, is a relatively cheap operation.
Normal theory. If the conditional distribution of \(Y\) given \(X\) is normal with constant variance, and if \(\mathbb E(Y\vert X)\) is truly linear, the plugin estimate of Equation 2 coincides with the maximum likelihood estimate of \(\beta\).
The linear model
The term generally refers to a model for the conditional distribution of \(Y\vert X\) that requires the conditional mean \(\mathbb E(Y\vert X)\) to be a linear function of \(X\). In its most parsimonious form, this is just:
\[ Y = X \beta + \varepsilon, \quad \mathbb E(\varepsilon\vert X) = 0. \tag{4}\]
That said, depending on context, Equation 4 is usually supplemented with additional assumptions that further characterise the conditional distribution of the error term¹, typically (with increasing strength of assumptions; a small simulation sketch follows the list):
- Constant Variance. \(\mathbb V(\varepsilon \vert X) = \sigma ^2\), independently of \(X\).
- \(X\)-Independent Errors. \(\varepsilon \perp X\).
- Normal Errors. \(\varepsilon \vert X \sim \mathcal N (0,\sigma ^2)\).
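To make the hierarchy concrete, here is a small NumPy sketch (purely illustrative) generating errors at the three assumption levels just listed, together with a heteroskedastic case that still satisfies Equation 4 but violates even the weakest assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 500
Z = rng.normal(size=N)
X = np.column_stack([np.ones(N), Z])        # X = (1, Z), as in Equation 5
beta = np.array([1.0, 2.0])

# Normal errors (strongest): eps | X ~ N(0, sigma^2); this implies both
# independence from X and constant conditional variance.
eps_normal = rng.normal(scale=1.0, size=N)

# X-independent but non-normal errors: still constant variance.
eps_laplace = rng.laplace(scale=1.0, size=N)

# Errors with E(eps | X) = 0 but variance depending on X: Equation 4 holds,
# yet even the constant-variance assumption fails.
eps_hetero = rng.normal(size=N) * np.sqrt(1.0 + Z**2)

Y = X @ beta + eps_hetero                   # pick any of the three
```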
While the OLS estimator is well-defined irrespective of the validity of any of these models, it is clear that, in order for \(\hat \beta\) to represent a meaningful summary of the \(Y\)-\(X\) dependence, one should require at least Equation 4 to hold in some approximate sense. Correspondingly, while some general features of \(\hat \beta\) can be discussed independently of linear model assumptions, its most important properties crucially depend on Equation 4.
Properties
Algebraic properties of BLP and OLS estimates
Consider the BLP \(f^*(X) = X\beta\), with \(\beta\) defined as in Equation 1. We usually assume that the covariate vector \(X\) contains an intercept term, that is
\[ X=\begin{pmatrix}1 & Z\end{pmatrix},\quad Z\in \mathbb R^{p-1} \tag{5}\]
for some \(p-1\) dimensional random vector \(Z\). The presence of an intercept term endows \(f^*(X)\) with several convenient properties, such as being unconditionally unbiased (\(\mathbb E(Y-f^*(X))=0\)), as we show below.
Let us decompose \(\beta = \begin{pmatrix}a & b\end{pmatrix}\), where \(a\in \mathbb R\) is the intercept term, and \(b \in \mathbb R^{p-1}\) is the coefficient of \(Z\). We can easily prove that:
\[ \begin{split} b &= \mathbb V (Z)^{-1}\mathbb V(Z,Y),\\ a &= \mathbb E(Y)-\mathbb E(Z)b, \end{split} \tag{6}\]
where we denote by \(\mathbb V (A,B)\) the covariance matrix of \(A\) and \(B\), and set \(\mathbb V (A)\equiv \mathbb V (A,A)\). These expressions can be used to recast the error term as:
\[ Y-X\beta = (Y-\mathbb E(Y))-(Z-\mathbb E(Z))\mathbb V (Z)^{-1}\mathbb V(Z,Y). \tag{7}\]
From this expression we can easily find the first two moments:
\[ \begin{split} \mathbb E(Y-X\beta)&=0,\\ \mathbb E((Y-X\beta)^2)&=\mathbb V(Y)-\mathbb V(Z,Y)^T\mathbb V(Z)^{-1}\mathbb V(Z,Y), \end{split} \tag{8}\]
In particular, as anticipated, the best linear predictor is unconditionally unbiased. More generally, from Equation 7 one finds:
\[ \mathbb E(X^T(Y-X\beta))=0 \tag{9}\] which, since \(\mathbb E(Y-X\beta) = 0\), can also be interpreted as saying that the error term \(Y-X\beta\) and the covariate vector \(X\) are uncorrelated.
These properties translate directly into corresponding properties of the empirical residuals \(\mathbf Y-\mathbf X\hat \beta\). Notice that, up to now, no result depends on the particular probability measure to which expectations refer. In particular, we can take this measure to be the empirical distribution of a specific sample, which amounts to replacing all expectations with sample means. Thus, Equation 9 translates to:
\[ \frac{1}{N}\mathbf X^T(\mathbf Y-\mathbf X\hat \beta)=0, \tag{10}\]
which implies, in particular, that the sample residuals have zero sample mean.
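These empirical counterparts of Equation 6 and Equation 10 are easy to verify numerically. The following NumPy sketch (simulated data, illustrative only) checks that the OLS coefficients coincide with the sample-covariance formula, and that the residuals are orthogonal to the columns of \(\mathbf X\):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 2000
Z = rng.normal(size=(N, 2))
X = np.column_stack([np.ones(N), Z])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=N)

beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
resid = Y - X @ beta_hat

# Equation 10: residuals are orthogonal to every column of X ...
print(X.T @ resid / N)                        # ~ zero vector
# ... and, since X contains an intercept, they have zero sample mean.
print(resid.mean())                           # ~ 0

# Equation 6 on the empirical distribution: coefficients from sample moments.
# np.cov with rows = variables; ddof=0 gives plain sample (co)variances.
S = np.cov(np.column_stack([Z, Y]).T, ddof=0)
b = np.linalg.solve(S[:2, :2], S[:2, 2])      # V(Z)^{-1} V(Z, Y)
a = Y.mean() - Z.mean(axis=0) @ b             # E(Y) - E(Z) b
print(np.allclose(np.concatenate([[a], b]), beta_hat))   # True
```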
Distribution of coefficient estimates \(\hat \beta\)
As an estimator of \(\beta\) as defined in Equation 1, the OLS estimator \(\hat \beta\) is consistent (converges in probability to \(\beta\)) but generally biased (an example is provided here). However, due to the plugin nature of \(\hat \beta\), the bias is generally of order \(\mathcal O (N^{-1})\), which makes it often negligible in comparison to its \(\mathcal O(N^{-1/2})\) sampling variability (see below).
Explicitly, the bias is given by:
\[ \mathbb E(\hat \beta) - \beta = \mathbb E\lbrace((\mathbf X ^T \mathbf X) ^{-1}-\mathbb E[(\mathbf X ^T \mathbf X) ^{-1}])\cdot \mathbf X ^T f (\mathbf X)\rbrace, \tag{11}\]
where \(f(X) \equiv \mathbb E(Y \vert X)\) is the true conditional mean function. In general, this vanishes only if \(f(X)=X\beta\), as in the linear expectation model Equation 4.
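As a quick Monte Carlo check of this claim, consider a hypothetical quadratic example (not taken from the post): regressing \(Y=Z^2\) on \((1,Z)\) with \(Z\sim\text{Uniform}(0,1)\), for which the BLP coefficients of Equation 1 can be computed in closed form, and comparing the average of \(\hat\beta\) over many replications with \(\beta\):

```python
import numpy as np

rng = np.random.default_rng(3)
# Population BLP (Equation 1) for Y = Z^2 with Z ~ Uniform(0, 1):
# b = Cov(Z, Z^2) / V(Z) = 1,  a = E(Z^2) - E(Z) * b = -1/6.
beta = np.array([-1.0 / 6.0, 1.0])

def mc_bias(N, reps=50_000):
    """Monte Carlo estimate of E(beta_hat) - beta for sample size N."""
    est = np.empty((reps, 2))
    for r in range(reps):
        Z = rng.uniform(size=N)
        X = np.column_stack([np.ones(N), Z])
        est[r] = np.linalg.lstsq(X, Z**2, rcond=None)[0]
    return est.mean(axis=0) - beta

# The bias is nonzero (the conditional mean is not linear) and shrinks
# roughly like 1/N, much faster than the O(N^{-1/2}) sampling error.
print(mc_bias(10), mc_bias(40))
```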
The \(\mathbf X\)-conditional variance of \(\hat \beta\) can be derived directly from Equation 2:
\[ \mathbb V (\hat \beta \vert \mathbf X)=(\mathbf X ^T \mathbf X) ^{-1} \mathbf X ^T \mathbb V (\mathbf Y\vert \mathbf X) \mathbf X (\mathbf X ^T \mathbf X)^{-1}, \tag{12}\]
where \(\mathbb V (\mathbf Y\vert \mathbf X)\) is diagonal for i.i.d. observations. For homoskedastic errors we get:
\[ \mathbb V (\hat \beta \vert \mathbf X)=(\mathbf X ^T \mathbf X) ^{-1} \sigma ^2 \quad (\mathbb V(Y\vert X)=\sigma ^2). \tag{13}\]
Under the normal linear model, this allows one to construct finite-sample exact confidence sets for \(\beta\). In the general case, confidence sets can be derived from the Central Limit Theorem satisfied by \(\hat \beta\) (Buja et al. 2019):
\[ \sqrt N (\hat \beta -\beta) \to \mathcal N (0,V ) \tag{14}\]
where the asymptotic variance is given by:
\[ V = \mathbb E[X^TX] ^{-1} \cdot \mathbb E[X^T(Y-X\beta)^2X] \cdot \mathbb E[X^TX] ^{-1}. \tag{15}\]
The plugin estimate of Equation 15, divided by \(N\) so as to estimate \(\mathbb V(\hat \beta)\) rather than the asymptotic variance \(V\), is the so-called sandwich variance estimator:
\[ V_\text{sand} \equiv (\mathbf X^T \mathbf X)^{-1} \mathbf X^T D_\mathbf {r^2}\mathbf X (\mathbf X^T \mathbf X)^{-1}, \tag{16}\]
where \(D_{\mathbf r^2}\) is the diagonal matrix whose \(i\)-th entry is the squared residual \(r_i ^2 = (Y_i-X_i\hat \beta)^2\).
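A NumPy sketch of Equation 16 (simulated heteroskedastic data, illustrative only), compared with the homoskedasticity-based formula of Equation 13 with \(\sigma^2\) replaced by the usual residual variance estimate:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 2000
Z = rng.normal(size=N)
X = np.column_stack([np.ones(N), Z])
# Linear conditional mean, but V(Y|X) depends on Z (heteroskedasticity).
Y = X @ np.array([1.0, 2.0]) + rng.normal(size=N) * np.sqrt(0.5 + Z**2)

beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
r = Y - X @ beta_hat
XtX_inv = np.linalg.inv(X.T @ X)

# Equation 16: (X'X)^{-1} X' D_{r^2} X (X'X)^{-1};
# note that X.T * r**2 equals X^T D_{r^2}.
V_sand = XtX_inv @ (X.T * r**2) @ X @ XtX_inv
# Classical estimate: Equation 13 with sigma^2 estimated from the residuals.
V_homo = XtX_inv * (r @ r / (N - X.shape[1]))

print(np.sqrt(np.diag(V_sand)))  # heteroskedasticity-robust standard errors
print(np.sqrt(np.diag(V_homo)))  # classical; understates slope uncertainty here
```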
Variance estimates
Estimates of the error variance can be based on:
\[ s ^2 = \frac{1}{N}(\mathbf Y-\mathbf X \hat \beta)^T(\mathbf Y-\mathbf X \hat \beta) \] which has expectation
\[ \mathbb E\left[s^2\right]=\frac{1}{N}\text {Tr}\,\mathbb E \left[ (1-\mathbf H) \cdot \left(\mathbb E(\mathbf Y \vert \mathbf X)\mathbb E(\mathbf Y \vert \mathbf X)^T + \mathbb V(\mathbf Y\vert\mathbf X)\right) \right] \tag{17}\]
where the hat matrix \(\mathbf H\) is defined as usual:
\[ \mathbf H \equiv \mathbf X(\mathbf X^T\mathbf X)^{-1}\mathbf X^T \tag{18}\]
If the general linear model Equation 4 holds, so that \(\mathbb E(\mathbf Y \vert \mathbf X) = \mathbf X \beta\), we have \((1-\mathbf H) \mathbb E(\mathbf Y \vert \mathbf X) = 0\). If we furthermore assume homoskedasticity, we obtain:
\[ \mathbb E\left[s^2\right]=\frac{N-p}{N}\sigma^2 \quad(\text{Homoskedastic linear model}), \]
where \(p = \text {Tr}(\mathbf H)\) is the number of linearly independent covariates. On the other hand, if homoskedasticity holds but \(\mathbb E(\mathbf Y \vert \mathbf X)\) is not linear, \(\mathbb E\left[s^2\right]\) exceeds \(\frac{N-p}{N}\sigma^2\), so that variance estimates based on \(s^2\) tend to overestimate \(\sigma ^2 = \mathbb V(Y\vert X)\).
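A simulation sketch of these two regimes (NumPy, illustrative; the nonlinear mean \(\sin 3Z\) is a hypothetical choice): with a fixed design \(\mathbf X\), the average of \(s^2\) over replications is close to \(\frac{N-p}{N}\sigma^2\) when the mean is linear, and exceeds it otherwise.

```python
import numpy as np

rng = np.random.default_rng(5)
N, sigma = 200, 1.0
Z = rng.uniform(-1, 1, size=N)
X = np.column_stack([np.ones(N), Z])
H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix, Equation 18
p = np.trace(H)                               # = 2, the number of columns of X

def s2(Y):
    resid = Y - H @ Y                         # (I - H) Y
    return resid @ resid / N

reps = 5000
# Homoskedastic linear mean: E[s^2] = (N - p)/N * sigma^2.
s2_lin = np.mean([s2(X @ np.array([1.0, 2.0]) + sigma * rng.normal(size=N))
                  for _ in range(reps)])
# Homoskedastic but nonlinear mean: (I - H) E(Y|X) != 0 inflates E[s^2].
s2_nonlin = np.mean([s2(np.sin(3 * Z) + sigma * rng.normal(size=N))
                     for _ in range(reps)])

print((N - p) / N * sigma**2, s2_lin, s2_nonlin)
```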
Measurement error
Suppose that, rather than the variables of interest \(Y\) and \(X\), we can only observe noisy versions:
\[ \widetilde Y = Y+\delta_Y,\qquad \widetilde Z=Z+\delta _Z \tag{19}\]
where \(X=\begin{pmatrix}1 & Z\end{pmatrix}\) as in Equation 5. We are truly interested in \(\beta\) defined by Equation 1, but OLS will actually estimate \(\widetilde \beta =\begin{pmatrix}\widetilde a & \widetilde b\end{pmatrix}\), the coefficient of the BLP of \(\widetilde Y\) in terms of \(\widetilde X=\begin{pmatrix}1 & \widetilde Z\end{pmatrix}\). Focusing on the slope term \(\widetilde b\), a comparison with Equation 6 yields:
\[ \begin{split} \widetilde b -b&= \Delta_{\mathbb V(Z)^{-1}}\cdot\mathbb V(Z,Y) +\mathbb V(Z)^{-1} \cdot \Delta _{\mathbb V(Z,Y)}+\Delta_{\mathbb V(Z)^{-1}}\cdot\Delta _{\mathbb V(Z,Y)},\\ \Delta_{\mathbb V(Z)^{-1}}&=\mathbb V (\widetilde Z)^{-1}-\mathbb V(Z)^{-1}, \\ \Delta _{\mathbb V(Z,Y)} &=\mathbb V(\widetilde Z,\widetilde Y)-\mathbb V(Z,Y). \end{split} \tag{20}\]
where, explicitly:
\[ \begin{split} \mathbb V(\widetilde Z)&=\mathbb V(Z)+\mathbb V(\delta _Z)+2\mathbb V(Z,\delta _Z),\\ \mathbb V(\widetilde Z,\widetilde Y)&=\mathbb V(Z,Y)+\mathbb V(\delta _Z,Y)+\mathbb V(\delta _Y,Z)+\mathbb V(\delta _Y,\delta _Z). \end{split} \tag{21}\]
The effect on the slope estimate generally depends on the correlation structure of signal and noise. In the special case of “totally random” noise, that is, if \(\delta _Z\) and \(\delta _Y\) are independent of each other and of \(Z\) and \(Y\), the only effect is the increase in covariate variance, \(\mathbb V(\widetilde Z) = \mathbb V(Z)+ \mathbb V(\delta _Z)\), which shrinks the slope coefficient \(\widetilde b\) towards zero. In particular, if random noise is only present in the response (\(\delta _Z = 0\)), the estimation target is unaffected (although the variance of the estimator \(\hat b\) generally is).
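The attenuation effect is easy to see in a simulation (NumPy sketch, one-dimensional \(Z\) for simplicity, with unit-variance noise):

```python
import numpy as np

rng = np.random.default_rng(6)
N = 100_000
Z = rng.normal(size=N)                       # V(Z) = 1
Y = 1.0 + 2.0 * Z + rng.normal(size=N)       # true slope b = 2

def slope(z, y):
    X = np.column_stack([np.ones_like(z), z])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

print(slope(Z, Y))                           # ~ 2: no measurement error
# "Totally random" noise in the covariate, V(delta_Z) = 1, so
# b_tilde = b * V(Z) / (V(Z) + V(delta_Z)) = 1: attenuation towards zero.
print(slope(Z + rng.normal(size=N), Y))      # ~ 1
# Random noise in the response only: the target slope is unchanged.
print(slope(Z, Y + rng.normal(size=N)))      # ~ 2
```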
Proofs
Proof of Central Limit Theorem for \(\hat \beta\)
Convergence to the normal distribution Equation 14 with variance Equation 15 can be proved using the formalism of influence functions. From Equation 1, we see that a small variation \(P\to P+\delta P\) to the joint \(XY\) probability measure induces a first order shift:
\[ \delta \beta = \intop \mathbb E(X^TX)^{-1}X^T (Y-X\beta)\, \text d(\delta P) \tag{22}\]
in the best linear predictor. The influence function of \(\beta\) is the integrand appearing in this representation of \(\delta \beta\), namely:
\[ \phi _\beta = \mathbb E(X^TX)^{-1}X^T (Y-X\beta). \tag{23}\]
A general result for plugin estimates then tells us that \(\sqrt N (\hat \beta -\beta) \to \mathcal N (0, \mathbb E (\phi _\beta \phi _\beta ^T))\) in distribution, and using the explicit form of Equation 23 we readily obtain Equation 15.
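Spelling out the last step: since \(Y-X\beta\) is a scalar and \(\mathbb E(X^TX)^{-1}\) is a constant matrix,
\[ \mathbb E\left(\phi _\beta \phi _\beta ^T\right) = \mathbb E[X^TX]^{-1}\, \mathbb E\!\left[X^T(Y-X\beta)^2X\right]\, \mathbb E[X^TX]^{-1} = V, \]
which is exactly Equation 15.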
References
Footnotes
1. I use the notation \(A\perp B\) to indicate that the random variables \(A\) and \(B\) are statistically independent.
Citation
@online{gherardi2024,
author = {Gherardi, Valerio},
title = {Ordinary {Least} {Squares}},
date = {2024-02-07},
url = {https://vgherard.github.io/notebooks/ordinary-least-squares/},
langid = {en}
}