Ordinary Least Squares

Valerio Gherardi https://vgherard.github.io
2024-02-07

1 Generalities

Ordinary Least Squares (OLS) is a regression algorithm for estimating the best linear predictor (BLP) \(f^*(X) = X\beta\) of a response \(Y \in \mathbb R\) in terms of a vector of regressors \(X\in \mathbb R ^p\), which we will frequently identify with a \(1\times p\) row matrix. Here, “best” is understood in terms of the \(L_2\) error:

\[ \beta = \arg \min _{\beta '} \mathbb E[(Y - X\beta ^\prime)^2]=\mathbb E (X^TX)^{-1}\mathbb E(X^T Y), \tag{1} \] where the first equality is the definition, while the second one follows from elementary calculus.

Given i.i.d. data \(\{(X_i,\,Y_i)\}_{i=1,\,2,\,\dots,\,N}\), and denoting by \(\mathbf Y\) and \(\mathbf X\) the \(N\times 1\) and \(N\times p\) matrices obtained by vertically stacking independent observations of \(Y\) and \(X\), respectively, the OLS estimate of (1) is defined by:

\[ \hat \beta = \arg \min _{\beta '} \sum _{i=1} ^N \frac{1}{N}(Y_i - X_i\beta ^\prime)^2 = (\mathbf X ^T \mathbf X) ^{-1} \mathbf X ^T \mathbf Y, \tag{2} \] which is readily recognized to be the plugin estimate of \(\beta\). Correspondingly, we define the OLS predictor:

\[ \hat Y (x) = x \hat \beta. \tag{3} \]
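
For concreteness, here is a minimal NumPy sketch of Eqs. (2) and (3); the simulated data, coefficient values and variable names are my own choices, and the normal equations are solved with `np.linalg.solve` rather than an explicit matrix inverse.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated data: N observations of a response Y and a design matrix X
# whose first column is an intercept (cf. Eq. (5)).
N = 500
Z = rng.normal(size=(N, 2))
X = np.column_stack([np.ones(N), Z])                 # N x p design matrix
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=N)

# OLS estimate, Eq. (2): solve (X'X) beta_hat = X'Y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# OLS predictor, Eq. (3): Y_hat(x) = x beta_hat, here evaluated on the sample.
Y_hat = X @ beta_hat

print(beta_hat)   # should be close to (1, 2, -0.5)
```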

What motivates the use of an \(L_2\) criterion in (1)?

  1. Mathematical tractability. The fact that (1) admits a closed form solution, which is furthermore linear in the response variable \(Y\), greatly simplifies the analysis of the properties of \(\beta\) and of its estimators, such as the OLS estimate (2), making it a perfect case study.

  2. Numerical tractability. A consequence of the previous point, but worth a separate mention. Computing the plugin estimate in (2) is just a matter of basic linear algebra manipulations, which, with modern software libraries, is a relatively cheap operation.

  3. Normal theory. Turning to the plugin estimate (2) of Eq. (1): if the conditional distribution of \(Y\) given \(X\) is normal with constant variance, and if \(\mathbb E(Y\vert X)\) is truly linear, then \(\hat \beta\) coincides with the maximum likelihood estimate of \(\beta\) (a numerical check is sketched below).
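
As a quick numerical check of point 3 (a sketch of mine, on simulated data with normal homoskedastic errors and a truly linear mean), maximizing the Gaussian likelihood with a generic optimizer recovers the closed-form OLS coefficients up to numerical tolerance:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=N)])
Y = X @ np.array([0.5, 2.0]) + rng.normal(size=N)    # normal, homoskedastic errors

beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)

def neg_log_lik(theta):
    # theta = (beta_0, beta_1, log_sigma): Gaussian likelihood, constant variance.
    beta, log_sigma = theta[:-1], theta[-1]
    resid = Y - X @ beta
    sigma2 = np.exp(2 * log_sigma)
    return 0.5 * N * np.log(2 * np.pi * sigma2) + 0.5 * resid @ resid / sigma2

beta_mle = minimize(neg_log_lik, np.zeros(3), method="BFGS").x[:-1]

print(np.allclose(beta_ols, beta_mle, atol=1e-3))    # expected: True
```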

1.1 The linear model

The term generally refers to a model for the conditional distribution of \(Y\vert X\) that requires the conditional mean \(\mathbb E(Y\vert X)\) to be a linear function of \(X\). In its most parsimonious form, this is just:

\[ Y = X \beta + \varepsilon, \quad \mathbb E(\varepsilon\vert X) = 0.\tag{4} \] That said, depending on context, (4) is usually supplemented with additional assumptions that further characterise the conditional distribution of the error term¹, typically (with increasing strength of assumptions):

  * homoskedasticity: \(\mathbb V(\varepsilon \vert X) = \sigma ^2\) is constant;
  * independence: \(\varepsilon \perp X\);
  * normality: \(\varepsilon \sim \mathcal N(0,\sigma ^2)\) with \(\varepsilon \perp X\) (the normal linear model).

While the OLS estimator is well-defined irrespective of the validity of any of these models, it is clear that, in order for \(\hat \beta\) to represent a meaningful summary of the \(Y\)-\(X\) dependence, one should require at least (4) to hold in some approximate sense. Correspondingly, while some general features of \(\hat \beta\) can be discussed independently of linear model assumptions, its most important properties crucially depend on Eq. (4).

2 Properties

2.1 Algebraic properties of BLP and OLS estimates

Consider the BLP \(f^*(X) = X\beta\), with \(\beta\) defined as in Eq. (1). We usually assume that the covariate vector \(X\) contains an intercept term, that is \[ X=\begin{pmatrix}1 & Z\end{pmatrix},\quad Z\in \mathbb R^{p-1} \tag{5} \] for some \(p-1\) dimensional random vector \(Z\). The presence of an intercept term endows \(f^*(X)\) with several useful properties, such as being unconditionally unbiased (\(\mathbb E(Y-f^*(X))=0\)), as we show below.

Let us decompose \(\beta = \begin{pmatrix}a & b\end{pmatrix}\), where \(a\in \mathbb R\) is the intercept term, and \(b \in \mathbb R^{p-1}\) is the coefficient of \(Z\). We can easily prove that:

\[ \begin{split} b &= \mathbb V (Z)^{-1}\mathbb V(Z,Y),\\ a &= \mathbb E(Y)-\mathbb E(Z)b, \end{split}\tag{6} \] where \(\mathbb V (A,B)\) denotes the covariance matrix of \(A\) and \(B\), and \(\mathbb V (A)\equiv \mathbb V (A,A)\). These expressions can be used to recast the error term as:

\[ Y-X\beta = (Y-\mathbb E(Y))-(Z-\mathbb E(Z))\mathbb V (Z)^{-1}\mathbb V(Z,Y)\tag{7}. \]

From this expression we can easily find the first two moments:

\[ \begin{split} \mathbb E(Y-X\beta)&=0,\\ \mathbb E((Y-X\beta)^2)&=\mathbb V(Y)-\mathbb V(Z,Y)^T\mathbb V(Z)^{-1}\mathbb V(Z,Y), \end{split}\tag{8} \]

In particular, as anticipated, the best linear predictor is unconditionally unbiased. More generally, from Eq. (7) it follows that:

\[ \mathbb E(X^T(Y-X\beta))=0\tag{9}, \] which, since \(\mathbb E(Y-X\beta) = 0\), can also be interpreted as saying that the error term \(Y-X\beta\) and the covariate vector \(X\) are uncorrelated.

These properties directly translate to corresponding properties of the empirical residuals \(\mathbf Y-\mathbf X\hat \beta\). Notice that, up to now, no result depends on the particular probability measure to which expectations refer. In particular, we can choose this measure to be the empirical distribution realized in a specific sample, which amounts to replacing all expectations with sample means. Thus, Eq. (9) translates to:

\[ \frac{1}{N}\mathbf X^T(\mathbf Y-\mathbf X\hat \beta)=0\tag{10}, \] which implies, in particular, that the sample residuals have vanishing sample mean (thanks to the intercept column of \(\mathbf X\)).
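
The sketch below (simulated data and variable names are mine) checks the empirical counterparts of Eq. (6) and Eq. (10): the OLS coefficients can be recomputed from sample means and covariances, and the sample residuals are orthogonal to every column of \(\mathbf X\).

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1000
Z = rng.normal(size=(N, 2))                          # covariates without intercept
X = np.column_stack([np.ones(N), Z])                 # add the intercept, Eq. (5)
Y = 1.0 + Z @ np.array([2.0, -1.0]) + rng.normal(size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Eq. (6) under the empirical measure: slope from sample covariances,
# intercept from sample means.
V_Z = np.cov(Z, rowvar=False, bias=True)             # sample V(Z)
V_ZY = np.cov(Z.T, Y, bias=True)[:-1, -1]            # sample V(Z, Y)
b = np.linalg.solve(V_Z, V_ZY)
a = Y.mean() - Z.mean(axis=0) @ b
print(np.allclose(beta_hat, np.concatenate([[a], b])))   # expected: True

# Eq. (10): sample residuals are orthogonal to X; the intercept column makes
# their sample mean vanish.
residuals = Y - X @ beta_hat
print(np.allclose(X.T @ residuals / N, 0.0))             # expected: True
```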

2.2 Distribution of coefficient estimates \(\hat \beta\)

As an estimator of \(\beta\) as defined in Eq. (1), the OLS estimator \(\hat \beta\) is consistent (it converges in probability to \(\beta\)) but generally biased (an example is provided here). However, due to the plugin nature of \(\hat \beta\), the bias is generally of order \(\mathcal O (N^{-1})\), which often makes it negligible in comparison to the \(\mathcal O(N^{-1/2})\) sampling variability (see below).

Explicitly, the bias is given by: \[ \mathbb E(\hat \beta) - \beta = \mathbb E\lbrace((\mathbf X ^T \mathbf X) ^{-1}-\mathbb E[(\mathbf X ^T \mathbf X) ^{-1}])\cdot \mathbf X ^T f (\mathbf X)\rbrace, \tag{11} \] where \(f(X) \equiv \mathbb E(Y \vert X)\) is the true conditional mean function. In general, this vanishes only if \(f(X)=X\beta\), as in the linear expectation model (4).
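
To illustrate, here is a small Monte Carlo sketch (my own construction) with a deliberately nonlinear conditional mean, \(f(Z) = Z^2\) with \(Z \sim \mathrm{Exp}(1)\), for which the BLP coefficients of Eq. (1) can be computed in closed form (\(b = 4\), \(a = -2\)):

```python
import numpy as np

rng = np.random.default_rng(2)

# Nonlinear truth: f(Z) = Z^2 with Z ~ Exponential(1).  The BLP of Y on
# (1, Z) then has slope b = Cov(Z, Z^2) / V(Z) = 4 and intercept a = -2.
beta_blp = np.array([-2.0, 4.0])

def ols(N):
    Z = rng.exponential(size=N)
    Y = Z ** 2 + 0.1 * rng.normal(size=N)
    X = np.column_stack([np.ones(N), Z])
    return np.linalg.solve(X.T @ X, X.T @ Y)

# Average beta_hat over many small samples: the Monte Carlo mean differs
# from the BLP coefficients by a small O(1/N) amount.
reps, N = 100_000, 10
beta_bar = np.mean([ols(N) for _ in range(reps)], axis=0)
print(beta_bar - beta_blp)   # small but nonzero bias expected
```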

The \(\mathbf X\)-conditional variance of \(\hat \beta\) can be derived directly from Eq. (2):

\[ \mathbb V (\hat \beta \vert \mathbf X)=(\mathbf X ^T \mathbf X) ^{-1} \mathbf X ^T \mathbb V (\mathbf Y\vert \mathbf X) \mathbf X (\mathbf X ^T \mathbf X)^{-1}, \tag{12} \] where \(\mathbb V (\mathbf Y\vert \mathbf X)\) is diagonal for i.i.d. observations. For homoskedastic errors we get:

\[ \mathbb V (\hat \beta \vert \mathbf X)=(\mathbf X ^T \mathbf X) ^{-1} \sigma ^2 \quad (\mathbb V(Y\vert X)=\sigma ^2). \tag{13} \] Under the normal linear model, this allows one to obtain finite-sample correct confidence sets for \(\beta\). In the general case, confidence sets can be derived from the Central Limit Theorem satisfied by \(\hat \beta\) (Buja et al. 2019):

\[ \sqrt N (\hat \beta -\beta) \to \mathcal N (0,V ) \tag{14} \] where the asymptotic variance is given by:

\[ V = \mathbb E[X^TX] ^{-1} \cdot \mathbb E[X^T(Y-X\beta)^2X] \cdot \mathbb E[X^TX] ^{-1}.\tag{15} \] The plugin estimate of (15) leads to the so-called sandwich variance estimator:

\[ V_\text{sand} \equiv (\mathbf X^T \mathbf X)^{-1} \mathbf X^T D_\mathbf {r^2}\mathbf X (\mathbf X^T \mathbf X)^{-1},\tag{16} \] where \(D_{\mathbf r^2}\) is the diagonal matrix whose \(i\)-th entry is the squared residual \(r_i ^2 = (Y_i-X_i\hat \beta)^2\).
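
A minimal sketch of Eqs. (13) and (16) on simulated heteroskedastic data (the data-generating process and variable names are mine; `np.linalg.inv` is used for clarity rather than numerical robustness):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 2000
Z = rng.normal(size=N)
X = np.column_stack([np.ones(N), Z])
# Heteroskedastic errors: the noise scale grows with |Z|.
Y = X @ np.array([1.0, 2.0]) + (0.5 + np.abs(Z)) * rng.normal(size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
r = Y - X @ beta_hat                              # residuals

XtX_inv = np.linalg.inv(X.T @ X)

# Sandwich estimate, Eq. (16): the "meat" uses the squared residuals.
meat = X.T @ (r[:, None] ** 2 * X)                # X^T diag(r^2) X
V_sand = XtX_inv @ meat @ XtX_inv

# Homoskedastic formula, Eq. (13), with a residual-based estimate of sigma^2.
V_homo = XtX_inv * (r @ r / (N - X.shape[1]))

print(np.sqrt(np.diag(V_sand)))   # robust standard errors
print(np.sqrt(np.diag(V_homo)))   # typically too small for the slope here
```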

2.3 Variance estimates

Estimates of the error variance can be based on the mean squared residual:

\[ s ^2 = \frac{1}{N}(\mathbf Y-\mathbf X \hat \beta)^T(\mathbf Y-\mathbf X \hat \beta) \] which has expectation

\[ \mathbb E\left[s^2\right]=\frac{1}{N}\text {Tr}\,\mathbb E \left[ (1-\mathbf H) \cdot \left(\mathbb E(\mathbf Y \vert \mathbf X)\mathbb E(\mathbf Y \vert \mathbf X)^T + \mathbb V(\mathbf Y\vert\mathbf X)\right) \right] \tag{17} \]

where the hat matrix \(\mathbf H\) is defined as usual:

\[ \mathbf H \equiv \mathbf X(\mathbf X^T\mathbf X)^{-1}\mathbf X^T\tag{18} \]

If the general linear model (4) holds, so that \(\mathbb E(\mathbf Y \vert \mathbf X) = \mathbf X \beta\), we have \((1-\mathbf H) \mathbb E(\mathbf Y \vert \mathbf X) = 0\). If we furthermore assume homoskedasticity, we obtain:

\[ \mathbb E\left[s^2\right]=\frac{N-p}{N}\sigma^2 \quad(\text{Homoskedastic linear model}), \tag{19} \] where \(p = \text {Tr}(\mathbf H)\) is the number of independent covariates. On the other hand, if homoskedasticity holds but \(\mathbb E(\mathbf Y \vert \mathbf X)\) is not linear, the left-hand side of the previous equation is an overestimate of \(\frac{N-p}{N}\sigma^2\), since the term of Eq. (17) involving the conditional mean is non-negative.
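
The following sketch (my own simulation of a homoskedastic linear model) checks Eq. (19) and the familiar \(N/(N-p)\) bias correction:

```python
import numpy as np

rng = np.random.default_rng(4)
sigma, N, p, reps = 1.5, 30, 3, 20_000

def s2_once():
    X = np.column_stack([np.ones(N), rng.normal(size=(N, p - 1))])
    Y = X @ np.array([1.0, -2.0, 0.5]) + sigma * rng.normal(size=N)
    r = Y - X @ np.linalg.solve(X.T @ X, X.T @ Y)
    return r @ r / N                               # s^2, with the 1/N convention

s2 = np.array([s2_once() for _ in range(reps)])
print(s2.mean(), (N - p) / N * sigma ** 2)         # should be close, Eq. (19)
print((N / (N - p)) * s2.mean(), sigma ** 2)       # bias-corrected estimate
```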

3 Measurement error

Suppose that, rather than the variables of interest \(Y\) and \(X\), we can only observe noisy versions:

\[ \widetilde Y = Y+\delta_Y,\qquad \widetilde Z=Z+\delta _Z\tag{20} \]

where \(X=\begin{pmatrix}1 & Z\end{pmatrix}\) as in Eq. (5). We are truly interested in \(\beta\) defined by Eq. (1), but OLS will actually estimate \(\widetilde \beta =\begin{pmatrix}\widetilde a & \widetilde b\end{pmatrix}\), the coefficients of the BLP of \(\widetilde Y\) in terms of \(\widetilde X = \begin{pmatrix}1 & \widetilde Z\end{pmatrix}\). Focusing on the slope term \(\widetilde b\), a comparison with Eq. (6) yields:

\[ \begin{split} \widetilde b -b&= \Delta_{\mathbb V(Z)^{-1}}\cdot\mathbb V(Z,Y) +\mathbb V(Z)^{-1} \cdot \Delta _{\mathbb V(Z,Y)}+\Delta_{\mathbb V(Z)^{-1}}\cdot\Delta _{\mathbb V(Z,Y)},\\ \Delta_{\mathbb V(Z)^{-1}}&=\mathbb V (\widetilde Z)^{-1}-\mathbb V(Z)^{-1}, \\ \Delta _{\mathbb V(Z,Y)} &=\mathbb V(\widetilde Z,\widetilde Y)-\mathbb V(Z,Y), \end{split}\tag{21} \]

where, explicitly:

\[ \begin{split} \mathbb V(\widetilde Z)&=\mathbb V(Z)+\mathbb V(\delta _Z)+\mathbb V(Z,\delta _Z)+\mathbb V(\delta _Z,Z),\\ \mathbb V(\widetilde Z,\widetilde Y)&=\mathbb V(Z,Y)+\mathbb V(\delta _Z,Y)+\mathbb V(Z,\delta _Y)+\mathbb V(\delta _Z,\delta _Y). \end{split}\tag{22} \]

The effect on the slope estimate generally depends on the correlation structure of signal and noise. In the special case of “totally random” noise, that is, if \(\delta _Z\) and \(\delta _Y\) are independent of each other and of \(Z\) and \(Y\), the only effect is the increase of the covariate variance, \(\mathbb V(\widetilde Z) = \mathbb V(Z)+ \mathbb V(\delta _Z)\), which shrinks the slope coefficient \(\widetilde b\) towards zero. In particular, if random noise is only present in the response (\(\delta _Z = 0\)), the estimation target is unaffected (although the variance of the estimator \(\hat b\) generally increases).
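
A quick simulation (my own sketch: scalar covariate, independent noise on \(Z\) only) displays the shrinkage: the slope estimated from the noisy covariate approaches \(b\,\mathbb V(Z)/(\mathbb V(Z)+\mathbb V(\delta_Z))\), consistently with Eqs. (21)-(22).

```python
import numpy as np

rng = np.random.default_rng(5)
N = 100_000
b, var_Z, var_noise = 2.0, 1.0, 0.5

Z = rng.normal(scale=np.sqrt(var_Z), size=N)
Y = 1.0 + b * Z + rng.normal(size=N)
Z_tilde = Z + rng.normal(scale=np.sqrt(var_noise), size=N)   # noisy covariate

def ols_slope(z, y):
    X = np.column_stack([np.ones_like(z), z])
    return np.linalg.solve(X.T @ X, X.T @ y)[1]

print(ols_slope(Z, Y))            # close to b = 2
print(ols_slope(Z_tilde, Y))      # close to b * var_Z / (var_Z + var_noise) ~ 1.33
print(b * var_Z / (var_Z + var_noise))
```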

4 Proofs

4.1 Proof of Central Limit Theorem for \(\hat \beta\)

Convergence to the normal distribution (14) with variance (15) can be proved using the formalism of influence functions. From Eq. (1), we see that a small variation \(P\to P+\delta P\) to the joint \(XY\) probability measure induces a first order shift:

\[ \delta \beta = \intop \mathbb E(X^TX)^{-1}X^T (Y-X\beta) \text d(\delta P)\tag{23} \] in the best linear predictor. The influence function of \(\beta\) is defined by the measurable representation of \(\delta \beta\), namely:

\[ \phi _\beta = \mathbb E(X^TX)^{-1}X^T (Y-X\beta). \tag{24} \] A general result for plugin estimates then tells us that \(\sqrt N (\hat \beta -\beta) \to \mathcal N (0, \mathbb E (\phi _\beta \phi _\beta ^T))\) in distribution, and using the explicit form (24) we readily obtain (15).
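
As a numerical cross-check (my own sketch) of Eqs. (14)-(15), using a quadratic toy model \(f(Z)=Z^2\) with \(Z\sim\mathrm{Uniform}(0,1)\) (for which \(\beta = (-1/6,\ 1)\)), the Monte Carlo covariance of \(\sqrt N(\hat\beta - \beta)\) can be compared with a large-sample plugin approximation of \(V\):

```python
import numpy as np

rng = np.random.default_rng(6)
beta = np.array([-1 / 6, 1.0])    # BLP of Y = Z^2 + noise on (1, Z), Z ~ U(0, 1)

def draw(N):
    Z = rng.uniform(size=N)
    Y = Z ** 2 + 0.1 * rng.normal(size=N)
    return np.column_stack([np.ones(N), Z]), Y

def ols(X, Y):
    return np.linalg.solve(X.T @ X, X.T @ Y)

# Monte Carlo covariance of sqrt(N) (beta_hat - beta).
reps, N = 20_000, 200
estimates = np.array([ols(*draw(N)) for _ in range(reps)])
V_mc = N * np.cov(estimates, rowvar=False)

# Plugin approximation of Eq. (15) from a single very large sample.
X, Y = draw(2_000_000)
bread = np.linalg.inv(X.T @ X / len(Y))              # E[X^T X]^{-1}
r2 = (Y - X @ beta) ** 2
meat = X.T @ (r2[:, None] * X) / len(Y)              # E[X^T (Y - X beta)^2 X]
V_asym = bread @ meat @ bread

print(V_mc)
print(V_asym)   # the two matrices should roughly agree
```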

  1. I use the notation \(A\perp B\) to indicate that the random variables \(A\) and \(B\) are statistically independent.↩︎

References

Buja, Andreas, Richard Berk, Lawrence Brown, Edward George, Emil Pitkin, Mikhail Traskin, Linda Zhao, and Kai Zhang. 2019. “Models as Approximations I: Consequences Illustrated with Linear Regression.” https://arxiv.org/abs/1404.1578.

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-SA 4.0. Source code is available at https://github.com/vgherard/vgherard.github.io/, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Gherardi (2024, Feb. 7). vgherard: Ordinary Least Squares. Retrieved from https://vgherard.github.io/notebooks/ordinary-least-squares/

BibTeX citation

@misc{gherardi2024ordinary,
  author = {Gherardi, Valerio},
  title = {vgherard: Ordinary Least Squares},
  url = {https://vgherard.github.io/notebooks/ordinary-least-squares/},
  year = {2024}
}