Given random variables \(Y\colon \Omega \to \mathbb R\) and \(X\colon \Omega \to \mathbb R ^{p}\) defined on a common probability space \(\Omega\), denote:
\[ \beta = \arg \min _{\beta ^\prime } \mathbb E[(Y-X \beta^\prime )^2]= \left[\mathbb E(X^TX)\right]^{-1}\mathbb E(X^TY), \tag{1}\]
so that \(X \beta\) is the best linear predictor of \(Y\) in terms of \(X\) (\(X\) is regarded as a row vector).
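For reference, the closed form in Equation 1 follows from the first-order condition of this quadratic minimization:
\[ \nabla_{\beta ^\prime}\, \mathbb E[(Y - X\beta^\prime)^2] = -2\, \mathbb E\left[X^T(Y - X\beta^\prime)\right] = 0 \quad \Longrightarrow \quad \mathbb E(X^TX)\,\beta = \mathbb E(X^TY), \]
provided that \(\mathbb E(X^TX)\) is invertible.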
Let \((\textbf Y, \textbf X)\) be a sample of \(N\) independent draws from the joint distribution of \((X, Y)\), with observations stacked vertically in \(N \times 1\) and \(N \times p\) matrices respectively, as is customary. Then the usual Ordinary Least Squares (OLS) estimator of \(\beta\) is given by:
\[ \hat \beta = \arg \min _{\beta ^\prime}\Vert\textbf Y - \textbf X \beta ^\prime\Vert^2=(\textbf X^T\textbf X)^{-1} \textbf X^T \textbf Y. \tag{2}\]
This is a consistent, but generally biased, estimator of \(\beta\).
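For concreteness, here is a minimal numerical sketch of Equation 2; the data-generating process below is an arbitrary choice for illustration:

```python
# Compute the OLS estimator of Equation 2 on one simulated sample.
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 2
X = rng.standard_normal((N, p))              # N x p design matrix
beta_true = np.array([2.0, -1.0])            # illustrative coefficients
Y = X @ beta_true + rng.standard_normal(N)   # linear model plus noise

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y) # (X^T X)^{-1} X^T Y, Equation 2
print(beta_hat)                              # close to beta_true for N = 100
```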
Comparing Equation 1 and Equation 2, consistency follows immediately from the law of large numbers and continuity.
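Spelling the argument out: dividing both factors in Equation 2 by \(N\) and applying the law of large numbers to each sample average,
\[ \hat \beta = \left( \frac{1}{N} \sum_{i=1}^N X_i^T X_i \right)^{-1} \frac{1}{N} \sum_{i=1}^N X_i^T Y_i \overset{P}{\longrightarrow} \left[ \mathbb E(X^T X) \right]^{-1} \mathbb E(X^T Y) = \beta, \]
where \(X_i\) denotes the \(i\)-th row of \(\textbf X\), and continuity of matrix inversion carries the limit through the inverse.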
In order to show that \(\mathbb E (\hat \beta) \neq \beta\) in general, it is sufficient to exhibit a single example. Consider, for instance (example adapted from D.A. Freedman):
\[ X \sim \mathcal N (0, 1),\qquad Y=X(1+aX^2) \] Recalling that \(\mathbb E (X^4) = 3\) for the standard normal, we have:
\[ \beta = \frac{\mathbb E(XY)}{\mathbb E(X^2)} = \mathbb E\left[X^2(1+aX^2)\right] = 1+3a, \] where we have ignored a potential intercept term (which would vanish here, since \(\mathbb E (X) = \mathbb E (Y) = 0\)). To compute \(\mathbb E (\hat \beta)\), we use the identity \(\frac{e^{-z}}{z} = \intop _1 ^\infty \text d t\, e ^{-zt}\), valid for \(z > 0\), to rewrite this expected value as:
\[ \begin{split} \mathbb E (\hat \beta) & = (2 \pi)^{-N/2} \intop \text d\textbf X \,e^{-\sum _j X_j ^2 /2} \dfrac{\sum _i X_i^2(1+aX_i^2)}{\sum _i X_i^2} = \frac{N}{2}\intop_1 ^\infty \text d t\,I(t), \\ I(t) & \equiv (2 \pi)^{-N/2} \intop \text d\textbf X\, e^{-t \sum _j X_j ^2 /2}X_1^2(1+aX_1^2), \end{split} \] where the factor \(N\) comes from the symmetry among the \(X_i\). The inner integral is a product of one-dimensional gaussian integrals and can be computed easily:
\[ I(t) = t^{-\frac{N}{2}}\left(\frac{1}{t}+\frac{3a}{t^2}\right), \] and, performing the elementary \(t\) integral, we eventually find: \[ \mathbb E (\hat \beta) = \frac{N}{2}\intop_1 ^\infty \text d t \left(t^{-\frac{N}{2}-1}+3a\,t^{-\frac{N}{2}-2}\right) = 1+3 a\frac{N}{N+2}. \]
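The \(t\) integral can also be checked symbolically; a minimal sketch using sympy (a tooling choice for illustration, not part of the original derivation):

```python
# Symbolic check of E(beta_hat) = (N/2) * integral_1^oo I(t) dt,
# with I(t) = t^(-N/2) * (1/t + 3a/t^2). Requires sympy.
import sympy as sp

a = sp.Symbol("a", real=True)
N = sp.Symbol("N", positive=True)  # positivity guarantees convergence
t = sp.Symbol("t", positive=True)

I_t = t ** (-N / 2) * (1 / t + 3 * a / t ** 2)
E_beta_hat = sp.simplify(N / 2 * sp.integrate(I_t, (t, 1, sp.oo)))
print(E_beta_hat)  # equivalent to 1 + 3*a*N/(N + 2)
```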
The bias is thus given by:
\[ \beta - \mathbb E (\hat \beta) = \frac{6a}{N+2}. \] This vanishes as \(N^{-1}\), in agreement with the fact that \(\sqrt N (\hat \beta - \beta )\) converges in distribution to a gaussian with zero mean and finite variance (which, under uniform integrability, forces the bias to be \(o(N^{-1/2})\)).
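The result is easy to corroborate by simulation; a minimal Monte Carlo sketch (the values of \(a\), \(N\) and the number of replications below are arbitrary choices):

```python
# Estimate E(beta_hat) in the model Y = X(1 + a X^2), X ~ N(0, 1),
# and compare with the exact values derived above.
import numpy as np

rng = np.random.default_rng(0)
a, N, reps = 0.5, 10, 200_000

x = rng.standard_normal((reps, N))                     # one row per replication
y = x * (1 + a * x ** 2)
beta_hat = (x * y).sum(axis=1) / (x ** 2).sum(axis=1)  # OLS slope, no intercept

print(beta_hat.mean())              # ~ 1 + 3*a*N/(N + 2) = 2.25
print(1 + 3 * a - beta_hat.mean())  # ~ 6*a/(N + 2) = 0.25
```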