Does knowledge of noise variance have any effect on model selection for the mean?
The Akaike Information Criterion (AIC) for the linear model \(Y = X \beta + \varepsilon\) takes the form:
\[ \text{AIC}^{\text{(k)}} = \frac{(\mathbf Y-\mathbf X\hat \beta )^2}{\sigma ^2} + 2p \] if the noise variance \(\sigma ^2 = \mathbb V(\varepsilon\vert X)\) is known, and:
\[ \text{AIC}^{\text{(u)}} = N\ln(\hat \sigma ^2) + 2(p + 1) \]
if \(\sigma^2\) is unknown. Here \(\hat \beta\) denotes the maximum-likelihood estimate of \(\beta\), and \(\hat \sigma ^2 = \frac{1}{N}(\mathbf Y -\mathbf X \hat \beta)^2\) the corresponding estimate of \(\sigma ^2\) if the latter is unknown; \(p\) is the dimension of the covariate vector \(X\).
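As a concrete illustration, the two criteria can be computed from an ordinary least-squares fit. The following is a minimal sketch in Python with NumPy, not part of the original analysis: the helper names `ols_rss`, `aic_known` and `aic_unknown`, the simulated data, and the parameter values are all assumptions made for the example.

```python
import numpy as np

def ols_rss(X, y):
    """Return the least-squares (ML) estimate of beta and the residual sum of squares."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    return beta_hat, float(resid @ resid)

def aic_known(X, y, sigma2):
    """AIC^(k) = RSS / sigma^2 + 2p, with the noise variance sigma2 supplied."""
    _, rss = ols_rss(X, y)
    return rss / sigma2 + 2 * X.shape[1]

def aic_unknown(X, y):
    """AIC^(u) = N ln(RSS / N) + 2(p + 1), with the variance estimated by RSS / N."""
    n, p = X.shape
    _, rss = ols_rss(X, y)
    return n * np.log(rss / n) + 2 * (p + 1)

# Simulated data (hypothetical values chosen only for illustration).
rng = np.random.default_rng(0)
n, sigma2 = 200, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one covariate
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=np.sqrt(sigma2), size=n)

print("AIC (known variance):  ", aic_known(X, y, sigma2))
print("AIC (unknown variance):", aic_unknown(X, y))
```

Note that the two values are not directly comparable with each other, since additive constants have been dropped from both formulas; only differences between models within the same criterion are meaningful.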
One would expect knowledge of the variance to have little effect on model selection for the mean, at least in a regime in which the variance can be considered reasonably well estimated. To check that this is actually the case, we expand \(\text{AIC}^{\text{(u)}}\) differences to first order in \(\hat \sigma _1 ^2 - \hat \sigma _2 ^2\):
\[ \begin{split} \text{AIC}^{\text{(u)}}_1-\text{AIC}^{\text{(u)}}_2 &= N\ln(\frac{\hat \sigma ^2_1}{\hat \sigma ^2_2}) + 2(p_1-p_2)\\ &\approx N\frac{\hat \sigma _{1}^2-\hat \sigma _2 ^2}{\hat \sigma _2 ^2} + 2(p_1-p_2)\\ & = \text{AIC}^{\text{(k)}}_1-\text{AIC}^{\text{(k)}}_2+N\frac{(\hat \sigma _{1}^2-\hat \sigma _2 ^2)(\sigma ^2-\hat \sigma _2 ^2)}{\hat \sigma _2 ^2\sigma^2} \end{split} \] The approximation in the second line requires \(\vert \hat \sigma _1 ^2 - \hat \sigma _2 ^2\vert \ll\hat \sigma _2 ^2\). Furthermore, the last term in the final expression is a small fraction of \(\text{AIC}^{\text{(u)}}_1-\text{AIC}^{\text{(u)}}_2\) if \(|\sigma ^2 -\hat \sigma _2 ^2| \ll \sigma ^2\).
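For completeness, the last equality in the expansion above can be checked in two steps. Since \(N\hat \sigma _i ^2 = (\mathbf Y -\mathbf X \hat \beta_i)^2\) for each model \(i\) (where \(\hat \beta_i\) is shorthand, introduced here, for the estimate under model \(i\)), the known-variance difference reads:
\[ \text{AIC}^{\text{(k)}}_1-\text{AIC}^{\text{(k)}}_2 = N\frac{\hat \sigma _{1}^2-\hat \sigma _2 ^2}{\sigma ^2} + 2(p_1-p_2), \]
and the result follows from the elementary identity:
\[ \frac{1}{\hat \sigma _2 ^2} = \frac{1}{\sigma ^2} + \frac{\sigma ^2-\hat \sigma _2 ^2}{\hat \sigma _2 ^2\sigma^2}. \]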
Putting these two conditions together, we obtain:
\[ |\hat \sigma _1 ^2 -\hat \sigma _2 ^2|,|\sigma ^2 -\hat \sigma _2 ^2| \ll \sigma ^2, \] which means that \(\text{AIC}^{\text{(u)}}\) and \(\text{AIC}^{\text{(k)}}\) lead to the same model selection provided that the models involved in the AIC comparison estimate the true variance reasonably well.
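The comparison can also be checked numerically. The sketch below is an illustration under assumed settings, not a result from the original text: it simulates data from a linear model, fits two nested candidate models, and prints the exact \(\text{AIC}^{\text{(u)}}\) difference, its first-order approximation, the \(\text{AIC}^{\text{(k)}}\) difference, and the correction term. The helper `rss`, the seed, the sample size and the coefficients are all hypothetical choices.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    return float(resid @ resid)

# Assumed simulation settings (arbitrary, for illustration only).
rng = np.random.default_rng(1)
n, sigma2 = 500, 2.0
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 0.5 * x1 + rng.normal(scale=np.sqrt(sigma2), size=n)  # x2 plays no role

X1 = np.column_stack([np.ones(n), x1])      # model 1: intercept + x1
X2 = np.column_stack([np.ones(n), x1, x2])  # model 2: adds the irrelevant x2
p1, p2 = X1.shape[1], X2.shape[1]
s1, s2 = rss(X1, y) / n, rss(X2, y) / n     # ML variance estimates

d_u = n * np.log(s1 / s2) + 2 * (p1 - p2)       # exact AIC^(u) difference
d_lin = n * (s1 - s2) / s2 + 2 * (p1 - p2)      # first-order approximation
d_k = n * (s1 - s2) / sigma2 + 2 * (p1 - p2)    # AIC^(k) difference
corr = n * (s1 - s2) * (sigma2 - s2) / (s2 * sigma2)  # correction term

print("AIC(u) difference:         ", d_u)
print("first-order approximation: ", d_lin)
print("AIC(k) difference:         ", d_k)
print("correction term:           ", corr)
```

By construction, the first-order approximation equals the known-variance difference plus the correction term up to floating-point error; how close the exact unknown-variance difference comes to these depends on how well the two conditions above are satisfied in the simulated sample.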
Concluding remarks:
Although the maximum-likelihood estimates plugged into the AIC are derived from normal theory, the argument above for the equivalence of AIC selection in the known and unknown variance cases uses only the algebraic form of the two criteria, and so continues to hold irrespective of this assumption.
What happens in misspecified cases, in which \(\hat \sigma ^2\) does not consistently estimate \(\mathbb V(\varepsilon \vert X)\), either because of non-linearity or heteroskedasticity?
Text and figures are licensed under Creative Commons Attribution CC BY-SA 4.0. Source code is available at https://github.com/vgherard/vgherard.github.io/, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Gherardi (2024, March 13). vgherard: AIC for the linear model: known vs. unknown variance. Retrieved from https://vgherard.github.io/posts/2024-03-13-aic-for-the-linear-model-known-vs-unknown-variance/