The Akaike Information Criterion (AIC) for the linear model \(Y = X \beta + \varepsilon\) takes the form:
\[ \text{AIC}^{\text{(k)}} = \frac{(\mathbf Y-\mathbf X\hat \beta )^2}{\sigma ^2} + 2p \]
if the noise variance \(\sigma ^2 = \mathbb V(\varepsilon\vert X)\) is known, and:
\[ \text{AIC}^{\text{(u)}} = N\ln(\hat \sigma ^2) + 2(p + 1) \]
if \(\sigma^2\) is unknown. Here \(\hat \beta\) denotes the maximum-likelihood estimate of \(\beta\), and \(\hat \sigma ^2 = \frac{1}{N}(\mathbf Y -\mathbf X \hat \beta)^2\) the corresponding estimate of \(\sigma ^2\) when the latter is unknown; \(N\) is the number of observations, \(p\) is the dimension of the covariate vector \(X\), and \((\mathbf Y -\mathbf X \hat \beta)^2\) is shorthand for the residual sum of squares.
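As a minimal sketch of the two definitions, the snippet below computes both criteria with NumPy. The helper names (`fit_ols`, `aic_known`, `aic_unknown`) and the simulated data are illustrative assumptions, not part of the original post; the least-squares fit is used because it coincides with the maximum-likelihood estimate of \(\beta\) under Gaussian noise.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares estimate of beta (the Gaussian maximum-likelihood estimate)."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat


def aic_known(X, y, sigma2):
    """AIC^(k): residual sum of squares over the known variance, plus 2p."""
    rss = np.sum((y - X @ fit_ols(X, y)) ** 2)
    return rss / sigma2 + 2 * X.shape[1]


def aic_unknown(X, y):
    """AIC^(u): N * log(sigma2_hat) + 2(p + 1), with sigma2_hat = RSS / N."""
    n, p = X.shape
    rss = np.sum((y - X @ fit_ols(X, y)) ** 2)
    return n * np.log(rss / n) + 2 * (p + 1)


# Hypothetical simulated data: intercept plus one covariate, unit noise variance.
rng = np.random.default_rng(0)
N, sigma2 = 200, 1.0
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=np.sqrt(sigma2), size=N)

print("AIC (known variance):  ", aic_known(X, y, sigma2))
print("AIC (unknown variance):", aic_unknown(X, y))
```

The two numbers are on different scales and are not meant to be compared with each other; only differences between models under the same criterion matter for selection.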
One would expect knowledge of the variance to have little effect on model selection for the mean, at least in a limit in which the variance can be considered reasonably well estimated. To check that this is actually the case, we expand \(\text{AIC}^{\text{(u)}}\) differences to first order in \(\hat \sigma _1 ^2 - \hat \sigma _2 ^2\):
\[ \begin{split} \text{AIC}^{\text{(u)}}_1-\text{AIC}^{\text{(u)}}_2 &= N\ln(\frac{\hat \sigma ^2_1}{\hat \sigma ^2_2}) + 2(p_1-p_2)\\ &\approx N\frac{\hat \sigma _{1}^2-\hat \sigma _2 ^2}{\hat \sigma _2 ^2} + 2(p_1-p_2)\\ & = \text{AIC}^{\text{(k)}}_1-\text{AIC}^{\text{(k)}}_2+N\frac{(\hat \sigma _{1}^2-\hat \sigma _2 ^2)(\sigma ^2-\hat \sigma _2 ^2)}{\hat \sigma _2 ^2\sigma^2} \end{split} \]
The approximation in the second line requires \(\vert \hat \sigma _1 ^2 - \hat \sigma _2 ^2\vert \ll\hat \sigma _2 ^2\). Furthermore, the last term in the final expression is a small fraction of \(\text{AIC}^{\text{(u)}}_1-\text{AIC}^{\text{(u)}}_2\) if \(|\sigma ^2 -\hat \sigma _2 ^2| \ll \sigma ^2\).
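For completeness, the last equality in the expansion can be spelled out as follows. This is a worked step that uses only the definitions given earlier: since \((\mathbf Y -\mathbf X \hat \beta)^2 = N\hat \sigma ^2\), the known-variance difference is
\[ \text{AIC}^{\text{(k)}}_1-\text{AIC}^{\text{(k)}}_2 = N\frac{\hat \sigma _1 ^2-\hat \sigma _2 ^2}{\sigma ^2} + 2(p_1-p_2), \]
and the first-order term can be split using
\[ \frac{1}{\hat \sigma _2 ^2} = \frac{1}{\sigma ^2} + \frac{\sigma ^2-\hat \sigma _2 ^2}{\hat \sigma _2 ^2\sigma ^2}, \]
which, multiplied by \(N(\hat \sigma _1 ^2-\hat \sigma _2 ^2)\), yields the known-variance difference plus the correction term.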
Putting these two conditions together, we obtain:
\[ |\hat \sigma _1 ^2 -\hat \sigma _2 ^2|,\ |\sigma ^2 -\hat \sigma _2 ^2| \ll \sigma ^2 \]
which means that \(\text{AIC}^{\text{(u)}}\) and \(\text{AIC}^{\text{(k)}}\) lead to the same model selection, provided that the models involved in the AIC comparison estimate the true variance reasonably well.
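The comparison can be checked numerically. The sketch below assumes a simulated dataset with Gaussian noise of known variance and two nested, correctly specified models (the second adds an irrelevant covariate); all names and parameter values are hypothetical choices made for illustration. It evaluates the exact \(\text{AIC}^{\text{(u)}}\) difference, its first-order approximation, the \(\text{AIC}^{\text{(k)}}\) difference, and the correction term from the expansion.

```python
import numpy as np


def sigma2_hat(X, y):
    """Maximum-likelihood variance estimate: residual sum of squares over N."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.mean((y - X @ beta_hat) ** 2)


# Hypothetical setup: y depends on x1 only; x2 is an irrelevant covariate.
rng = np.random.default_rng(42)
N, sigma2 = 5000, 1.0
x1, x2 = rng.normal(size=N), rng.normal(size=N)
y = 1.0 + 2.0 * x1 + rng.normal(scale=np.sqrt(sigma2), size=N)

X1 = np.column_stack([np.ones(N), x1])      # model 1: intercept + x1
X2 = np.column_stack([np.ones(N), x1, x2])  # model 2: intercept + x1 + x2
s1, s2 = sigma2_hat(X1, y), sigma2_hat(X2, y)
p1, p2 = X1.shape[1], X2.shape[1]

# Exact unknown-variance difference and its first-order approximation.
delta_u = N * np.log(s1 / s2) + 2 * (p1 - p2)
first_order = N * (s1 - s2) / s2 + 2 * (p1 - p2)

# Known-variance difference and the correction term from the expansion.
delta_k = N * (s1 - s2) / sigma2 + 2 * (p1 - p2)
correction = N * (s1 - s2) * (sigma2 - s2) / (s2 * sigma2)

print("AIC(u) difference:        ", delta_u)
print("first-order approximation:", first_order)
print("AIC(k) difference:        ", delta_k)
print("correction term:          ", correction)
```

In this setting both variance estimates are close to the true value, so the correction term should be small compared with the differences themselves; shrinking `N` or omitting a relevant covariate from both models is a simple way to see the conditions fail.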
Concluding remarks:
Although the maximum-likelihood estimates plugged into the AIC are derived from normal theory, the result on the equivalence of AIC selection in the known- and unknown-variance cases continues to hold irrespective of this assumption.
What happens in misspecified cases, in which \(\hat \sigma ^2\) does not consistently estimate \(\mathbb V(\varepsilon \vert X)\), either because of non-linearity or heteroskedasticity?
Citation
@online{gherardi2024,
author = {Gherardi, Valerio},
title = {AIC for the Linear Model: Known Vs. Unknown Variance},
date = {2024-03-13},
url = {https://vgherard.github.io/posts/2024-03-13-aic-for-the-linear-model-known-vs-unknown-variance/},
langid = {en}
}