Disclaimer. These are rough notes on Maximum Likelihood that still need a thorough round of polishing (labor limae). Use at your own risk!
Let \(\mathcal Q \equiv\{\text d Q_{\theta} = q_\theta \,\text d \mu\}_{\theta \in \Theta}\) be a parametric family of probability measures dominated by some common measure \(\mu\). Consider the functional [1]:
\[ \theta ^* (P) = \arg \min_{\theta \in \Theta} \intop \text dP\,\ln \left(\frac{1}{q_\theta}\right) \tag{1}. \] This is the parameter of the best (in the cross-entropy sense) approximation of \(P\) within \(\mathcal Q\), which we assume to be unique.
If \(P\) represents the true probability distribution of the data under study, \(\theta ^*(P)\) is the target of ML estimation, in the general case in which \(P\) is not necessarily in \(\mathcal Q\). The ML estimate \(\hat \theta _N\) of \(\theta^*\) from an i.i.d. sample of \(N\) observations is [2]:
\[ \hat \theta _N \equiv \theta ^*(\hat P _N)=\arg \max_{\theta \in \Theta} \sum_{i=1}^N \ln ({q_\theta(Y_i)}), \tag{2} \] where \(\hat P _N\) is the empirical distribution of the sample.
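As a minimal numerical sketch of (1) and (2), assume (purely for illustration: neither the model nor the data distribution below come from these notes) an exponential model \(q_\theta(y)=\theta e^{-\theta y}\) fitted to data drawn from a Gamma distribution with shape 2 and scale 1, so that \(P \notin \mathcal Q\). In this case the cross-entropy in (1) is \(-\ln\theta+\theta\,\mathbb E(Y)\), so that \(\theta^*(P)=1/\mathbb E(Y)=0.5\), while (2) gives \(\hat\theta_N = 1/\bar Y\):

```python
import math
import random


def cross_entropy(theta, ys):
    """Empirical cross-entropy -(1/N) sum_i ln q_theta(Y_i), with q_theta(y) = theta * exp(-theta * y)."""
    return -sum(math.log(theta) - theta * y for y in ys) / len(ys)


def mle_exponential(ys):
    """Closed-form maximizer of the exponential log-likelihood: 1 / sample mean."""
    return len(ys) / sum(ys)


rng = random.Random(0)  # fixed seed, chosen arbitrarily
# Illustrative "true" distribution P: Gamma(shape=2, scale=1), which is NOT exponential.
ys = [rng.gammavariate(2.0, 1.0) for _ in range(20_000)]

theta_star = 1.0 / 2.0  # 1 / E(Y), since E(Y) = shape * scale = 2
theta_hat = mle_exponential(ys)

print(f"theta_hat = {theta_hat:.4f}, theta_star = {theta_star:.4f}")
# theta_hat minimizes the empirical cross-entropy, so nearby values do no better.
print(cross_entropy(theta_hat, ys) <= cross_entropy(1.05 * theta_hat, ys))
```

Even though no exponential distribution matches \(P\), the estimate settles near the cross-entropy minimizer \(\theta^*\), which is the sense in which \(\theta^*(P)\) is the target of ML estimation.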
Denoting:
\[ c_{P}(\theta) = \intop \text dP\,\ln \left(\frac{1}{q_\theta}\right),\tag{3} \]
we see that \(\theta^*\) is determined by the condition \(c_{P}'(\theta^*)=0\). From this, we can easily derive the first order variation of \(\theta ^*\) under a variation \(P \to P + \delta P\):
\[ \delta \theta ^* =\left(\intop \text dP\,I_{\theta ^*} \right)^{-1}\left(\intop \text d(\delta P)u_{\theta ^*}\right)\tag{4} \]
where we have defined:
\[ u_\theta = \frac{\partial }{\partial \theta} \ln q_\theta,\quad I_\theta = -\frac{\partial^2 }{\partial \theta ^2} \ln q_\theta.\tag{5} \] From (4) we can identify the influence function of the \(\theta ^*\) functional:
\[ \psi_P(y)=\left(\intop \text dP\,I_{\theta ^*} \right)^{-1}u_{\theta ^*}(y)\tag{6} \]
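As a quick sanity check of (4) and (6), consider the exponential model \(q_\theta(y)=\theta e^{-\theta y}\) on \(y>0\) (an illustrative assumption, not part of the general setup). Here:
\[ u_\theta(y)=\frac{1}{\theta}-y,\quad I_\theta(y)=\frac{1}{\theta^2},\quad \theta^*(P)=\frac{1}{\mathbb E(Y)}, \]
so that (6) reads:
\[ \psi_P(y)=(\theta^*)^2\left(\frac{1}{\theta^*}-y\right)=\theta^*\,(1-\theta^*y). \]
The same result follows from varying \(\theta^*(P)=1/\mathbb E(Y)\) directly, which gives \(\delta\theta^*=-(\theta^*)^2\intop \text d(\delta P)\,y\). This agrees with (4) because \(\intop \text d(\delta P)=0\) for any variation that preserves normalization.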
Then, from the standard theory of influence functions, we have:
\[ \hat \theta _N \approx \theta ^*+J ^{-1} U\tag{7} \] where we have defined:
\[ J\equiv \intop \text dP\,I_{\theta ^*},\quad U\equiv\frac{1}{N}\sum _{i=1}^Nu_{\theta ^*}(Y_i)\tag{8}. \] In particular, we obtain the Central Limit Theorem (CLT)
\[ \sqrt N(\hat \theta _N - \theta ^*) \overset{d}{\to} \mathcal N(0, J^{-1}KJ^{-1}),\tag{9} \]
with:
\[ K = \mathbb V(u_{\theta ^*}(Y)). \tag{10} \] The matrices \(K\) and \(J\) depend on the unknown value \(\theta ^*\), but we can readily construct plugin estimators:
\[ \hat J_N = \frac{1}{N}\sum _{i=1}^NI_{\hat \theta _N}(Y_i),\quad\hat K_N = \frac{1}{N}\sum _{i=1}^Nu_{\hat \theta _N}(Y_i)u_{\hat \theta _N}(Y_i)^T,\tag{11} \] and estimate the variance of \(\hat \theta _N\) as:
\[ \widehat {\mathbb V}(\hat \theta _N) = \frac{\hat J _N ^{-1}\hat K_N\hat J_N ^{-1}}{N},\tag{12} \] which is the usual sandwich estimator. Finally, if \(P = Q_{\theta^*}\), then \(J = K\), and the CLT (9) becomes simply \(\sqrt N(\hat \theta _N - \theta ^*) \overset{d}{\to} \mathcal N(0, J^{-1})\).
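The plugin estimators above can be sketched numerically. The example below is only an illustration under assumed ingredients: an exponential model \(q_\theta(y)=\theta e^{-\theta y}\) (for which \(u_\theta(y)=1/\theta-y\) and \(I_\theta(y)=1/\theta^2\)) fitted to Gamma(shape 2, scale 1) data; the function name `sandwich_variance` is hypothetical. For this choice \(J=4\) and \(K=\mathbb V(Y)=2\), so \(J^{-1}KJ^{-1}=0.125\), whereas the well-specified formula would give \(J^{-1}=0.25\):

```python
import random


def sandwich_variance(ys):
    """Plugin variance estimates for the exponential-model MLE (scalar parameter)."""
    n = len(ys)
    theta_hat = n / sum(ys)                      # MLE: 1 / sample mean
    scores = [1.0 / theta_hat - y for y in ys]   # u_theta(y) = d/dtheta ln q_theta(y)
    j_hat = 1.0 / theta_hat**2                   # average of I_theta(y) = 1/theta^2 (constant in y)
    k_hat = sum(u * u for u in scores) / n       # average of u^2
    robust = k_hat / (j_hat**2 * n)              # J^-1 K J^-1 / N
    model_based = 1.0 / (j_hat * n)              # J^-1 / N, justified only if P is in the family
    return theta_hat, robust, model_based


rng = random.Random(0)  # fixed seed, chosen arbitrarily
n = 20_000
ys = [rng.gammavariate(2.0, 1.0) for _ in range(n)]  # misspecified: data are not exponential

theta_hat, robust, model_based = sandwich_variance(ys)
print(f"N * sandwich variance    = {n * robust:.4f}  (theory: 0.125)")
print(f"N * model-based variance = {n * model_based:.4f}  (theory under P in Q: 0.25)")
```

In this toy case the model-based formula overstates the variance by roughly a factor of two; with other forms of misspecification it can just as well understate it.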
Let us now consider the following expansion of \(c_P(\hat \theta _N)\) which, we recall, is the cross-entropy of the ML model on the true distribution \(P\) (cf. (3)):
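\[ \begin{split} c_P(\hat \theta _N) &= -\intop \text d P(y')\,\ln (q_{\hat \theta _N}(y'))\\ & \approx -\mathbb E'(\ln q_{\theta^*})+\frac{1}{2}(\hat \theta _N-\theta ^*)^TJ (\hat \theta _N-\theta ^*)\\ & \approx -\mathbb E'(\ln q_{\theta^*})+\frac{1}{2}U^TJ^{-1}U \end{split} \] Here \(\mathbb E'\) denotes the expectation with respect to a new observation \(y'\) drawn from \(P\) independently of the sample; the first-order term vanishes because \(c_P'(\theta^*)=0\), and the last step uses (7). Taking the expectation with respect to the training dataset, and noting that \(\mathbb E(U)=0\) and hence \(\mathbb E(UU^T)=\mathbb V(U)=K/N\), we get: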
\[ \mathbb E (c_P(\hat \theta _N))\approx -\mathbb E'(\ln q_{\theta^*})+\frac{1}{2N}\text {Tr}(J^{-1}K) \tag{13} \] Now consider the in-sample estimate:
\[ \begin{split} c_{\hat P _N}(\hat \theta _N) &= -\frac{1}{N}\sum _{i=1}^N\ln q_{\hat \theta _N}(Y_i)\\ & \approx - \frac{1}{N}\sum _{i=1} ^N \ln q_{\theta^*}(Y_i)- U^T(\hat \theta _N-\theta^*)+ \frac{1}{2}(\hat \theta _N-\theta^*)^T\hat J_N(\hat \theta _N-\theta^*)\\ & \approx - \frac{1}{N}\sum _{i=1} ^N \ln q_{\theta^*}(Y_i)- U^TJ ^{-1} U+ \frac{1}{2}U^TJ ^{-1}\hat J_N J^{-1}U\\ & \approx - \frac{1}{N}\sum _{i=1} ^N \ln q_{\theta^*}(Y_i)- \frac{1}{2}U^TJ ^{-1} U, \end{split} \] where the second step uses (7) and the last step uses \(\hat J_N \approx J\) to leading order. Taking the expectation:
\[ \mathbb E (c_{\hat P _N}(\hat \theta _N)) \approx -\mathbb E'(\ln q_{\theta^*})-\frac{1}{2N}\text{Tr}(J^{-1}K)\tag{14} \] Comparing Eqs. (13) and (14), we see that:
\[ \text{TIC}\equiv -\frac{1}{N}\sum _{i=1}^N\ln q_{\hat \theta}(Y_i)+\frac{1}{N}\text{Tr}(J^{-1}K)\tag{15} \] provides an asymptotically unbiased estimate of \(\mathbb E (c_P(\hat \theta _N))\), the expected cross-entropy of a model from family \(\mathcal Q\) estimated on a sample of \(N\) observations.
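The bias correction in (15) can be checked by simulation. The sketch below again assumes, for illustration only, an exponential model \(q_\theta(y)=\theta e^{-\theta y}\) fitted to Gamma(shape 2, scale 1) data, and replaces the unknown \(J\) and \(K\) with their plugin estimates; the function names are hypothetical. For this toy model the true cross-entropy \(c_P(\theta)=-\ln\theta+2\theta\) is available in closed form and \(\text{Tr}(J^{-1}K)=0.5\), so the in-sample estimate should be too optimistic by about \(0.5/N\) on average:

```python
import math
import random


def tic_exponential(ys):
    """Return (in-sample cross-entropy, plugin penalty Tr(J^-1 K)/N, TIC) for the exponential model."""
    n = len(ys)
    theta_hat = n / sum(ys)                                   # MLE: 1 / sample mean
    ce_in = -sum(math.log(theta_hat) - theta_hat * y for y in ys) / n
    j_hat = 1.0 / theta_hat**2                                # plugin J (I_theta = 1/theta^2)
    k_hat = sum((1.0 / theta_hat - y) ** 2 for y in ys) / n   # plugin K (average squared score)
    penalty = (k_hat / j_hat) / n
    return ce_in, penalty, ce_in + penalty


def true_cross_entropy(theta, mean_y=2.0):
    """c_P(theta) = -ln(theta) + theta * E(Y); E(Y) = 2 for Gamma(shape=2, scale=1)."""
    return -math.log(theta) + theta * mean_y


rng = random.Random(0)  # fixed seed, chosen arbitrarily
n, reps = 50, 10_000
gap = 0.0       # Monte Carlo average of c_P(theta_hat) - c_{P_hat}(theta_hat)
tic_bias = 0.0  # Monte Carlo average of TIC - c_P(theta_hat)
for _ in range(reps):
    ys = [rng.gammavariate(2.0, 1.0) for _ in range(n)]
    ce_in, penalty, tic = tic_exponential(ys)
    ce_true = true_cross_entropy(n / sum(ys))
    gap += (ce_true - ce_in) / reps
    tic_bias += (tic - ce_true) / reps

print(f"average optimism of in-sample estimate: {gap:.4f}  (theory: about {0.5 / n:.4f})")
print(f"average bias of TIC: {tic_bias:.4f}")
```

Note that the penalty here is about half of the \(1/N\) that a parameter-counting correction would use for a one-parameter model; the two coincide only when \(J=K\).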
The previous derivation assumed the \(Y_i\) to be i.i.d. and does not apply, strictly speaking, to the case of regression, for which we need some more machinery. Assume that the pairs \((X_i,\,Y_i)\) are drawn independently from a joint distribution of \((X,\,Y)\), let \(q_\theta(y\vert x)\) be a parametric model for the conditional density of \(Y\) given \(X\), and denote by \(\mathbf X = (X_1,\dots,X_N)\) the sample of regressors. In analogy with the i.i.d. case, we define:
\[ \begin{split} \theta ^*(P;\mathbf X)&=\arg\min _{\theta} \frac{1}{N}\sum_{i=1}^N\intop \text dP(y\vert X_i)\,\ln \left(\frac{1}{q_{\theta}(y\vert X_i)}\right),\\ \theta ^*(P)&=\arg\min _{\theta} \intop \text dP(y,x)\,\ln \left(\frac{1}{q_{\theta}(y\vert x)}\right),\\ \hat \theta _N&=\arg\min _{\theta} \sum _{i=1}^N\ln \left(\frac{1}{q_{\theta}(Y_i\vert X_i)}\right). \end{split}\tag{16} \] Noticing that \(\hat \theta _N\) is a plugin estimate of \(\theta ^*(P)\), we can repeat mutatis mutandis the steps leading to the CLT (9), which is also valid in this case.
Rather than doing so, let us consider \(\hat \theta _N\) as the \(\mathbf X\)-conditional plugin estimate of \(\theta ^*(P;\mathbf X)\), and the latter as a plugin estimate of \(\theta ^*(P)\) interpreted as a functional of the \(X\) marginal distribution. Then, a parallel derivation to the one provided above for the i.i.d. case shows the conditional convergence in distribution:
\[ \sqrt N(\hat \theta _N - \theta ^*(P;\mathbf X))\overset{d \vert \mathbf X}{\to} \mathcal N(0, J_{N}^{-1}(\mathbf X)K_{N}(\mathbf X)J_{N}^{-1}(\mathbf X)),\tag{17} \] as well as the unconditional convergence:
\[ \sqrt N(\theta ^*(P;\mathbf X) - \theta ^*(P))\overset{d }{\to} \mathcal N(0, J^{-1}\tilde K J^{-1}),\tag{18} \]
where the various matrices are defined as:
\[ \begin{split} J_N(\mathbf X)&\equiv \frac{1}{N}\sum _{i=1}^N\mathbb E\left[I _{\theta} \bigg\vert X=X_i\right]\bigg\vert_{\theta = \theta ^*(P;\mathbf X)},\\ K_N(\mathbf X)&\equiv\frac{1}{N}\sum _{i=1}^N\mathbb V\left[u _{\theta }\bigg\vert X=X_i\right]\bigg\vert_{\theta = \theta ^*(P;\mathbf X)}, \end{split} \tag{19} \] and: \[ \begin{split} J&\equiv \mathbb E\left[I_{\theta^*} \right],\\ \tilde K&\equiv\mathbb V\left[\mathbb E\left(u_{\theta ^*} \vert X\right)\right]. \end{split} \tag{20} \] Here \(I_\theta\) and \(u_\theta\) are again defined as in (5), with \(q_\theta(y\vert x)\) in place of \(q_\theta(y)\), and are regarded as functions of the random pair \((X,\,Y)\) rather than of \(Y\) alone. Although Eqs. (19) are written for \(\theta = \theta ^*(P;\mathbf X)\), to the order of the present approximation we may as well substitute \(\theta ^*(P;\mathbf X) \approx \theta ^*(P) \equiv \theta^*\). Doing this, we can easily see that, by the law of large numbers, \(J_N(\mathbf X) \to J\) and \(K_N(\mathbf X) \to \mathbb E\left[\mathbb V\left(u_{\theta } \vert X\right)\right]\bigg\vert_{\theta = \theta ^*}\). This can be used to find the unconditional variance of \(\hat \theta _N\):
\[ \begin{split} \mathbb V(\hat \theta _N) &=\mathbb E (\mathbb V(\hat \theta _N \vert \mathbf X))+\mathbb V (\mathbb E(\hat \theta _N \vert \mathbf X))\\ &\approx\mathbb E (\mathbb V(\hat \theta _N \vert \mathbf X))+\mathbb V (\theta ^*(P;\mathbf X))\\ &\approx\frac{1}{N}J^{-1}\left(\mathbb E\left[\mathbb V\left(u_{\theta ^*} \vert X\right)\right]+\mathbb V\left[\mathbb E\left(u_{\theta ^*} \vert X\right)\right]\right)J^{-1}\\ &= \frac{1}{N}J^{-1} KJ^{-1} \end{split} \] with \(K = \mathbb V(u_{\theta^*})\) as in the i.i.d. case (by the law of total variance), in agreement with the CLT (9). Our derivation here shows how the variance of \(\hat \theta _N\) decomposes into a component due to the variability of \(X\), and a component due to the residual variability of \(Y\) given \(X\).
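To make the two components concrete, consider as an illustration (this specific model is an assumption, not part of the general derivation) a scalar regressor and the unit-variance Gaussian linear model \(q_\theta(y\vert x)=(2\pi)^{-1/2}\exp\left(-\frac{1}{2}(y-\theta x)^2\right)\). Writing \(f(x)=\mathbb E(Y\vert X=x)\) and \(\sigma^2(x)=\mathbb V(Y\vert X=x)\) for the true conditional mean and variance, we have:
\[ u_\theta=(y-\theta x)\,x,\quad I_\theta=x^2,\quad \theta^*=\frac{\mathbb E(XY)}{\mathbb E(X^2)},\quad J=\mathbb E(X^2), \]
and therefore:
\[ \mathbb E\left(u_{\theta^*}\vert X\right)=\left(f(X)-\theta^*X\right)X,\quad \mathbb V\left(u_{\theta^*}\vert X\right)=X^2\sigma^2(X). \]
The two contributions to the variance of \(\hat\theta_N\) are then, to leading order:
\[ \mathbb V(\hat\theta_N)\approx\frac{1}{N}\,\frac{\mathbb E\left[X^2\sigma^2(X)\right]+\mathbb E\left[\left(f(X)-\theta^*X\right)^2X^2\right]}{\mathbb E(X^2)^2}, \]
where we used \(\mathbb E\left[\left(f(X)-\theta^*X\right)X\right]=0\), which follows from the definition of \(\theta^*\). The first term is the contribution of the residual variability of \(Y\) given \(X\). The second term comes from the variability of \(X\) and vanishes when the conditional mean is correctly specified, \(f(x)=\theta^*x\).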
The corresponding result for the TIC (15) is slightly less straightforward. Repeating the steps leading to this equation for a fixed sample of regressors \(\mathbf X\), we find that:
\[
\mathbb E (\text{TIC}\vert \mathbf X)=\intop \prod_{i=1}^N\text dP(y_i\vert X_i)\,\,\frac{1}{N}\sum_{j=1}^N\intop \text dP(y^\prime\vert X_j)\ln \left(\frac{1}{q_{\hat \theta_N}(y^\prime \vert X_j)}\right),\tag{21}
\]
where the outer integral is a conditional expectation over the sample responses,
while the inner integrals are expectations with respect to a new response
associated with a sample regressor \(X_j\). If we now average over \(\mathbf X\), we
find:
\[ \mathbb E (\text{TIC})=\intop \prod_{i=1}^N\text dP(x_i,y_i)\,\,\frac{1}{N}\sum_{j=1}^N\intop \text dP(y^\prime\vert x_j)\ln \left(\frac{1}{q_{\hat \theta_N}(y^\prime \vert x_j)}\right)=\mathbb E(\text{CE}_\text{in}).\tag{22} \] The right-hand side is the expected in-sample cross-entropy, which is in general different from the extra-sample cross-entropy:
\[ \mathbb E(\text{CE}) =\intop \prod_{i=1}^N\text dP(x_i,y_i)\intop \text dP(x^\prime,y^\prime)\ln \left(\frac{1}{q_{\hat \theta_N}(y^\prime \vert x^\prime)}\right). \tag{23} \]
[1] The definition does not depend on the representative \(q_\theta = \frac{\text d Q_\theta}{\text d \mu}\) chosen for the \(\mu\)-density of \(Q_\theta\) if \(P\) is also absolutely continuous with respect to \(\mu\), which we tacitly assume. Typically \(\mu\) would be some relative of the Lebesgue or counting measure, in continuous and discrete settings respectively.
[2] As a random variable, \(\hat \theta _N\) is also independent (modulo a measure zero set) of the specific \(L^1\) representative \(q_\theta\) if \(P\) is absolutely continuous with respect to \(\mu\).
Text and figures are licensed under Creative Commons Attribution CC BY-SA 4.0. Source code is available at https://github.com/vgherard/vgherard.github.io/, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Gherardi (2024, March 14). vgherard: Maximum Likelihood. Retrieved from https://vgherard.github.io/notebooks/maximum-likelihood/
BibTeX citation
@misc{gherardi2024maximum, author = {Gherardi, Valerio}, title = {vgherard: Maximum Likelihood}, url = {https://vgherard.github.io/notebooks/maximum-likelihood/}, year = {2024} }