Maximum Likelihood

Valerio Gherardi https://vgherard.github.io
2024-03-14

Disclaimer. These are wild notes on Maximum Likelihood that still require a thorough labor limae session. Use at your own risk!

Let \(\mathcal Q \equiv\{\text d Q_{\theta} = q_\theta \,\text d \mu\}_{\theta \in \Theta}\) be a parametric family of probability measures dominated by some common measure \(\mu\). Consider the functional¹:

\[ \theta ^* (P) = \arg \min_{\theta \in \Theta} \intop \text dP\,\ln \left(\frac{1}{q_\theta}\right) \tag{1}. \] This is the parameter of the best (in the cross-entropy sense) approximation of \(P\) within \(\mathcal Q\), which we assume to be unique.
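For concreteness, here is a small numerical sketch of (1), under entirely arbitrary choices: the working family is the exponential model \(q_\theta(y) = \theta e^{-\theta y}\), while the true \(P\) is a Gamma distribution, so that \(\mathcal Q\) is misspecified. In this case \(c_P(\theta) = -\ln \theta + \theta\, \mathbb E_P(Y)\), whose minimizer is \(\theta^* = 1/\mathbb E_P(Y)\); the numerical minimization should reproduce this value.

```python
# Numerical illustration of Eq. (1): cross-entropy projection of a true
# distribution P onto a misspecified working family.  The Gamma "truth" and
# the exponential working model are arbitrary choices made for this sketch.
import numpy as np
from scipy import integrate, optimize, stats

p = stats.gamma(a=2.0)  # true distribution P (mean = 2)

def cross_entropy(theta):
    # c_P(theta) = -E_P[ log q_theta(Y) ] for q_theta = Exponential(rate=theta)
    integrand = lambda y: -p.pdf(y) * (np.log(theta) - theta * y)
    return integrate.quad(integrand, 0, np.inf)[0]

res = optimize.minimize_scalar(cross_entropy, bounds=(1e-3, 10.0), method="bounded")
print(res.x)         # numerical theta*, approximately 0.5
print(1 / p.mean())  # closed form theta* = 1 / E_P(Y)
```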

If \(P\) represents the true probability distribution of the data under study, \(\theta ^*(P)\) is the target of ML estimation, in the general case in which \(P\) is not necessarily in \(\mathcal Q\). The ML estimate \(\hat \theta _N\) of \(\theta^*\) from an i.i.d. sample of \(N\) observations is²:

\[ \hat \theta _N \equiv \theta ^*(\hat P _N)=\arg \max_{\theta \in \Theta} \sum_{i=1}^N \ln ({q_\theta(Y_i)}), \tag{2} \] where \(\hat P _N\) is the empirical distribution of the sample.
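Continuing the toy example above, the ML estimate (2) replaces the integral over \(P\) with an empirical average. For the exponential working model the maximizer has the closed form \(\hat\theta_N = 1/\bar Y\), which the generic numerical optimization below should recover; again, all concrete choices are illustrative.

```python
# Eq. (2) in the same toy setting: the ML estimate is theta*(P_hat_N), i.e. the
# cross-entropy minimizer with P replaced by the empirical distribution.
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(1)
y = stats.gamma(a=2.0).rvs(size=1000, random_state=rng)  # i.i.d. sample from P

def neg_log_lik(theta):
    # empirical cross-entropy: (1/N) * sum_i log(1 / q_theta(Y_i))
    return -np.mean(np.log(theta) - theta * y)

fit = optimize.minimize_scalar(neg_log_lik, bounds=(1e-3, 10.0), method="bounded")
print(fit.x)         # theta_hat_N
print(1 / y.mean())  # closed-form MLE for the exponential model
# Both should be close to theta* = 1 / E_P(Y) = 0.5 for large N.
```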

Denoting:

\[ c_{P}(\theta) = \intop \text dP\,\ln \left(\frac{1}{q_\theta}\right),\tag{3} \]

we see that \(\theta^*\) is determined by the condition \(c_{P}'(\theta^*)=0\). From this, we can easily derive the first order variation of \(\theta ^*\) under a variation \(P \to P + \delta P\):

\[ \delta \theta ^* =\left(\intop \text dP\,I_{\theta ^*} \right)^{-1}\left(\intop \text d(\delta P)u_{\theta ^*}\right)\tag{4} \]

where we have defined:

\[ u_\theta = \frac{\partial }{\partial \theta} \ln q_\theta,\quad I_\theta = -\frac{\partial^2 }{\partial \theta ^2} \ln q_\theta.\tag{5} \] From (4) we can identify the influence function of the \(\theta ^*\) functional:

\[ \psi_P(y)=\left(\intop \text dP\,I_{\theta ^*} \right)^{-1}u_{\theta ^*}(y)\tag{6} \]

Then, from the standard theory of influence functions, we have:

\[ \hat \theta _N \approx \theta ^*+J ^{-1} U\tag{7} \] where we have defined:

\[ J\equiv \intop \text dP\,I_{\theta ^*},\quad U\equiv\frac{1}{N}\sum _{i=1}^Nu_{\theta ^*}(Y_i)\tag{8}. \] In particular, we obtain the Central Limit Theorem (CLT)

\[ \sqrt N(\hat \theta _N - \theta ^*) \overset{d}{\to} \mathcal N(0, J^{-1}KJ^{-1}),\tag{9} \]

with:

\[ K = \mathbb V(u_{\theta ^*}(Y)). \tag{10} \] The matrices \(K\) and \(J\) depend on the unknown value \(\theta ^*\), but we can readily construct plugin estimators:

\[ \hat J_N = \frac{1}{N}\sum _{i=1}^NI_{\hat \theta _N}(Y_i),\quad\hat K_N = \frac{1}{N}\sum _{i=1}^Nu_{\hat \theta _N}(Y_i)u_{\hat \theta _N}(Y_i)^T,\tag{11} \] and estimate the variance of \(\hat \theta _N\) as:

\[ \widehat {\mathbb V}(\hat \theta _N) = \frac{\hat J _N ^{-1}\hat K_N\hat J_N ^{-1}}{N}\tag{12}, \] which is the usual sandwich estimator. Finally, if \(P = Q_{\theta^*}\), then \(J = K\), and the CLT (9) becomes simply \(\sqrt N(\hat \theta _N - \theta ^*) \overset{d}{\to} \mathcal N(0, J^{-1})\).
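A sketch of the plugin estimates (11) and the sandwich estimator (12) in the same exponential-versus-Gamma toy setting, where the score and observed information have the closed forms \(u_\theta(y) = 1/\theta - y\) and \(I_\theta(y) = 1/\theta^2\). The Monte Carlo replication at the end is only a sanity check against the empirical variability of \(\hat\theta_N\); since the model is misspecified, the sandwich estimate tracks it while the naive \(\hat J_N^{-1}/N\) does not.

```python
# Sandwich variance estimate (11)-(12) for the exponential working model,
# checked against the Monte Carlo variance of theta_hat over replicated samples.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
p = stats.gamma(a=2.0)  # true P; the exponential working model is misspecified
N, n_rep = 500, 2000

def mle(y):
    return 1.0 / y.mean()  # exponential-model ML estimate

y = p.rvs(size=N, random_state=rng)
theta_hat = mle(y)
J_hat = 1.0 / theta_hat**2                   # (1/N) sum_i I_{theta_hat}(Y_i)
K_hat = np.mean((1.0 / theta_hat - y) ** 2)  # (1/N) sum_i u_{theta_hat}(Y_i)^2
var_sandwich = K_hat / (J_hat**2 * N)        # Eq. (12), scalar case
var_naive = 1.0 / (J_hat * N)                # appropriate only if P were in the family

theta_hats = np.array([mle(p.rvs(size=N, random_state=rng)) for _ in range(n_rep)])
print(var_sandwich, var_naive, theta_hats.var())
```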

Let us now consider the following expansion of \(c_P(\hat \theta _N)\), which, we recall, is the cross-entropy of the ML model with respect to the true distribution \(P\) (cf. (3)):

\[ \begin{split} c_P(\hat \theta _N) &= -\intop \text d P(y')\,\ln (q_{\hat \theta}(y'))\\ & \approx -\mathbb E'(\ln q_{\theta^*})+\frac{1}{2}(\hat \theta-\theta ^*)^TJ (\hat \theta-\theta ^*)\\ & \approx -\mathbb E'(\ln q_{\theta^*})+\frac{1}{2}U^TJ^{-1}U, \end{split} \] where \(\mathbb E'\) denotes expectation with respect to a new observation \(y' \sim P\); the first order term vanishes because \(c_P'(\theta ^*)=0\), and the last line uses the approximation (7). Taking the expectation with respect to the training dataset, and noting that \(\mathbb E(UU^T)=K/N\), we get:

\[ \mathbb E (c_P(\hat \theta _N))\approx -\mathbb E'(\ln q_{\theta^*})+\frac{1}{2N}\text {Tr}(J^{-1}K) \tag{13} \] Now consider the in-sample estimate:

\[ \begin{split} c_{\hat P _N}(\hat \theta _N) &= -\frac{1}{N}\sum _{i=1}^N\ln q_{\hat \theta}(Y_i)\\ & \approx - \frac{1}{N}\sum _{i=1} ^N \ln q_{\theta^*}(Y_i)- U^T(\hat \theta _N-\theta^*)+ \frac{1}{2}(\hat \theta _N-\theta^*)^TJ(\hat \theta _N-\theta^*)\\ & \approx - \frac{1}{N}\sum _{i=1} ^N \ln q_{\theta^*}(Y_i)- U^TJ ^{-1} U+ \frac{1}{2}U^TJ ^{-1}J J^{-1}U\\ & = - \frac{1}{N}\sum _{i=1} ^N \ln q_{\theta^*}(Y_i)- \frac{1}{2}U^TJ ^{-1} U, \end{split} \] where we approximated the empirical Hessian \(\frac{1}{N}\sum_{i=1}^N I_{\theta^*}(Y_i)\) by \(J\) in the second line. Taking the expectation:

\[ \mathbb E (c_{\hat P _N}(\hat \theta _N)) \approx -\mathbb E'(\ln q_{\theta^*})-\frac{1}{2N}\text{Tr}(J^{-1}K)\tag{14} \] Comparing Eqs. (14) and (13) we see that:

\[ \text{TIC}\equiv -\frac{1}{N}\sum _{i=1}^N\ln q_{\hat \theta}(Y_i)+\frac{1}{N}\text{Tr}(J^{-1}K)\tag{15} \] provides an asymptotically unbiased estimate of \(\mathbb E (c_P(\hat \theta _N))\), the expected cross-entropy of a model from family \(\mathcal Q\) estimated on a sample of \(N\) observations. (In practice, \(J\) and \(K\) are replaced by the plugin estimates (11).)
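The sketch below evaluates (15) in the same toy setting, with \(J\) and \(K\) replaced by the plugin estimates (11). The held-out sample at the end approximates \(c_P(\hat\theta_N)\) for this particular fit and is only a rough sanity check; as before, every concrete choice is illustrative.

```python
# TIC (15) for the exponential working model, with plugin J and K (scalar case),
# compared with the cross-entropy of the fitted model on a large held-out sample.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
p = stats.gamma(a=2.0)
N = 200

y = p.rvs(size=N, random_state=rng)
theta_hat = 1.0 / y.mean()

neg_log_lik = -(np.log(theta_hat) - theta_hat * y)  # -log q_{theta_hat}(Y_i)
J_hat = 1.0 / theta_hat**2
K_hat = np.mean((1.0 / theta_hat - y) ** 2)

tic = neg_log_lik.mean() + (K_hat / J_hat) / N  # Eq. (15)
aic_like = neg_log_lik.mean() + 1.0 / N         # Tr(J^-1 K) -> dim(theta) = 1 if P were in Q

y_test = p.rvs(size=200_000, random_state=rng)  # approximates c_P(theta_hat)
test_ce = -(np.log(theta_hat) - theta_hat * y_test).mean()
print(tic, aic_like, test_ce)
```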

The previous derivation assumed the \(Y_i\) to be i.i.d. and does not apply, strictly speaking, to the case of regression, for which we need some more machinery. Assume that the pairs \((X_i,\,Y_i)\) are drawn independently from a joint \(X\)-\(Y\) distribution. In place of (1) and (2), we define:

\[ \begin{split} \theta ^*(P;\mathbf X)&=\arg\min _{\theta} \frac{1}{N}\sum_{i=1}^N\intop \text dP(y\vert X_i)\,\ln \left(\frac{1}{q_{\theta}(y\vert X_i)}\right),\\ \theta ^*(P)&=\arg\min _{\theta} \intop \text dP(y,x)\,\ln \left(\frac{1}{q_{\theta}(y\vert x)}\right),\\ \hat \theta _N&=\arg\max _{\theta} \sum _{i=1}^N\ln q_{\theta}(Y_i\vert X_i) \end{split}\tag{16} \] Noticing that \(\hat \theta _N\) is a plugin estimate of \(\theta ^*(P)\), we can repeat mutatis mutandis the steps leading to the CLT (9), which is also valid in this case.

Rather than doing so, let us consider \(\hat \theta _N\) as the \(\mathbf X\)-conditional plugin estimate of \(\theta ^*(P;\mathbf X)\), and the latter as a plugin estimate of \(\theta ^*(P)\) interpreted as a functional of the \(X\) marginal distribution. Then, a parallel derivation to the one provided above for the i.i.d. case shows the conditional convergence in distribution:

\[ \sqrt N(\hat \theta _N - \theta ^*(P;\mathbf X))\overset{d \vert \mathbf X}{\to} \mathcal N(0, J_{N}^{-1}(\mathbf X)K_{N}(\mathbf X)J_{N}^{-1}(\mathbf X))\tag{17}. \] as well as the unconditional convergence:

\[ \sqrt N(\theta ^*(P;\mathbf X) - \theta ^*(P))\overset{d }{\to} \mathcal N(0, J^{-1}\tilde K J^{-1})\tag{18}. \]

where the various matrices are defined as:

\[ \begin{split} J_N(\mathbf X)&\equiv \frac{1}{N}\sum _{i=1}^N\mathbb E\left[I _{\theta} \bigg\vert X=X_i\right]\bigg\vert_{\theta = \theta ^*(\mathbf X)},\\ \quad K_N(\mathbf X)&\equiv\frac{1}{N}\sum _{i=1}^N\mathbb V\left[u _{\theta }\bigg\vert X=X_i\right]\bigg\vert_{\theta = \theta ^*(\mathbf X)} \end{split} \tag{19}. \] and: \[ \begin{split} J&\equiv \mathbb E\left[I_{\theta^*} \right],\\ \quad \tilde K&\equiv\mathbb V\left[\mathbb E\left(u_{\theta ^*} \vert X\right)\right] \end{split} \tag{20}. \] Here \(I_\theta\) and \(u_\theta\) are again defined as in (5), but regarded as functions of the random pair \(\{(X,\,Y)\}\), rather than \(Y\) alone. Although Eqs. (19) are written for \(\theta = \theta ^*(\mathbf X)\), to the order of the present approximation we may as well substitute \(\theta ^*(\mathbf X) \approx \theta ^*\). Doing this, we can easily see that \(J_N(\mathbf X) \to J\), and \(K_N(\mathbf X) \to \mathbb E\left[\mathbb V\left(u_{\theta } \vert X\right)\right]\bigg\vert_{\theta = \theta ^*}\). This can be used to find the unconditional variance of \(\hat \theta _N\):

\[ \begin{split} \mathbb V(\hat \theta _N) &=\mathbb E (\mathbb V(\hat \theta _N \vert \mathbf X))+\mathbb V (\mathbb E(\hat \theta _N \vert \mathbf X))\\ &\approx\mathbb E (\mathbb V(\hat \theta _N \vert \mathbf X))+\mathbb V (\theta ^*(P;\mathbf X))\\ &\approx\frac{1}{N}J^{-1}\left(\mathbb E\left[\mathbb V\left(u_{\theta ^*} \vert X\right)\right]+\mathbb V\left[\mathbb E\left(u_{\theta ^*} \vert X\right)\right]\right)J^{-1}\\ &= \frac{1}{N}J^{-1} KJ^{-1}, \end{split} \] with \(K = \mathbb V(u_{\theta^*})\) as in the i.i.d. case (by the law of total variance), in agreement with the CLT (9). Our derivation here shows how the variance of \(\hat \theta _N\) decomposes into a component due to the variability of \(X\), and a component due to the residual variability of \(Y\) given \(X\).
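To make the regression case concrete, here is a hypothetical sketch with a Gaussian working model (linear mean, fixed unit variance, so that \(\theta\) reduces to the coefficient vector) fit to heteroskedastic data. The ML estimate is then ordinary least squares, and (12) reduces to the familiar heteroskedasticity-robust covariance; the Monte Carlo replication is again just a sanity check.

```python
# Sandwich covariance in the regression setting: Gaussian working model with
# linear mean and unit variance, fit to data whose noise level depends on x.
import numpy as np

rng = np.random.default_rng(4)
N, n_rep = 400, 2000
beta_true = np.array([1.0, 2.0])

def simulate():
    x = rng.uniform(0, 2, size=N)
    X = np.column_stack([np.ones(N), x])
    y = X @ beta_true + rng.normal(scale=0.5 + x, size=N)  # noise s.d. depends on x
    return X, y

def ols(X, y):  # ML estimate of the coefficients under the working model
    return np.linalg.solve(X.T @ X, X.T @ y)

X, y = simulate()
resid = y - X @ ols(X, y)

J_hat = X.T @ X / N                                        # plugin for J = E[X X^T]
K_hat = (X * resid[:, None]).T @ (X * resid[:, None]) / N  # plugin for K = V(u)
cov_sandwich = np.linalg.inv(J_hat) @ K_hat @ np.linalg.inv(J_hat) / N  # Eq. (12)

betas = np.array([ols(*simulate()) for _ in range(n_rep)])
print(np.sqrt(np.diag(cov_sandwich)))  # sandwich standard errors
print(betas.std(axis=0))               # Monte Carlo standard errors
```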

The corresponding result for the TIC (15) is slightly less straightforward. Repeating the steps leading to this equation for a fixed sample of regressors \(\mathbf X\), we find that:

\[ \mathbb E (\text{TIC}\vert \mathbf X)\approx\intop \prod_{i=1}^N\text dP(y_i\vert X_i)\,\,\frac{1}{N}\sum_{j=1}^N\intop \text dP(y^\prime\vert X_j)\ln \left(\frac{1}{q_{\hat \theta_N}(y^\prime \vert X_j)}\right),\tag{21} \] where the outer integral is a conditional expectation over the sample responses, while the inner integrals are expectations with respect to a new response associated with the sample regressor \(X_j\). If we now average over \(\mathbf X\), we find:

\[ \mathbb E (\text{TIC})\approx\intop \prod_{i=1}^N\text dP(x_i,y_i)\,\,\frac{1}{N}\sum_{j=1}^N\intop \text dP(y^\prime\vert x_j)\ln \left(\frac{1}{q_{\hat \theta_N}(y^\prime \vert x_j)}\right)=\mathbb E(\text{CE}_\text{in}).\tag{22} \] The right-hand side is the expected in-sample cross-entropy, which is in general different from the expected extra-sample cross-entropy:

\[ \mathbb E(\text{CE}) =\intop \prod_{i=1}^N\text dP(x_i,y_i)\intop \text dP(x^\prime,y^\prime)\ln \left(\frac{1}{q_{\hat \theta_N}(y^\prime \vert x^\prime)}\right). \tag{23} \]
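A rough Monte Carlo illustration of the gap between (22) and (23). All concrete choices (lognormal regressors, \(N = 20\), a correctly specified Gaussian working model with unit variance) are arbitrary and only meant to make the difference visible at small sample sizes; the expectations over new responses are computed in closed form, so only the training sets are simulated.

```python
# Expected in-sample (22) vs extra-sample (23) cross-entropy for a Gaussian
# linear model.  With y|x ~ N(x @ beta, 1) and fitted coefficients b, the
# conditional cross-entropy at a point x is
#   E[-log q_b(Y|x)] = 0.5*log(2*pi) + 0.5*(1 + (x @ (b - beta))**2);
# averaging it over the training x_j gives CE_in, over a fresh x gives CE.
import numpy as np

rng = np.random.default_rng(5)
N, n_rep = 20, 10_000
beta = np.array([1.0, 2.0])
const = 0.5 * np.log(2 * np.pi) + 0.5  # terms not involving the fitted coefficients

# population second moment of the regressors, estimated from a large pool
x_pool = rng.lognormal(size=1_000_000)
X_pool = np.column_stack([np.ones_like(x_pool), x_pool])
S = X_pool.T @ X_pool / len(x_pool)

ce_in, ce_out = [], []
for _ in range(n_rep):
    x = rng.lognormal(size=N)
    X = np.column_stack([np.ones(N), x])
    y = X @ beta + rng.normal(size=N)
    d = np.linalg.solve(X.T @ X, X.T @ y) - beta       # theta_hat - theta*
    ce_in.append(const + 0.5 * d @ (X.T @ X / N) @ d)  # average over the training x_j
    ce_out.append(const + 0.5 * d @ S @ d)             # average over a fresh x

print(np.mean(ce_in), np.mean(ce_out))  # the extra-sample average is larger
```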


  1. The definition does not depend on the representations \(q_\theta = \frac{\text d Q_\theta}{\text d \mu}\) chosen for the \(\mu\)-density of \(Q_\theta\) if \(P\) is also absolutely continuous with respect to \(\mu\), which we tacitly assume. Typically \(\mu\) would be some relative of Lebesgue or counting measures, in continuous and discrete settings respectively.↩︎

  2. As a random variable, \(\hat \theta _N\) is also independent (modulo a measure zero set) of the specific \(L_1\) representation \(q_\theta\) if \(P\) is absolutely continuous with respect to \(\mu\).↩︎
