Introduction
In a previous post I introduced the problem of Selective Inference and illustrated, in a simplified setting, how selection generally affects the coverage of confidence intervals when the same data are used both to select the parameters of interest and to construct the intervals. While the example was (hopefully) helpful to build some intuition, in order to discuss “How to get away with selection” in a comprehensive manner we need to make a few clarifications. In particular, we need to answer the following questions:
- What is the target of our Selective Inference?
- What statistical properties would we like our inferences to have?
Searching through the literature, I realized that there exists a whole bunch of variations on these two themes, which give rise to different mathematical formalisms. Pinning these points down is necessary for any further discussion, so my main goal here is to present these different points of view and explain some of their pros and cons.
Mathematical Framework
Regression and parameter estimation
In order to avoid getting carried away with too much abstraction, I will focus on a specific type of problem, that is parameter estimation in regression. As far as I can tell, this represents no serious loss in generality, and most of the notions I’m going to outline would carry over to more general problems in a straightforward manner.
Broadly speaking, the goal of regression is to understand the dependence of a set of random variables \(Y\) on another set of random variables \(X\). More precisely, we’re interested in the conditional probability distribution of \(Y\), conditioned on the observation of \(X\), which can always be represented as:
\[ Y = f(X)+\varepsilon,\qquad \mathbb E(\varepsilon|X)\equiv 0. \tag{1}\]
where \(f(X) = \mathbb E(Y|X)\) is the conditional mean of \(Y|X\), and \(\varepsilon\) is a random variable with vanishing conditional mean, sometimes called the “error term”. Parameter estimation means that we have (somehow) chosen functional forms for the conditional mean and for the probability distribution of the error term, and we want to provide estimates for the parameters defining these two functions.
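To make Equation 1 a bit more tangible, here is a minimal simulation (the choice of \(f\), the noise level and all variable names are mine, purely for illustration): binned averages of \(Y\) recover the conditional mean \(f(X)\), and the residuals \(Y - f(X)\) average to zero, as Equation 1 prescribes.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data generating process: Y = f(X) + eps, with
# f(X) = sin(2*pi*X) and Gaussian errors of constant variance.
def f(x):
    return np.sin(2 * np.pi * x)

n = 100_000
x = rng.uniform(0, 1, size=n)
y = f(x) + rng.normal(scale=0.3, size=n)

# The conditional mean E(Y|X) can be approximated by binning X
# and averaging Y within each bin:
bins = np.linspace(0, 1, 21)
idx = np.digitize(x, bins) - 1
cond_mean = np.array([y[idx == k].mean() for k in range(20)])
centers = (bins[:-1] + bins[1:]) / 2

print(np.max(np.abs(cond_mean - f(centers))))  # small: E(Y|X) ~ f(X)
print(np.mean(y - f(x)))                       # ~ 0: E(eps) = 0
```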
Enter selection
Now, in many applications we actually don’t have much insight into the correct functional form \(f\), nor into the distribution of the error term \(\varepsilon\). Given a dataset of experimental observations of \(Y\) and \(X\), we are thus faced with two tasks:
Selection. Choose an adequate model \(\hat M = (\hat f,\,\hat \varepsilon)\) for the true \(f\) and \(\varepsilon\), usually from a (more or less) pre-specified family of initial guesses \(\mathcal M =\{(f_i,\varepsilon_i)\}_i\), using a (more or less) pre-specified criterion.
Post-Selection Inference. Perform inference with the chosen model. In the study case we’re considering, this amounts to providing confidence intervals for model parameters.
It is, of course, the need to use the same data for both tasks which gives rise to complications.
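As a cartoon of this two-step pipeline (the quadratic truth, the AIC criterion and all the numbers below are arbitrary choices of mine, not a recommendation): we select a polynomial degree, then compute naive 95% intervals for the selected model’s coefficients, reusing the same data twice.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-1, 1, n)
y = 1.0 + 0.5 * x**2 + rng.normal(scale=1.0, size=n)  # hypothetical truth

# Selection: pick the polynomial degree minimizing a crude AIC.
def aic(deg):
    coef = np.polyfit(x, y, deg)
    rss = np.sum((y - np.polyval(coef, x)) ** 2)
    return n * np.log(rss / n) + 2 * (deg + 1)

best_deg = min(range(1, 6), key=aic)

# Post-Selection Inference: naive 95% CIs for the selected model's
# coefficients, computed on the SAME data - precisely the practice
# whose coverage properties are at stake here.
X = np.vander(x, best_deg + 1)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
sigma2 = np.sum((y - X @ coef) ** 2) / (n - X.shape[1])
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
print(best_deg, [(c - 1.96 * s, c + 1.96 * s) for c, s in zip(coef, se)])
```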
Inferential target
We now come to the first question raised in the Introduction, regarding the nature of the inferential target. More concretely: what are the true values of the parameters we’re trying to estimate? One can appreciate that the answer necessarily depends on how we regard the final output of the modeling procedure:
- (Model Trusting) As the true data generating process, or
- (Assumption Lean) As an approximation of the (partially or totally unknown) data generating process, chosen in a data-driven fashion within a family of initial guesses \(\mathcal M\).
According to the first interpretation, there’s no room for ambiguity: the targets of our estimates should clearly be the true parameter values, whose definition does not depend on any modeling choice. The second interpretation, on the other hand, leaves a certain amount of freedom in this respect. Here, I will follow the point of view advocated by (Berk et al. 2013), according to which the target parameters are those providing the best approximation1 to the true data generating process, according to the functional form chosen in the selection stage.
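To make the Assumption Lean target concrete, here is a sketch under assumptions of my own choosing: the truth is \(\mathbb E(Y|X) = e^X\) with \(X\) standard normal, while the working model is linear with an intercept. The target of the linear fit is then the population least-squares projection, which in this toy case is even available in closed form:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical truth: E(Y|X) = exp(X), with X ~ N(0, 1).
n = 1_000_000
x = rng.normal(size=n)
y = np.exp(x) + rng.normal(size=n)

# Assumption Lean target of a *linear* working model with intercept:
# (a*, b*) = argmin E(Y - a - b*X)^2, the population projection.
# OLS on a huge sample consistently estimates this target:
X = np.column_stack([np.ones(n), x])
beta_star = np.linalg.solve(X.T @ X / n, X.T @ y / n)

# Closed form for this truth: b* = Cov(X, e^X)/Var(X) = E(X e^X)
# = E(e^X) = sqrt(e) by Stein's lemma, and a* = E(e^X) - b*E(X)
# = sqrt(e) as well.
print(beta_star)    # both entries ~ 1.6487...
print(np.exp(0.5))  # = sqrt(e)
```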
I believe both positions have their merits and flaws, and which one is more appropriate largely depends on context. In a reductionist field like High Energy Physics, whose eventual goal is to explain the fundamental laws of Nature, the Model Trusting point of view is usually taken, and with good reason. When studying more emergent phenomena, on the other hand, the quest for fundamental laws is often meaningless (or at best wishful thinking), and the Assumption Lean standpoint looks more reasonable. In any case, here the differences are not merely philosophical ones, as the two interpretations give rise to different mathematical formalisms.
In the following posts, I will be mostly focusing on the Assumption Lean point of view. In my opinion, this has two big advantages2:
- Conceptual: Inferences have a well-defined meaning even when the model is misspecified3 - which, apart from quite particular cases (see above), accounts for the great majority of cases encountered by data analysts in practice.
- Mathematical: It allows us to reduce the problem of selective inference to that of simultaneous inference (more on this below). For the latter type of problem, the theory of multiple testing readily provides at least conservative bounds.
Notions of coverage
In addition to the conceptual distinction about the interpretation of the selected model, there is also a technical distinction regarding the type of coverage guarantees that selective confidence intervals should be endowed with (this is the concrete version of the second question posed in the Introduction).
Here are some of the notions of coverage I’ve come across:
- Marginal coverage over the selected parameters. We bound at level \(\alpha\) the probability that our procedure constructs any non-covering confidence interval for model parameters \(\beta_i\). Denote by \(\widehat M\) the selected model and, with abuse of notation, the corresponding set of selected parameters. If \(\widehat{\text{CI}}_i\) are the confidence intervals for parameters \(\beta _i\), we require:
\[ \text{Pr}(\beta _i \in \widehat{\text{CI}}_i\,\,\forall i \in \widehat M) \geq 1-\alpha \tag{2}\]
- Conditional coverage over the selected parameters. We bound at level \(\alpha\) the conditional probability of constructing a non-covering confidence interval, conditioned on the outcome of selection \(\widehat M\). If \(m\) is the selected model, we require:
\[ \text{Pr}(\beta _i \in \widehat{\text{CI}}_i\,\,\forall i \in m|\,\widehat M=m) \geq 1-\alpha \tag{3}\]
- False Coverage Rate. We bound at level \(q\)4 the expected fraction of non-covering confidence intervals out of all intervals constructed:
\[ \mathbb E \left( \dfrac{|\{i \in \widehat M \colon \beta_i \in \widehat{\text{CI}}_i\}|}{|\widehat M|} \right) \geq 1-q \tag{4}\]
where \(|S|\) denotes the cardinality of a set \(S\).
Notice that the random variables in the previous equations are \(\widehat M\) and \(\widehat{\text{CI}}_i\) (denoted by a hat), whereas the true coefficients \(\beta_i\) and the selected set \(m\) in the case of conditional coverage (Equation 3) are fixed quantities. Variations of these measures focusing on single coefficients are also possible.
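A toy Monte Carlo may help to see (some of) these measures in action. I assume a linear truth here, so that the \(\beta_i\) are unambiguous, and a naive pipeline of my own invention: select the coefficients with \(|t| > 2\), then report uncorrected 95% intervals. Both the marginal coverage of Equation 2 and the coverage fraction of Equation 4 fall short of their nominal levels:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, reps = 100, 5, 2000
beta = np.array([0.0, 0.0, 0.3, 0.6, 1.0])  # hypothetical linear truth

marginal_hits, coverage_fractions = [], []
for _ in range(reps):
    X = rng.normal(size=(n, p))
    y = X @ beta + rng.normal(size=n)
    bhat, *_ = np.linalg.lstsq(X, y, rcond=None)
    s2 = np.sum((y - X @ bhat) ** 2) / (n - p)
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))

    # Selection: keep the coefficients that look "significant".
    M = np.where(np.abs(bhat / se) > 2)[0]
    if M.size == 0:
        continue  # convention: skip empty selections

    lo, hi = bhat - 1.96 * se, bhat + 1.96 * se  # naive 95% CIs
    covered = (lo[M] <= beta[M]) & (beta[M] <= hi[M])

    marginal_hits.append(covered.all())        # event of Equation 2
    coverage_fractions.append(covered.mean())  # ratio of Equation 4

print("marginal coverage:", np.mean(marginal_hits))            # < 0.95
print("mean coverage fraction:", np.mean(coverage_fractions))  # < 0.95
```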
In practice, in the Assumption Lean framework I just introduced, all these coverage measures would not be computed under the selected model’s probability distribution, but rather under a pre-fixed, more general model for the true probability distribution of \(Y\) conditional on \(X\). We may, for instance, assume that the true error term \(\varepsilon\) in Equation 1 is Gaussian with constant (\(X\)-independent) variance, without making any further assumption on \(f(X)\). With enough data, we may even be able to bypass any assumption at all, and compute all relevant quantiles using a bootstrap (Kuchibhotla et al. 2020).
In the Model Trusting framework, on the other hand, the conditional coverage measure would be computed under the selected model… and I’m honestly not sure whether it’s possible to make sense of the other two measures in this framework.
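Coming back to the bootstrap route mentioned above: just to give a flavor of it, here is a rough sketch of the generic max-\(t\) idea (emphatically not the actual procedure of Kuchibhotla et al. 2020, and with an arbitrary linear design of my own choosing). We resample the data in pairs, track the largest studentized deviation across all candidate parameters, and use its \((1-\alpha)\) quantile as a common critical value.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, B, alpha = 200, 4, 2000, 0.05

X = rng.normal(size=(n, p))
y = X @ np.array([0.0, 0.2, 0.5, 1.0]) + rng.normal(size=n)

def ols_with_se(X, y):
    """OLS point estimates and standard errors."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    s2 = resid @ resid / (X.shape[0] - X.shape[1])
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
    return b, se

bhat, se = ols_with_se(X, y)

# Pairs bootstrap of the maximum studentized deviation over ALL
# candidate parameters (not just the selected ones):
max_dev = np.empty(B)
for b in range(B):
    i = rng.integers(0, n, size=n)
    bb, _ = ols_with_se(X[i], y[i])
    max_dev[b] = np.max(np.abs(bb - bhat) / se)

# The (1 - alpha) quantile of the max replaces the usual 1.96:
K = np.quantile(max_dev, 1 - alpha)
cis = np.column_stack([bhat - K * se, bhat + K * se])
print(K, cis)  # K > 1.96: wider, simultaneity-adjusted intervals
```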
Selective vs. Simultaneous Inference
The connection between selective and simultaneous inference can now be understood through the notion of marginal coverage. In fact, suppose that we were able to provide simultaneous coverage for all parameters (selected or not):
\[ \text{Pr}(\beta _i \in \widehat{\text{CI}}_i\,\,\forall i) \geq 1-\alpha \tag{5}\]
Then, it’s easy to see that the same confidence intervals would also provide marginal coverage over the selected parameters. To see this, simply observe that the simultaneous coverage event can be decomposed as:
\[ (\beta _i \in \widehat {\text{CI}}_i\,\,\forall i) = (\beta _i \in \widehat{\text{CI}}_i\,\,\forall i \in \widehat M) \cap (\beta _i \in \widehat{\text{CI}}_i\,\,\forall i \notin \widehat M) \]
which implies that:
\[ \text{Pr}(\beta _i \in \widehat{\text{CI}}_i\,\,\forall i \in \widehat M) \geq \text{Pr}(\beta _i \in \widehat{\text{CI}}_i\,\,\forall i) \geq 1-\alpha, \tag{6}\]
that is, simultaneous coverage implies marginal coverage over the selected parameters. In fact, with a few more set-theoretic manipulations, one can arrive at a powerful lemma (see Kuchibhotla et al. 2020 for details): controlling the marginal coverage of Equation 2 at level \(\alpha\) for any model selection procedure5 is equivalent to controlling simultaneous coverage for all possible model selections.
This provides us with a first, very simple recipe for selective inference, which can be applied whenever one is able to construct confidence intervals for parameters in the absence of selection: use any procedure (e.g. Bonferroni corrections) which controls simultaneous coverage for all parameters we may select a priori.
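In code, the Bonferroni variant of this recipe might look as follows (the design, the selection rule and all numbers are placeholders of mine): compute level-\(\alpha/p\) intervals for all \(p\) candidate coefficients up front, so that whatever subset is later selected inherits marginal coverage at least \(1-\alpha\).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
n, p, alpha = 200, 6, 0.05

X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

bhat, *_ = np.linalg.lstsq(X, y, rcond=None)
s2 = np.sum((y - X @ bhat) ** 2) / (n - p)
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))

# Bonferroni: each of the p candidate intervals at level alpha/p
# gives simultaneous coverage >= 1 - alpha over ALL parameters...
z = norm.ppf(1 - alpha / (2 * p))
lo, hi = bhat - z * se, bhat + z * se

# ...hence marginal coverage >= 1 - alpha over the selected ones,
# no matter how wild the (data-dependent) selection rule is:
M = np.where(np.abs(bhat / se) > 2)[0]
for i in M:
    print(f"beta_{i}: [{lo[i]:.3f}, {hi[i]:.3f}]")
```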
Conclusions
This was a long and somewhat abstract post, so perhaps the best way to conclude is with some bottom lines:
- When performing model-based inference, nothing forces us to make working hypotheses about the correctness of the model we arrive at. Not making such assumptions corresponds to what I called an Assumption Lean framework.
- In an Assumption Lean framework, the inferential targets are, in general, the best approximations to the truth allowed by the selected model.
- There exist many types of coverage guarantees for selective confidence intervals.
- Bounding the probability of any false coverage statement (“marginal coverage over the selected parameters”) allows us to turn a problem of selective inference into one of simultaneous inference.
In particular, it is worth mentioning that the last observation led us to a simple recipe for constructing (somewhat conservative, but valid) selective confidence intervals with marginal coverage. In the posts which follow, I will discuss some more advanced methods producing confidence intervals that satisfy the requirements discussed here.
References
Benjamini, Yoav, and Daniel Yekutieli. 2005. “False Discovery Rate-Adjusted Multiple Confidence Intervals for Selected Parameters.” Journal of the American Statistical Association 100 (469): 71-81.
Berk, Richard, Lawrence Brown, Andreas Buja, Kai Zhang, and Linda Zhao. 2013. “Valid Post-Selection Inference.” The Annals of Statistics 41 (2): 802-837.
Kuchibhotla, Arun K., Lawrence D. Brown, Andreas Buja, Junhui Cai, Edward I. George, and Linda H. Zhao. 2020. “Valid Post-Selection Inference in Model-Free Linear Regression.” The Annals of Statistics 48 (5): 2953-2981.
Footnotes
Where what’s to be considered best is defined in terms of some reasonable metric. For instance, for the conditional mean \(f(X)\) of a continuous response \(Y\), a convenient target \(f^*(X)\) within a prescribed family of functions \(\mathcal F\) can be defined by \(f^* =\arg\min _{\phi \in \mathcal F} \mathbb E (\vert f(X) - \phi (X)\vert^2)\).↩︎
There’s also a third advantage, which is that I find it much harder to think about selective inference from the Model Trusting point of view, hence to write blog posts about it - but that’s likely a limitation of my imagination, rather than of the point of view itself.↩︎
A cool word for “wrong”.↩︎
Why \(q\) and not \(\alpha\)? Ask (Benjamini and Yekutieli 2005).↩︎
It is assumed that the selection is performed from a fixed family of models \(\mathcal M\).↩︎
Citation
@online{gherardi2022,
author = {Gherardi, Valerio},
title = {How to Get Away with Selection. {Part} {II:} {Mathematical}
{Framework}},
date = {2022-11-25},
url = {https://vgherard.github.io/posts/2022-11-07-posi-2/posi-2.html},
langid = {en}
}