Mathematicals details on Selective Inference, model misspecification and coverage guarantees.

In a previous post I introduced the problem of
Selective Inference and illustrated, in a simplified setting, how selection
generally affects the coverage of confidence intervals - when they are both
*selected* and *constructed* using the same data. While the example was
(hopefully) helpful to build some intuition, in order to discuss
*“How to get away with selection”* in a comprehensive manner we need to make a
few clarifications. In particular, we need to answer the following questions:

- What is the target of our Selective Inference?
- What statistical properties would we like our inferences to have?

Searching through the literature, I realized there exist a bunch of variations on these two themes, which give rise to different mathematical formalisms. Specifying these points is mandatory for any further discussion, so my main goal here is to present these different points of view and explain some of their pros and cons.

In order to avoid getting carried away with too much abstraction, I will focus
on a specific type of problem, that is *parameter estimation in regression*. As
far as I can tell, this represents no serious loss in generality, and most of
the notions I’m going to outline would carry over to more general problems in a
straightforward manner.

Broadly speaking, the goal of regression is to understand the dependence of a set of random variables \(Y\) from another set of random variables \(X\). More precisely, we’re interested in the conditional probability distribution of \(Y\), conditioned on the observation of \(X\), which can always be represented as:

\[
Y = f(X)+\varepsilon,\qquad \mathbb E(\varepsilon|X)\equiv 0.
\tag{1}
\]
where \(f(X) = \mathbb E(Y|X)\) is the conditional mean of \(Y|X\), and \(\varepsilon\)
is a random variable with vanishing conditional mean, sometimes called the “error term”.
*Parameter estimation* means that we have (somehow) chosen functional forms for
the conditional mean and for the probability distribution of the error term,
and we want to provide estimates for the parameters defining these two functions.

Now, in many applications we actually don’t have much insight about the correct functional form \(f\), nor of the distribution of the error term \(\varepsilon\). Given a dataset of experimental observations of \(Y\) and \(X\), we are thus faced with two tasks:

*Selection.*Choose an adequate model \(\hat M = (\hat f,\,\hat \varepsilon)\) for the true \(f\) and \(\varepsilon\), usually from a (more or less) pre-specified family of initial guesses \(\mathcal M =\{(f_i,\varepsilon_i)\}_i\), using a (more or less) pre-specified criterion.*Post-Selection Inference.*Perform inference with the chosen model. In the study case we’re considering, this amounts to provide confidence intervals for model parameters.

It is, of course, the need to use the same data for both tasks which gives rise to complications.

We now come to the first question raised in the Introduction, regarding the
nature of the inferential target. And now more concretely:
*what are the true values of the parameters we’re trying to estimate?*
One can appreciate that the answer necessarily depends on how we consider the
final output of the modeling procedure:

*(Model Trusting)*As the*true*data generating process, or*(Assumption Lean)*As an*approximation*of the (partially or totally unknown) data generating process, chosen in a data-driven fashion within a family of initial guesses \(\mathcal M\).

According to the first interpretation, there’s no room for ambiguity: the
targets of our estimates should clearly be the true parameter values,
whose definition does not depend on any modeling choice. The second
interpretation, on the other hand, leaves a certain amount of
freedom in this respect. Here, I will follow the point of view advocated by
(Berk et al. 2013), according to which the target parameters
are those providing the *best approximation*^{1} to the true data generating
process, according to the functional form chosen in the selection stage.

I believe both positions have their merits and flaws, and which one is more
appropriate largely depends on context. In a reductionist field like
High Energy Physics, whose eventual goal is to explain the fundamental laws of
Nature, the *Model Trusting* point of view is usually taken,
and with good reason. When studying more emergent phenomena, on the other hand,
the quest for fundamental laws is often meaningless (or at best wishful
thinking), and the *Assumption Lean* standpoint looks more reasonable. In any
case, here the differences are not merely philosophical ones, as the two
interpretations give rise to different mathematical formalisms.

In the following posts, I will be mostly focusing on the *Assumption Lean*
point of view. In my opinion, this has two big advantages^{2}:

*Conceptual:*Inferences have a well-defined meaning even when the model is misspecified^{3}- which, apart from quite particular cases (see above), accounts for the great majority of cases encountered by data analysts in the practice.*Mathematical:*It allows to reduce the problem of*selective*inference to that of*simultaneous*inference (more on this below). For the latter type of problems, the theory of multiple testing readily provides at least conservative bounds.

In addition to the conceptual distinction about the interpretation of the selected model, there is also a technical distinction regarding the type of coverage guarantees that selective confidence intervals should be endowed with (this is the concrete version of the second question posed in the Introduction).

Here are some of the notions of coverage I’ve come across:

*Marginal coverage over the selected parameters.*We bound at level \(\alpha\) the probability that our procedure constructs any non-covering confidence interval for model parameters \(\beta_i\). Denote by \(\widehat M\) the selected model and, with abuse of notation, the corresponding set of selected parameters. If \(\widehat{\text{CI}}_i\) are the confidence intervals for parameters \(\beta _i\), we require: \[ \text{Pr}(\beta _i \in \widehat{\text{CI}}_i\,\,\forall i \in \widehat M) \geq 1-\alpha \tag{2} \]*Conditional coverage over the selected parameters*. We bound at level \(\alpha\) the*conditional*probability of constructing a non-covering confidence interval, conditioned on the outcome of selection \(\widehat M\). If \(m\) is the selected model, we require: \[ \text{Pr}(\beta _i \in \widehat{\text{CI}}_i\,\,\forall i \in m|\,\widehat M=m) \geq 1-\alpha \tag{3} \]*False Coverage Rate*. We bound at level \(q\)^{4}the expected fraction of non-covering confidence intervals out of all intervals constructed: \[ \mathbb E \left( \dfrac{|i \in \widehat M \colon \ \beta_i \in \widehat{\text{CI}}_i|}{|\widehat M|} \right) \geq1-q \tag{4} \] where \(|S|\) denotes the cardinality of a set \(S\).

Notice that the random variables in the previous equations are \(\widehat M\) and \(\widehat{\text{CI}}_i\) (denoted by a hat), whereas the true coefficients \(\beta_i\) and the selected set \(m\) in the case of conditional coverage (Eq. (3)) are fixed quantities. Variations of these measures focusing on single coefficients are also possible.

In practice, in the *Assumption Lean* framework I just introduced,
all these coverage measures would *not* be computed
under the selected model’s probability distribution, but rather under a
pre-fixed, more general model for the true probability distribution of \(Y\)
conditional on \(X\). We may, for instance, assume that the true error term
\(\varepsilon\) in Eq. (1) is gaussian with constant
(\(X\)-independent) variance, without making any further assumption on \(f(X)\).
With enough data, we may even be able to bypass any assumption at all, and
compute all relevant quantiles using a bootstrap (Kuchibhotla et al. 2020).

In the *Model Trusting* framework, on the other hand, the conditional coverage
measure would be computed *under the selected model*… and I’m honestly not
sure whether it’s possible to make sense of the other two measures in this
framework.

The connection between selective and simultaneous inference can now be
understood, through the notion of marginal coverage. In fact, suppose that we
were able to provide simultaneous coverage for *all* parameters
(selected or not):

\[ \text{Pr}(\beta _i \in \widehat{\text{CI}}_i\,\,\forall i) \geq 1-\alpha \tag{5} \] Then, it’s easy to see that the same confidence interval would also provide marginal coverage over the selected parameters. In order to see that, simply observe that the simultaneous coverage event can be decomposed as:

\[ (\beta _i \in \widehat {\text{CI}}_i\,\,\forall i) = (\beta _i \in \widehat{\text{CI}}_i\,\,\forall i \in \widehat M) \cap (\beta _i \in \widehat{\text{CI}}_i\,\,\forall i \notin \widehat M) \] which implies that:

\[
\text{Pr}(\beta _i \in \widehat{\text{CI}}_i\,\,\forall i \in \widehat M) \geq \text{Pr}(\beta _i \in \widehat{\text{CI}}_i\,\,\forall i) \geq 1-\alpha,
\tag{6}
\]
that is simultaneous coverage implies marginal coverage over the
selected parameters. In fact, with a few more set-theory manipulations,
one can arrive to a powerful Lemma (see Kuchibhotla et al. 2020 for details):
controlling the marginal coverage (2) at level \(\alpha\)
for *any* model selection procedure^{5} is equivalent to controlling
simultaneous coverage for all possible model selections.

This provides us a first, very simple recipe for selective inference, which can
be applied whenever one is able to construct confidence intervals for parameters
in the absence of selection: use any procedure (e.g.
Bonferroni corrections)
which controls simultaneous coverage for *all* parameters we may select a priori.

This was a long and somewhat abstract post, so perhaps the best way to conclude is with some bottom lines:

- When performing model-based inference, nothing forces us to make working
hypotheses about the correctness of the model we arrive at. Not making such
assumptions corresponds to what I called an
*Assumption Lean*framework. - In an
*Assumption Lean*framework, the inferential targets are, in general, the best approximations to the truth allowed by the selected model. - There exist many type of coverage guarantees for selective confidence intervals.
- Bounding the probability of any false coverage statement
(“marginal coverage over the selected parameters”) allows to turn a problem of
selective inference into one of
*simultaneous*inference.

In particular, it is worth to mention that the last observation lead us to a simple recipe for constructing (somewhat conservative, but valid) selective confidence intervals with marginal coverage. In the posts which follow, I will discuss some more advanced methods which produce confidence intervals satisfying the requirements discussed here.

Benjamini, Yoav, and Daniel Yekutieli. 2005. “False Discovery Rate–Adjusted Multiple Confidence Intervals for Selected Parameters.” *Journal of the American Statistical Association* 100 (469): 71–81.

Berk, Richard, Lawrence Brown, Andreas Buja, Kai Zhang, and Linda Zhao. 2013. “Valid Post-Selection Inference.” *The Annals of Statistics*, 802–37.

Kuchibhotla, Arun K, Lawrence D Brown, Andreas Buja, Junhui Cai, Edward I George, and Linda H Zhao. 2020. “Valid Post-Selection Inference in Model-Free Linear Regression.” *The Annals of Statistics* 48 (5): 2953–81.

Where what’s to be considered

*best*is defined in terms of some reasonable metric. For instance, for the conditional mean \(f(X)\) of a continuous response \(Y\), a convenient target \(f^*(X)\) within a prescribed family of functions \(\mathcal F\) can be defined by \(f^* =\arg\min _{\phi \in \mathcal F} \mathbb E (\vert f(X) - \phi (X)\vert^2)\).↩︎There’s also a third advantage, which is that I find much harder to think about selective inference from the Model Trusting point of view, hence to write blog posts about it - but that’s likely a limitation of my imagination, rather than of the point of view itself.↩︎

A cool word for “wrong”.↩︎

Why \(q\) and not \(\alpha\)? Ask (Benjamini and Yekutieli 2005).↩︎

It is assumed that the selection is performed from a from a

*fixed*family of models \(\mathcal M\).↩︎

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Text and figures are licensed under Creative Commons Attribution CC BY-SA 4.0. Source code is available at https://github.com/vgherard/vgherard.github.io/, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

For attribution, please cite this work as

vgherard (2022, Nov. 25). Valerio Gherardi: How to get away with selection. Part II: Mathematical Framework. Retrieved from https://vgherard.github.io/posts/2022-11-07-posi-2/

BibTeX citation

@misc{vgherard2022how, author = {vgherard, }, title = {Valerio Gherardi: How to get away with selection. Part II: Mathematical Framework}, url = {https://vgherard.github.io/posts/2022-11-07-posi-2/}, year = {2022} }