I just came across (Kerridge 1963), an old paper whose main result falls under the umbrella of “frequentist properties of Bayesian inference”. Specifically, the theorem proved in this reference applies to sequential testing, a context in which the mechanics of Bayesian inference, with its naturally sequential updates, may be regarded as natural.
Suppose we wish to compare two hypotheses \(H_0\) and \(H_1\), where \(H_0\) is simple¹. We collect data until our sample meets some specific requirement, as prescribed by a given stopping rule \(S\). If this ever occurs, we compute the Bayes factor:
\[ B = \frac{\text{Pr}(\text {data} \vert H_0)}{\text{Pr}(\text {data} \vert H_1)}, \tag{1}\]
and reject \(H_0\) if \(B \leq b\), for some fixed \(b > 0\). The theorem is that, if \(H_0\) is the true data generating process, the above procedure has a false rejection rate of at most \(b\), regardless of the stopping rule employed to end sampling.
Notice that the rejection event consists of two parts:
- Sampling has stopped at some point during data taking.
- When sampling stopped, \(B \leq b\) held.
We also note that the stopping rule need not be deterministic, although this appears to be implicitly assumed in the original reference. In general, the data collected up to a certain point will only determine the probability that sampling stops at that time (and, to reinforce the previous point, these probabilities will not, in general, add up to \(1\)).
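Before moving to the proof, here is a minimal Monte Carlo sketch of the procedure (my own illustration, not taken from the original reference). We test the simple hypotheses \(H_0 \colon X_i \sim N(0,1)\) versus \(H_1 \colon X_i \sim N(1,1)\), using an arbitrary deterministic stopping rule (stop as soon as the running mean leaves \((-0.5,\,0.5)\), or after \(200\) observations), and estimate the false rejection rate under \(H_0\) for \(b = 0.1\):

```python
import numpy as np

rng = np.random.default_rng(42)

def one_run(b, n_max=200):
    """One sequential experiment under H0: N(0, 1), tested against H1: N(1, 1).
    Stop as soon as the running mean leaves (-0.5, 0.5), or at n = n_max;
    return True if H0 is rejected (B_n <= b) when sampling stops."""
    log_b, total = 0.0, 0.0
    for n in range(1, n_max + 1):
        x = rng.normal(0.0, 1.0)                 # data generated under H0
        log_b += 0.5 - x                         # log[f0(x) / f1(x)] = 0.5 - x
        total += x
        if abs(total / n) > 0.5 or n == n_max:   # stopping rule S
            return log_b <= np.log(b)

b, n_sim = 0.1, 20_000
rate = sum(one_run(b) for _ in range(n_sim)) / n_sim
print(f"estimated false rejection rate: {rate:.4f} (bound: {b})")
```

In runs of this sketch the estimated rate comes out far below \(b\): the theorem is a worst-case guarantee over all stopping rules, and is typically not tight for any fixed one.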
In order to prove this theorem, let us set up some notation. Let \((X_n)_{n\in \mathbb N}\) be a stochastic process representing “data”, where each \(X_n \in \mathcal X\) is a data point. We denote by \(P^{(0)}\) the probability distribution of the process under \(H_0\), which is completely specified since \(H_0\) is simple. We further denote by \(P_n ^{(0)}\) the corresponding probability measure on \(\mathcal X ^n\) for the first \(n\) observations \(X_1,\,X_2,\,\dots, \,X_n\).
We first consider the case in which \(H_1\) is also simple, and denote by \(P^{(1)}\) and \(P^{(1)}_n\) the corresponding measures. The Bayes factor is defined as the Radon-Nikodym derivative:
\[ B_n \equiv \frac{\text d P^{(0)}_n}{\text d P_n^{(1)}} \tag{2}\]
(we assume regularity conditions ensuring that this derivative exists; for i.i.d. data with densities \(f^{(0)}\) and \(f^{(1)}\), \(B_n\) is just the familiar likelihood ratio \(\prod_{i=1}^n f^{(0)}(X_i)/f^{(1)}(X_i)\)).
Also, we assume for the moment that the stopping rule is deterministic, embodied by binary functions \(S_n=S(X_1,\,X_2,\,\dots,X_n)\) of the first \(n\) observations, with \(S_n = 1\) if sampling can stop at step \(n\).
Now fix \(b>0\). A rejection of \(H_0\) at sampling step \(n\) is represented by the event:
\[ \mathcal R _{n}(b)\equiv \{B_n\leq b,\,S_n=1,\,S_i=0\,\text{ for }i<n\}, \tag{3}\]
which, with abuse of notation, we may identify with a subset of \(\mathcal X ^n\). The overall rejection event (at any sampling step) is given by:
\[ \mathcal R (b)\equiv \bigcup _{n=1} ^\infty \mathcal R_n(b), \tag{4}\]
so that our theorem amounts to the bound:
\[ \text{Pr}_{H_0}(\mathcal R(b))\leq b. \tag{5}\]
In order to prove this, we first note that:
\[ \text{Pr}_{H_0}(\mathcal R _n(b))= \intop _{\mathcal R _n(b)} \text d P^{(0)}_n= \intop _{\mathcal R _n(b)}B_n \,\text d P^{(1)}_n \leq b\intop _{\mathcal R _n(b)} \text d P^{(1)}_n=b\cdot\text{Pr}_{H_1}(\mathcal R _n(b)). \tag{6}\]
Hence, since the events \(\mathcal R _n(b)\) and \(\mathcal R _m(b)\) are clearly disjoint for \(n\neq m\), we have:
\[ \text{Pr}_{H_0}(\mathcal R(b))\leq b\cdot\text{Pr}_{H_1}(\mathcal R (b)), \tag{7}\]
which, since \(\text{Pr}_{H_1}(\cdot)\leq1\), implies Equation 5.
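As a numerical sanity check of Equation 7 (again my own sketch, reusing the hypotheses and stopping rule of the earlier example), one can estimate both sides of the inequality by running the same sequential experiment with data generated under \(H_0\) and under \(H_1\):

```python
import numpy as np

rng = np.random.default_rng(0)

def rejects(mean, b, n_max=200):
    """Sequential experiment of the earlier sketch, with data from N(mean, 1);
    return True if H0: N(0, 1) is rejected against H1: N(1, 1) at the stop."""
    log_b, total = 0.0, 0.0
    for n in range(1, n_max + 1):
        x = rng.normal(mean, 1.0)
        log_b += 0.5 - x                         # log Bayes factor update
        total += x
        if abs(total / n) > 0.5 or n == n_max:   # same stopping rule S as before
            return log_b <= np.log(b)

b, n_sim = 0.1, 50_000
p0 = sum(rejects(0.0, b) for _ in range(n_sim)) / n_sim  # estimate of Pr_H0(R(b))
p1 = sum(rejects(1.0, b) for _ in range(n_sim)) / n_sim  # estimate of Pr_H1(R(b))
print(f"Pr_H0(R) ≈ {p0:.4f}  <=  b * Pr_H1(R) ≈ {b * p1:.4f}")
```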
We may relax the assumption that the alternative hypothesis is simple by considering a parametric family of measures \((P^{(1)}_\theta)_{\theta \in \Theta}\), where the parameter \(\theta\) has some prior distribution \(\Phi(\theta)\). The argument given above still applies to this case, if \(P^{(1)}\) is taken to be the mixture \(P^{(1)} = \intop \text d \Phi(\theta)\, P^{(1)}_\theta\) (under appropriate regularity assumptions). In the notation of Equation 1, the denominator becomes \(\text {Pr}(\text {data} \vert H_1)\equiv \intop \text d \Phi(\theta)\,\text{Pr}(\text{data} \vert H_{1,\theta})\).
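To make the composite case concrete, here is a small worked example of my own (with an arbitrary conjugate prior choice): \(H_0 \colon X_i \sim N(0,1)\) versus \(H_1 \colon X_i \sim N(\theta, 1)\) with prior \(\theta \sim N(0, \tau ^2)\). The marginal likelihood under \(H_1\) is then available in closed form, and the Bayes factor after \(n\) observations with sum \(s_n = \sum _{i=1}^n X_i\) works out to \(B_n = \sqrt{1 + n\tau^2}\,\exp\left\{-\frac{\tau^2 s_n^2}{2(1 + n\tau^2)}\right\}\):

```python
import numpy as np

def bayes_factor(x, tau2=1.0):
    """B_n = Pr(x | H0) / Pr(x | H1) for H0: X_i ~ N(0, 1) against the composite
    H1: X_i ~ N(theta, 1) with conjugate prior theta ~ N(0, tau2)."""
    x = np.asarray(x, dtype=float)
    n, s = x.size, x.sum()
    return np.sqrt(1 + n * tau2) * np.exp(-tau2 * s**2 / (2 * (1 + n * tau2)))

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, size=100)  # data generated under H0
print(bayes_factor(x))              # under H0, usually well above small thresholds like 0.1
```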
Finally, in order to lift the assumption that our stopping rule is deterministic, let us first consider the following special (deterministic) stopping rule:
\[ S^*_n =1\iff B_n \leq b. \tag{8}\]
In other words, we stop sampling as soon as the current sample would reject \(H_0\), that is, as soon as \(B_n \leq b\). The rejection event \(\mathcal R(b)\) for this special stopping rule is simply:
\[ \mathcal R^*(b) \equiv \{B_n \leq b\text{ for some }n\in \mathbb N\}. \tag{9}\]
Since we already proved the theorem for any deterministic stopping rule, Equation 5 implies:
\[ \text {Pr}_{H_0}(\mathcal R^*(b)) \leq b. \tag{10}\]
But Equation 10 clearly implies the theorem for any stopping rule, deterministic or not, since in general:
\[ \mathcal R(b) \subseteq \mathcal R^*(b) \tag{11}\]
(we need \(B_n\leq b\) to hold for some \(n\in \mathbb N\) in order to reject \(H_0\)).
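Equation 10 itself can be checked numerically. The following sketch (mine, with the same illustrative normal hypotheses as in the previous examples and a finite horizon \(n_{\max}\), which can only underestimate the probability in Equation 10) simulates the rule \(S^*\) directly under \(H_0\):

```python
import numpy as np

rng = np.random.default_rng(7)

def ever_rejects(b, n_max=1000):
    """Worst-case rule S*: with data from H0: N(0, 1) tested against H1: N(1, 1),
    check whether B_n <= b for some n <= n_max (i.e. whether S* ever stops)."""
    log_b = np.cumsum(0.5 - rng.normal(0.0, 1.0, size=n_max))  # log B_1, ..., log B_{n_max}
    return bool((log_b <= np.log(b)).any())

b, n_sim = 0.1, 20_000
rate = sum(ever_rejects(b) for _ in range(n_sim)) / n_sim
print(f"Pr_H0(B_n <= {b} for some n <= 1000) ≈ {rate:.4f} (bound: {b})")
```

For this particular pair of hypotheses the estimate lands visibly closer to the bound than in the fixed-rule example above, as one would expect from the worst-case nature of \(S^*\).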
Interestingly, the argument just given leads to a more accurate statement of our main result Equation 5:
\[ \text{Pr}_{H_0}(\mathcal R(b))\leq \text {Pr}_{H_0}(B_n \leq b\text{ for some }n\in \mathbb N) \leq b, \tag{12}\]
where the leftmost quantity is the false rejection rate of a selective testing procedure, such as the one we have been considering so far, whereas the central quantity is the false rejection rate of a simultaneous testing procedure (one that checks whether \(B_n \leq b\) at each step of sampling). What’s happening here is analogous to a phenomenon observed in the context of parameter estimation following model selection (Berk et al. 2013): if the selection rule is allowed to be completely arbitrary, guaranteeing marginal coverage for the selected parameters actually requires simultaneous coverage for all possible parameters.
To conclude the post, let us remark that Equation 5 was originally formulated in terms of the posterior probability \(Q_n(\pi)\) of \(H_0\):
\[ Q_n(\pi) = \frac{\pi }{\pi +(1-\pi)B^{-1}_n}, \tag{13}\]
where \(\pi\) and \(1-\pi\) are the prior probabilities of the two competing models \(H_0\) and \(H_1\), respectively. We may use \(Q_n(\pi) \leq q\), rather than \(B_n \leq b\), as the criterion for rejecting \(H_0\). From a purely frequentist point of view, this adds nothing to our formulation in terms of the Bayes factor, since \(Q_n(\pi)\leq q\) is equivalent to \(B_n \leq b\) with \(b = \frac{q}{1-q}\frac{1-\pi}{\pi}\). In particular, the bound analogous to Equation 5 reads:
\[ \text{Pr}_{H_0}(\mathcal R(q))\leq \text {Pr}_{H_0}(Q_n(\pi) \leq q\text{ for some }n\in \mathbb N) \leq \frac{q}{1-q}\frac{1-\pi}{\pi}. \tag{14}\]
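As a last sanity check (mine, with arbitrary illustrative values of \(\pi\) and \(q\)), the equivalence between the two rejection criteria is easy to verify numerically:

```python
def posterior_h0(bn, prior):
    """Posterior probability of H0 (Equation 13), given Bayes factor bn and prior pi."""
    return prior / (prior + (1 - prior) / bn)

pi, q = 0.5, 0.05
b = q / (1 - q) * (1 - pi) / pi          # equivalent threshold on the Bayes factor

for bn in (0.01, 0.05, 0.1, 1.0, 10.0):  # test values away from the boundary b
    assert (posterior_h0(bn, pi) <= q) == (bn <= b)
print(f"Q_n({pi}) <= {q}  <=>  B_n <= {b:.4f}")
```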
References
Footnotes
¹ This is a technical term, meaning that \(H_0\) completely characterizes the probability distribution of data. An example of a non-simple hypothesis would be a parametric model depending on some unknown parameter \(\theta\).
Citation
@online{gherardi2024,
author = {Gherardi, Valerio},
title = {Frequentist Bounds for {Bayesian} Sequential Hypothesis
Testing},
date = {2024-05-22},
url = {https://vgherard.github.io/posts/2024-05-22-frequentist-bounds-for-bayesian-sequential-hypothesis-testing/},
langid = {en}
}