Some notes on the causes of overdispersion in count data.

I came across some confusing statements regarding how *overdispersion* can arise in binary or count data, such as the typical capture-recapture data encountered in Biology and Ecology. The term generally refers to a variance inflation with respect to the prediction of a specific statistical model (or family of models) for the data generating process under study.
Such extra variability is sometimes ascribed to “inhomogeneities in the population”, a phrase that is not very precise from a mathematical point of view, and can be misleading without further qualification.

Let’s get to the point. Consider a binomial model: \[ B_{N,\,p}(k) = \binom{N}{k}p^k(1-p)^{N-k}\tag{1}. \] For concreteness, imagine that we are studying survival in a population of animals, and Eq. (1) is proposed to model the survival probability of an initial cohort of \(N\) individuals (from one year to the next, say).

Clearly, we should not expect \(p\) to represent the survival probability for each single animal. In fact are a lot of factors that are only determined at the level of the individual and that could realistically affect survival: age, sex, weight, *etc.*. If we are not including any individual variable in our analysis (because they were not measured, for example) we may model the survival probability \(q\) of any individual by a probability distribution
\(\text d \pi(q)\) with mean \(\bar q = \int q\,\text d \pi (q)\). If we further assume that survival probabilities of several individuals are independent and identically distributed (i.i.d.), we can derive explicitly the probability of \(k\) survivals out of \(N\) initial individuals:

\[ \begin{split} \text{Pr}(k) &= \intop \text d\pi(q_1)\text d\pi (q_2)\cdots \text d\pi(q_N)\sum _{y\in E_{N,k}}\prod_{i=1}^Nq_i^{y_i}(1-q_i)^{(1-y_i)}\\ & = \sum _{y\in E_{N,k}}\prod_{i=1}^N \left(\intop \text d \pi(q_i)q_i^{y_i}(1-q_i)^{(1-y_i)}\right) \\ & = \sum _{y\in E_{N,k}}\prod_{i=1}^N \bar q^{y_i}(1-\bar q)^{1-y_i} \\ & = B_{N,\bar q}(k) \end{split}\tag{2} \] where we have denoted by \(E_{N,k}\) the set of \(Y\in \{0,\,1\}^N\) such that \(\sum _{i=1}^{N}Y_i=k\). Notice that the integrand in the first line of (2) is the conditional probability of \(k\) successes out of \(N\) trials with probabilities for the individual trials given by \(q_1,\,q_2,\,\dots,\,q_N\).

In other words, assuming that the survival probabilities of individuals are i.i.d. according to \(\text d \pi (q)\), we see that the binomial distribution (1) holds exactly with \(p = \intop \text d \pi (q) q\) for the unconditional (on individual level covariates) distribution of survivals. *A fortiori*, no overdispersion with respect to the binomial variance, *i.e.* \(Np(1-p)\), is possible under these assumptions.

Let us examine a bit more in detail the i.i.d. assumption. First of all, we observe that “identically distributed” is not the same as “identical”, which would be the case if \(\text d \pi (q) = \delta (q - \bar q)\text dq\). Quite the contrary, the purpose of \(\text d \pi (q)\) is exactly to reflect the variability of \(q\) in the overall population. On the other hand, assuming all individuals are sampled from the *same* population, the distribution \(\text d \pi\) is simply the result of such a sampling scheme, and it doesn’t really make sense to consider different distributions for different individuals. The only case in which we should use different \(\text d \pi _i(q_i)\) distributions is if our experimental design involved systematically sampling individuals from distinct populations and putting them together into a single cohort (*e.g.* we always start with \(\frac{N}{2}\) individuals from population \(A\) and \(\frac{N}{2}\) from population \(B\)). Finally, if the analysis included some individual covariate, such as *sex* or *age*, all the discussion would remain valid, with unconditional survival probabilities replaced by conditional (on *sex* and *age*) probabilities.

The rather strong assumption is, instead, independence. How could independence be violated? Suppose there is some set of variables \(X\) not included in the analysis, which globally affect survival for all individuals - in our example \(X\) may include for instance things like *food availability* and *metereological conditions*. Suppose, further, that survival probabilities are actually i.i.d. *conditional on* \(X\), with joint distribution:

\[ \text d \Pi(q_1,q_2,\dots, q_N\vert X)=\text d\pi(q_1\vert X)\text d\pi(q_2\vert X)\cdots\text d\pi (q_N\vert X)\tag{3} \] Then, unconditionally:

\[ \text d \Pi(q_1,q_2,\dots, q_N)=\intop \text dF(X)\,\text d\pi(q_1\vert X)\text d\pi(q_2\vert X)\cdots\text d\pi (q_N\vert X),\tag{4} \]

where \(\text d F(X)\) is the marginal distribution of \(X\). Crucially, this is in general not a product measure, and (looking back at our derivation, Eq. (2)) we see that this dependence can indeed change the form of the resulting distribution - and lead to overdispersion with respect to the binomial expectation, in particular.

I think the discussion above clearly shows that binomial overdispersion is not caused by inhomogeneities in the population, if these are understood as random (patternless) variations at the individual level. Quite the contrary, what can easily make data look non-binomial is the presence of unobserved global factors that can change randomly between experimental repetitions, and influence (or simply correlate with) survival probability at the population level.

A few concluding remarks:

From a pure mathematical point of view, in the limit in which initial cohorts are sampled from a single, infinite population, the validity of the i.i.d. assumption is guaranteed by De Finetti’s theorem on infinite exchangeable sequences (in the finite case there are also guarantees of approximate validity). Clearly, if the experimental design involves sampling individuals from several populations in a systematic way, the resulting sequence of Bernoulli variables (alive/dead) is not exchangeable.

If the De Finetti measure in the previous point can change between different cohort releases, depending on some random and unmeasured parameter \(X\), this will effectively lead to the same kind of dependence between individual probability parameters illustrated above.

Another form in which dependence may arise is when survival of one individual may influence, or simply correlate with, survival of other individuals. Imagine, for instance, that we may only observe individuals in pairs. Mathematically, this will again manifest in the form of non-exchangeability.

**Further references:**

Cox, D. R., and E. J. Snell. 1989. *Analysis of Binary Data, Second Edition*. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. Taylor & Francis.

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Text and figures are licensed under Creative Commons Attribution CC BY-SA 4.0. Source code is available at https://github.com/vgherard/vgherard.github.io/, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

For attribution, please cite this work as

Gherardi (2024, March 6). vgherard: No binomial overdispersion from variations at the individual level. Retrieved from https://vgherard.github.io/posts/2024-03-06-no-binomial-overdispersion-from-variations-at-the-individual-level/

BibTeX citation

@misc{gherardi2024no, author = {Gherardi, Valerio}, title = {vgherard: No binomial overdispersion from variations at the individual level}, url = {https://vgherard.github.io/posts/2024-03-06-no-binomial-overdispersion-from-variations-at-the-individual-level/}, year = {2024} }