```r
# Tidyverse facilities for plotting
library(dplyr)
library(ggplot2)

# Loss functions
weighted_loss <- function(par, data, w) {
  m <- par[[1]]
  q <- par[[2]]
  x <- data$x
  y <- data$y

  z <- m * x + q
  p <- 1 / (1 + exp(-z))

  -mean(y * w(y) * log(p) + (1-y) * w(1-y) * log(1-p))
}

cross_entropy <- function(par, data)
  weighted_loss(par, data, w = \(y) 1)

cllr <- function(par, data)
  weighted_loss(par, data, w = \(y) mean(1-y))

# Data generating process
rxy <- function(n, pi = .001, mu1 = 1, mu0 = 0, sd1 = 1, sd0 = 0.25) {
  y <- runif(n) < pi
  x <- rnorm(n, mean = y * mu1 + (1-y) * mu0, sd = y * sd1 + (1-y) * sd0)
  data.frame(x = x, y = y)
}

pi <- formals(rxy)$pi

# Simulation
set.seed(840)

data <- rxy(n = 1e6)
par_cllr <- optim(c(1,0), cllr, data = data)$par
par_cross_entropy <- optim(c(1,0), cross_entropy, data = data)$par
par_cross_entropy[2] <- par_cross_entropy[2] - log(pi / (1-pi))

# Helpers to extract LLRs from models
llr <- function(x, par)
  par[1] * x + par[2]

llr_true <- function(x) {
  mu1 <- formals(rxy)$mu1
  mu0 <- formals(rxy)$mu0
  sd1 <- formals(rxy)$sd1
  sd0 <- formals(rxy)$sd0

  a <- 0.5 * (sd1^2 - sd0^2) / (sd1^2 * sd0^2)
  b <- mu1 / (sd1^2) - mu0 / (sd0^2)
  c <- 0.5 * (mu0^2 / (sd0^2) - mu1^2 / (sd1^2)) + log(sd0 / sd1)
  a * x * x + b * x + c
}
```
Intro
During the last few months, I’ve been working on a machine learning algorithm with applications in Forensic Science, a.k.a. Criminalistics. In this field, one common task for the data analyst is to present the trier-of-fact (the person or people who determine the facts in a legal proceeding) with a numerical assessment of the strength of the evidence provided by available data towards different hypotheses. In more familiar terms, the forensic expert is responsible for computing the likelihoods (or likelihood ratios) of data under competing hypotheses, which are then used by the trier-of-fact to produce Bayesian posterior probabilities for the hypotheses in question [1].
In relation to this, forensic scientists have developed a bunch of techniques to evaluate the performance of a likelihood ratio model in discriminating between two alternative hypotheses. In particular, I have come across the so-called Likelihood Ratio Cost, usually defined as:

$$
C_{llr} = \frac{1}{2}\left[\frac{1}{N_1}\sum_{i\,:\,y_i=1}\log_2\!\left(1+\frac{1}{\mathrm{LR}_i}\right)+\frac{1}{N_0}\sum_{j\,:\,y_j=0}\log_2\!\left(1+\mathrm{LR}_j\right)\right] \tag{1}
$$

where $\mathrm{LR}_i$ is the likelihood ratio assigned to the $i$-th case, and $N_1$ and $N_0$ are the numbers of cases for which the first and the second hypothesis hold, respectively.
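As a minimal illustration of Equation 1, here is a small helper of mine (not part of the original analysis), assuming a vector `lr` of likelihood ratios and a 0/1 label vector `y`:

```r
# Empirical likelihood ratio cost, following Equation 1.
# `lr`: likelihood ratios assigned to each case; `y`: true labels (1/0).
cllr_empirical <- function(lr, y) {
  0.5 * (mean(log2(1 + 1 / lr[y == 1])) + mean(log2(1 + lr[y == 0])))
}
```

Note that a completely uninformative system reporting $\mathrm{LR}\equiv 1$ for every case gets $C_{llr} = 1$, which is the usual reference value.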
The main reason for writing this note was to understand a bit better what it means to optimize Equation 1, which does not look immediately obvious to me from its definition [2]. In particular: is the population minimizer of Equation 1 the actual likelihood ratio? And in what sense is a model with lower $C_{llr}$ “better”?
The short answers to these questions are: yes; and: in the sense that a lower $C_{llr}$ corresponds to a better approximation of the likelihood ratio, as made precise in the rest of this note.
Cross-entropy with random weights
We start with a mathematical digression, which will turn out useful for further developments. Let
where
We now look for the population minimizer of Equation 2, i.e. the function
The corresponding expected loss is:
Before looking at values of
where
and
where
In terms of
If now
Putting everything together, we can decompose the expected loss for a function
where
The three components in Equation 8 can be interpreted as follows:
All the information-theoretic quantities (and their corresponding operative interpretations hinted at in the previous paragraph) make reference to the measure
A familiar case: cross-entropy loss
For
From Equation 6 we see that the measure
where conditional entropy
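For reference, the unweighted case can be stated in familiar terms (this restatement is mine, in standard notation): the expected cross-entropy of a probabilistic classifier $f$ is minimized by the true posterior probability, and its minimum value is the conditional entropy of $Y$ given $X$,

$$
\min_{f}\ \mathbb{E}\!\left[-Y\log f(X)-(1-Y)\log\big(1-f(X)\big)\right]=H(Y\mid X),
\qquad f^{*}(x)=\Pr(Y=1\mid X=x).
$$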
The Likelihood Ratio Cost
The quantity
We can easily compute [8]:
so that, by Equation 3, the population minimizer of
where
The constant
The general decomposition Equation 8 becomes:
where
Discussion
The table below provides a comparison between cross-entropy and likelihood-ratio cost, summarizing the results from previous sections.
|  | Cross-entropy | Likelihood Ratio Cost |
|---|---|---|
| Population minimizer | Posterior odds ratio | Likelihood ratio |
| Minimum Loss |  |  |
| Processing Loss |  |  |
| Misspecification Loss |  |  |
| Reference measure |  |  |
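As a small numerical illustration (my own addition, reusing the helpers and the `data` object defined in the code at the top of the post), the two columns of the table correspond to the same weighted loss evaluated with different weight functions:

```r
# Both losses are instances of weighted_loss(), differing only in the weights.
# Evaluated here at an arbitrary parameter vector c(1, 0) = (slope, intercept).
cross_entropy(c(1, 0), data)  # w(y) = 1: plain cross-entropy
cllr(c(1, 0), data)           # w(y) = mean(1 - y): class-rebalancing weights
```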
The objective of
Suppose we are given a set of predictive models
The previous argument carries over mutatis mutandis to
The measure
The fact that
Simulated example
In general, the posterior odds ratio and the likelihood ratio differ only by a constant, so it is reasonable to try to fit the same functional form to both of them. Let us illustrate the differences between cross-entropy and $C_{llr}$ minimization with a simulated example of this kind.
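Concretely, this is just Bayes’ theorem, spelled out here for reference (with $\pi = \Pr(Y=1)$, as in the code):

$$
\frac{\Pr(Y=1\mid X=x)}{\Pr(Y=0\mid X=x)}
= \frac{\pi}{1-\pi}\cdot\frac{p(x\mid Y=1)}{p(x\mid Y=0)},
$$

so that, on the log scale, the posterior odds ratio and the likelihood ratio differ by the additive constant $\log\frac{\pi}{1-\pi}$. This is also why, in the simulation code at the top of the post, the intercept fitted by cross-entropy minimization is shifted by $-\log\frac{\pi}{1-\pi}$ before being interpreted as a log-likelihood-ratio model.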
Suppose that $Y$ is Bernoulli with success probability $\pi$, and that, conditionally on $Y$, the feature $X$ is Gaussian with mean $\mu_Y$ and standard deviation $\sigma_Y$; this is the data generating process implemented by rxy() in the code at the top of the post.
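Under this model, the true log-likelihood ratio has a closed form (a standard computation, written out here for reference; it is exactly what the llr_true() helper implements):

$$
\log\frac{p(x\mid Y=1)}{p(x\mid Y=0)}
= \frac{\sigma_1^2-\sigma_0^2}{2\,\sigma_1^2\sigma_0^2}\,x^2
+\left(\frac{\mu_1}{\sigma_1^2}-\frac{\mu_0}{\sigma_0^2}\right)x
+\frac{1}{2}\left(\frac{\mu_0^2}{\sigma_0^2}-\frac{\mu_1^2}{\sigma_1^2}\right)
+\log\frac{\sigma_0}{\sigma_1},
$$

which is a quadratic in $x$ whenever $\sigma_1 \neq \sigma_0$.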
Suppose that we fit an exponential function $e^{mX+q}$, either to the likelihood ratio (by minimizing $C_{llr}$) or to the posterior odds ratio (by minimizing cross-entropy).
The chunk of R code at the top of this post defines the functions and data used for the simulation. In particular, I’m considering a heavily unbalanced case ($\pi = 0.001$, the default of rxy()).
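As a quick sanity check (my own addition, not part of the original analysis), the cross-entropy fit obtained via optim() is just a logistic regression, so it can be cross-checked against glm():

```r
# The unweighted cross-entropy minimum is maximum-likelihood logistic
# regression, so glm() should recover (up to optimizer tolerance) the same
# slope and intercept found by optim() above -- i.e. par_cross_entropy
# before its intercept is shifted by -log(pi / (1 - pi)).
fit_glm <- glm(y ~ x, family = binomial(),
               data = mutate(data, y = as.integer(y)))
coef(fit_glm)  # ("(Intercept)", "x"); par_cross_entropy is (slope, intercept)
```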
So, what do our best estimates look like? The plot below shows the best-fit lines for the log-likelihood ratio obtained from $C_{llr}$ minimization (red) and cross-entropy minimization (blue), together with the true log-likelihood ratio (black).
```r
ggplot() +
  geom_function(fun = \(x) llr(x, par_cllr), color = "red") +
  geom_function(fun = \(x) llr(x, par_cross_entropy), color = "blue") +
  geom_function(fun = \(x) llr_true(x), color = "black") +
  geom_hline(aes(yintercept = 0), linetype = "dashed", color = "red") +
  geom_hline(aes(yintercept = -log(pi / (1-pi))),
             linetype = "dashed", color = "blue") +
  ylim(c(-10,10)) + xlim(c(-1, 2)) +
  xlab("X") + ylab("Log-Likelihood Ratio")
```
The reason why the lines differ is that they are designed to solve different predictive problems: as we’ve argued above, minimizing $C_{llr}$ targets the likelihood ratio, whereas minimizing cross-entropy targets the posterior odds ratio under the actual (here, heavily unbalanced) class prior. We can see the consequences by comparing the two fits on a balanced and on an unbalanced test set:
```r
test_data <- bind_rows(
  rxy(n = 1e6, pi = 0.5) |> mutate(type = "Balanced", llr_thresh = 0),
  rxy(n = 1e6) |> mutate(type = "Unbalanced", llr_thresh = -log(pi / (1-pi)))
)

test_data |>
  ggplot(aes(x = x, fill = y)) +
  geom_histogram(bins = 100) +
  facet_grid(type ~ ., scales = "free_y") +
  xlim(c(-2, 4))
```
These differences are reflected in the misclassification rates of the classifiers obtained by thresholding each log-likelihood-ratio estimate at the value llr_thresh appropriate for each scenario (zero for the balanced test set, $-\log\frac{\pi}{1-\pi}$ for the unbalanced one):
```r
test_data |>
  mutate(
    llr_cllr = llr(x, par_cllr),
    llr_cross_entropy = llr(x, par_cross_entropy),
    llr_true = llr_true(x)
  ) |>
  group_by(type) |>
  summarise(
    cllr = 1 - mean((llr_cllr > llr_thresh) == y),
    cross_entropy = 1 - mean((llr_cross_entropy > llr_thresh) == y),
    true_llr = 1 - mean((llr_true > llr_thresh) == y)
  )
```

```
# A tibble: 2 × 4
  type           cllr cross_entropy true_llr
  <chr>         <dbl>         <dbl>    <dbl>
1 Balanced   0.166         0.185    0.140
2 Unbalanced 0.000994      0.000637 0.000518
```
Final remarks
Our main conclusion in a nutshell is that the Likelihood Ratio Cost behaves like a cross-entropy in which the two hypotheses are reweighted as if they were a priori equally likely: its population minimizer is the likelihood ratio itself, and a lower $C_{llr}$ means a better approximation of it. The simulated example shows the practical side of this: the $C_{llr}$ fit classifies better on the balanced test set, the cross-entropy fit does better under the prior it was trained on, and the true log-likelihood ratio beats both in either case.
References
Brümmer, Niko, and Johan du Preez. 2006. “Application-Independent Evaluation of Speaker Detection.” Computer Speech & Language 20 (2–3): 230–75.
Footnotes
1. This is how I understood things should theoretically work, from discussions with friends who actually work in this field. I have no idea how closely day-to-day practice comes to this mathematical ideal, or whether there exist alternative frameworks to the one I have just described.
2. The Likelihood Ratio Cost was introduced in (Brümmer and du Preez 2006). The reference looks very complete, but I find its notation and terminology so unfamiliar that I decided to do my own investigation and leave that reading for later.
3. That is to say, it holds for any permutation of the set.
4. Nota bene: the function is here assumed to be fixed, whereas the randomness in the quantity comes only from the paired observations.
5. Notice that, due to symmetry, there is an equivalent expression which might be easier to compute.
6. Here and below I relax the notation a bit, as most details should be clear from context.
7. The quantity is not defined when all of the $y_i$’s are zero, as is the right-hand side of Equation 1 itself. In this case, we adopt a suitable convention.
8. For the original loss in Equation 1, without the modification discussed above, the result would have taken a slightly different form.
9. Formally, given an i.i.d. stochastic process, we can define a new stochastic process which coincides with the original one when a certain condition holds and is undefined otherwise. Discarding the undefined values, we obtain an i.i.d. stochastic process whose individual observations are distributed according to the measure in question.
10. There is another case in which $C_{llr}$ and cross-entropy minimization converge to the same answer asymptotically: when they are used for model selection among a class of models for the likelihood or posterior odds ratio that contains their correct functional form.
11. This is just logistic regression. It could be a reasonable approximation if the true log-likelihood ratio were close to linear (e.g. if $\sigma_1 \approx \sigma_0$), which however I will assume below to be badly violated.
Citation
@online{gherardi2023,
author = {Gherardi, Valerio},
title = {Interpreting the {Likelihood} {Ratio} Cost},
date = {2023-11-15},
url = {https://vgherard.github.io/posts/2023-11-15-interpreting-the-likelihood-ratio-cost/},
langid = {en}
}