#' @param q_doc character. Text document whose authorship is questioned.
#' @param auth_corpus character. Text corpus of claimed author.
#' @param imp_corpus character. Text corpus of impostors.
#' @param n_imp a positive number. Number of "impostor" simulations.
score <- function(q_doc, auth_corpus, imp_corpus, n_imp = 100)
{
  q_doc <- extract_grammar(q_doc)
  auth_corpus <- extract_grammar(auth_corpus)
  imp_corpus <- extract_grammar(imp_corpus)
  # Compute perplexity based on claimed author's language model.
  auth_mod <- train_language_model(auth_corpus)
  auth_perp <- kgrams::perplexity(q_doc, model = auth_mod)
  # Compute perplexity based on impostor language models.
  #
  # Each impostor is trained on a synthetic corpus obtained by sampling from
  # the impostor corpus the same number of sentences as the corpus of the
  # claimed author.
  n_sents_auth <- length(kgrams::tknz_sent(auth_corpus))
  imp_corpus_sentences <- kgrams::tknz_sent(imp_corpus)
  imp_mod <- replicate(n_imp, {
    sample(imp_corpus_sentences, n_sents_auth) |> train_language_model()
  })
  imp_perp <- sapply(imp_mod, \(m) kgrams::perplexity(q_doc, model = m))
  # Score is the fraction of impostor models that perform worse (higher
  # perplexity) than the claimed author's language model.
  score <- mean(auth_perp < imp_perp)
  return(score)
}
train_language_model <- function(text)
{
  text |>
    kgrams::kgram_freqs(N = 10, .tknz_sent = kgrams::tknz_sent) |>
    kgrams::language_model(smoother = "kn", D = 0.75)
}
# Just a placeholder for the grammar extraction step described in the text.
extract_grammar <- identity
About a month ago we finally managed to post (Nini et al. 2024), “Authorship Verification based on the Likelihood Ratio of Grammar Models”, on arXiv. Delving into authorship verification, grammar and forensics was quite a detour for me, and I’d like to summarize here some of the ideas and lessons I took away from working with all this new and interesting material.
The main qualitative idea put forward in (Nini et al. 2024) is that grammar is a fundamentally personal and unique trait of an individual, and can therefore serve as a sort of “behavioural biometric”. A first goal of this work was to put this general principle to the test by applying it to the problem of Authorship Verification (AV): the process of validating whether a certain document was written by a claimed author. Concretely, we built an algorithm for AV that relies entirely on the grammatical features of the examined textual data, and compared it with state-of-the-art methods for AV.
The results were very encouraging: our method turned out to be generally superior to the previous state of the art on the benchmarks we examined. This is a notable result, especially considering that our method uses less textual information (only the grammar part) than other methods to perform its inferences.
The algorithm
I sketch here a pseudo-implementation of our method in R. For fitting the \(k\)-gram models and computing perplexities, I use my package {kgrams}, which can be installed from CRAN. Model (hyper)parameters such as the number of impostors, the order of the \(k\)-gram models, etc. are hardcoded; see (Nini et al. 2024) for details.
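For reference, here is the textbook definition of perplexity that the score relies on. This is a general definition, not a statement about the exact conventions used inside {kgrams} (for instance, how sentence boundaries are counted), and the symbols \(P\), \(w_i\) and \(N\) are my own notation for this note. Given a language model \(P\) and a document made of tokens \(w_1, \dots, w_N\):

\[
\mathrm{PP}(w_1, \dots, w_N) = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_1, \dots, w_{i-1}) \right)
\]

For a \(k\)-gram model, the conditioning context is truncated to the last \(k-1\) tokens. A lower perplexity means that the model finds the document less surprising.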
This is just to illustrate the essence of the method. For practical reasons, the code does not reproduce the definition of the function extract_grammar(), which in our work is embodied by the POS-noise algorithm. This function should transform a regular sentence, such as “He wrote a sentence”, into its underlying grammatical structure, say “[Pronoun] [verb] a [noun]”.
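To give a feel for what such a transformation does, here is a deliberately naive sketch in base R. It is not the POS-noise algorithm: the function toy_extract_grammar() and the tiny hand-written lexicons toy_keep and toy_tags are hypothetical names made up for this illustration, and punctuation is not handled at all.

```r
# Hypothetical toy lexicons, made up for illustration only.
# Words in `toy_keep` are kept verbatim; words in `toy_tags` are replaced by
# their tag; anything else becomes "[other]".
toy_keep <- c("a", "the", "of", "and", "to", "in")
toy_tags <- c(he = "[Pronoun]", she = "[Pronoun]",
              wrote = "[verb]", sentence = "[noun]")

toy_extract_grammar <- function(text) {
  # Split the text into whitespace-separated words.
  words <- strsplit(text, "[[:space:]]+")[[1]]
  masked <- vapply(words, function(w) {
    key <- tolower(w)
    if (key %in% toy_keep) return(key)
    if (key %in% names(toy_tags)) return(toy_tags[[key]])
    "[other]"
  }, character(1), USE.NAMES = FALSE)
  paste(masked, collapse = " ")
}

toy_extract_grammar("He wrote a sentence")
# [1] "[Pronoun] [verb] a [noun]"
```

A real implementation needs a proper part-of-speech tagger and a principled choice of which words to keep. The score() function sketched in this post does not depend on this toy and keeps extract_grammar() as an identity placeholder.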
The score() function is used as follows:
q_doc <- "a a b a. b a. c b a. b a b. a."
auth_corpus <- "a a b a b. b c b. a b c a. b a. b c a."
imp_corpus <- "a a. b. a. b a. b a. c. a b a. d. a b. a d. a b a b c b a."
set.seed(840)
score(q_doc, auth_corpus, imp_corpus)
[1] 0.89
The “score” computed by this algorithm turns out to be a good predictor of whether the claimed authorship is true, with higher scores being correlated with true attributions. If the impostor corpus is fixed once and for all, and if the pairs q_doc and auth_corpus are randomly sampled from a fixed joint distribution, we can set a threshold for score in such a way that the attribution criterion score > threshold maximizes some objective such as accuracy. This is, more or less, what we studied quantitatively in our paper.
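As a minimal sketch of the thresholding step, assume we have scores for a handful of verification cases with known ground truth. The vectors scores and is_true below are invented numbers for illustration, not results from the paper, and the simple grid search is just one way to pick a threshold.

```r
# Hypothetical labelled data: `is_true` says whether the claimed authorship
# was actually correct. The values are made up for illustration.
scores  <- c(0.95, 0.89, 0.70, 0.40, 0.65, 0.20, 0.10, 0.55)
is_true <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)

# Accuracy of the attribution rule "accept if score > threshold".
accuracy_at <- function(threshold) mean((scores > threshold) == is_true)

# Candidate thresholds: 0 plus every observed score.
candidates <- sort(unique(c(0, scores)))
accuracies <- vapply(candidates, accuracy_at, numeric(1))

# Keep the (first) threshold with the highest accuracy.
best_threshold <- candidates[which.max(accuracies)]
best_accuracy  <- max(accuracies)

best_threshold
# [1] 0.65
best_accuracy
# [1] 0.875
```

In practice the threshold should be chosen on data that is separate from the data used to report accuracy, otherwise the estimate is optimistic.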
References
Citation
@online{gherardi2024,
author = {Gherardi, Valerio},
title = {Grammar as a Biometric for {Authorship} {Verification}},
date = {2024-04-25},
url = {https://vgherard.github.io/posts/2024-04-25-grammar-as-a-biometric-for-authorship-verification/},
langid = {en}
}