Build a k-gram language model.

Principal methods supported by objects of class language_model

language_model(object, ...)

# S3 method for language_model
language_model(object, ...)

# S3 method for kgram_freqs
language_model(object, smoother = "ml", N = param(object, "N"), ...)

Arguments

object

an object which stores the information required to build the k-gram model. At present, this must be either a kgram_freqs object, or a language_model object of which a copy is desired (see Details).

...

possible additional parameters required by the smoother.

smoother

a length one character vector. Indicates the smoothing technique to be applied to compute k-gram continuation probabilities. A list of available smoothers can be obtained with smoothers(), and further information on a particular smoother through info().

N

a length one integer. Maximum order of k-grams to use in the language model. This must be less than or equal to the order of the underlying kgram_freqs object.
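As a minimal sketch of how the N argument can be used, the snippet below builds both a 1-gram and a 2-gram model from the same order-2 kgram_freqs object. It assumes the kgrams package is installed and that param() can be used to read back the order of a model, as suggested by the default N = param(object, "N") in the usage above. The "ml" smoother is chosen here only because it needs no tuning parameters.

```r
library(kgrams)

# k-gram frequency table of order 2 (same toy corpus as in the Examples)
freqs <- kgram_freqs("a a b a a b a b a b a b", 2)

# Use only unigram counts, although bigram counts are available
unigram_model <- language_model(freqs, "ml", N = 1)

# By default, N is the order of the kgram_freqs object
bigram_model <- language_model(freqs, "ml")

param(unigram_model, "N")
param(bigram_model, "N")
```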

Value

A language_model object.

Details

These generics are used to construct objects of class language_model. The language_model method is only needed to create copies of language_model objects (that is to say, new copies which are not altered by methods which modify the original object in place, see e.g. parameters). The discussion below focuses on language models and the kgram_freqs method.
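The copy semantics described above can be illustrated with a short sketch. It assumes the kgrams package is installed and that the replacement form param(object, which) <- value (documented under parameters) modifies the model in place. The variable names are chosen for illustration only.

```r
library(kgrams)

freqs <- kgram_freqs("a a b a a b a b a b a b", 2)
original <- language_model(freqs, "kn", D = 0.5)

# The language_model method returns an independent copy of the model
model_copy <- language_model(original)

# Changing a parameter modifies 'original' in place...
param(original, "D") <- 0.75

# ...while the copy keeps the value it had when it was created
param(original, "D")
param(model_copy, "D")
```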

kgrams supports several k-gram language models, including Interpolated Kneser-Ney, Stupid Backoff and others (see smoothers). The objects created by language_model() have methods for computing word continuation and sentence probabilities (see probability), random text generation (see sample_sentences) and other types of language modeling tasks, such as computing perplexities and word prediction accuracies.
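The following sketch shows how some of these tasks might look in practice. It assumes the kgrams package is installed, and that sample_sentences() accepts n and max_length arguments and perplexity() takes the text as its first argument (see the respective help pages for the authoritative signatures). No output is shown, since the sampled sentences depend on the random seed.

```r
library(kgrams)

freqs <- kgram_freqs("a a b a a b a b a b a b", 2)
model <- language_model(freqs, "kn", D = 0.5)

# Continuation probability of the word "a" given the context "b"
p_word <- probability("a" %|% "b", model)

# Probability of a whole sentence
p_sent <- probability("a b a", model)

# Sample three random sentences of at most ten words each
set.seed(1)
sentences <- sample_sentences(model, n = 3, max_length = 10)

# Perplexity of a short test text under the model
ppl <- perplexity("a b a b", model)
```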

Smoothers often have tuning parameters, which need to be specified by (exact) name through the ... arguments; otherwise, language_model() will use default values and, once per session, throw a warning. info(smoother) lists all parameters needed by a specific smoother, together with their allowed parameter space.
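A minimal sketch of this workflow, assuming the kgrams package is installed: list the available smoothers, inspect the parameters of one of them, and then pass its tuning parameter by exact name. The value D = 0.75 is an arbitrary choice for illustration.

```r
library(kgrams)

# Codes of all available smoothers
smoothers()

# Parameters required by the Kneser-Ney smoother and their allowed ranges
info("kn")

freqs <- kgram_freqs("a a b a a b a b a b a b", 2)

# Pass the discount parameter D by its exact name through '...'
model <- language_model(freqs, "kn", D = 0.75)
param(model, "D")
```

Naming the parameter explicitly avoids relying on the default value and the associated warning.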

The run-time of language_model() may vary substantially across smoothing methods, depending on whether a method requires the computation of additional quantities beyond k-gram counts (this is, for instance, the case for the Kneser-Ney smoother).

Author

Valerio Gherardi

Examples

# Create an interpolated Kneser-Ney 2-gram language model

freqs <- kgram_freqs("a a b a a b a b a b a b", 2)
model <- language_model(freqs, "kn", D = 0.5)
model
#> A k-gram language model.
summary(model)
#> A k-gram language model.
#> 
#> Smoother:
#> * 'kn'.
#> 
#> Parameters:
#> * N: 2
#> * V: 2
#> * D: 0.5
#> 
#> Number of words in training corpus:
#> * W: 13
#> 
#> Number of distinct k-grams with positive counts:
#> * 1-grams:4
#> * 2-grams:5
probability("a" %|% "b", model)
#> [1] 0.815

# For more examples, see ?probability, ?sample_sentences and ?perplexity.