Build a k-gram language model.
language_model objects provide methods for the following tasks:

probability(): compute word continuation and sentence probabilities. See probability.

sample_sentences(): generate random text by sampling from the language model probability distribution at arbitrary temperature. See sample_sentences.

perplexity(): compute the language model perplexity on a test corpus. See perplexity.
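A minimal sketch of these three operations on a toy model follows; the argument names of sample_sentences() (n, max_length and the temperature t) are assumptions to be verified in ?sample_sentences.

# Toy 2-gram Kneser-Ney model (same data as in the Examples below)
freqs <- kgram_freqs("a a b a a b a b a b a b", 2)
model <- language_model(freqs, "kn", D = 0.5)

# Word continuation and sentence probabilities
probability("a" %|% "b", model)  # P("a" | "b")
probability("a b a b", model)    # probability of a full sentence

# Random text generation; 't' is the sampling temperature (assumed argument name)
sample_sentences(model, n = 3, max_length = 10, t = 1.0)

# Perplexity on a (here trivial) test corpus
perplexity("a b a b a b", model)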
language_model(object, ...)
# S3 method for class 'language_model'
language_model(object, ...)
# S3 method for class 'kgram_freqs'
language_model(object, smoother = "ml", N = param(object, "N"), ...)
object: an object which stores the information required to build the k-gram model. At present, this is necessarily a kgram_freqs object, or a language_model object of which a copy is desired (see Details).

...: possible additional parameters required by the smoother.

smoother: a length one character vector. Indicates the smoothing technique to be applied to compute k-gram continuation probabilities. A list of available smoothers can be obtained with smoothers(), and further information on a particular smoother through info().

N: a length one integer. Maximum order of k-grams to use in the language model. This must be less than or equal to the order of the underlying kgram_freqs object.
A language_model object.
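The sketch below ties the arguments above together: counts of order 3 are used to build a model of lower order N = 2, and the Kneser-Ney discount D is passed by exact name through the ... argument.

# Order-3 counts; the model may use any N less than or equal to 3
freqs <- kgram_freqs("a a b a a b a b a b a b", 3)

# Interpolated Kneser-Ney model of order 2; 'D' is passed through '...'
model <- language_model(freqs, smoother = "kn", N = 2, D = 0.5)
model
#> A k-gram language model.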
These generics are used to construct objects of class language_model. The language_model method is only needed to create copies of language_model objects (that is to say, new copies which are not altered by methods which modify the original object in place, see e.g. parameters). The discussion below focuses on language models and the kgram_freqs method.
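For instance (a sketch assuming the param() getter and setter documented under parameters; param<- modifies a model in place, which is what makes explicit copies useful):

freqs <- kgram_freqs("a a b a a b a b a b a b", 2)
model <- language_model(freqs, "kn", D = 0.5)

copy <- language_model(model)  # independent copy of 'model'
param(model, "D") <- 0.75      # modifies 'model' in place

param(model, "D")  # 0.75
param(copy, "D")   # still 0.5: the copy is not affected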
kgrams supports several k-gram language models, including Interpolated Kneser-Ney, Stupid Backoff and others (see smoothers). The objects created by language_model() have methods for computing word continuation and sentence probabilities (see probability), random text generation (see sample_sentences) and other types of language modeling tasks such as computing perplexities and word prediction accuracies.
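For example (a sketch; smoothers() returns the authoritative list of smoother codes, and the code "sbo" used below is assumed to denote Stupid Backoff):

smoothers()  # character vector of available smoother codes

freqs <- kgram_freqs("a a b a a b a b a b a b", 2)
kn  <- language_model(freqs, smoother = "kn", D = 0.5)  # Interpolated Kneser-Ney
sbo <- language_model(freqs, smoother = "sbo")          # Stupid Backoff (default parameters)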
Smoothers often have tuning parameters, which need to be specified by (exact) name through the ... argument; otherwise, language_model() will use default values and, once per session, throw a warning. info(smoother) lists all parameters needed by a specific smoother, together with their allowed parameter space.
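For the Kneser-Ney smoother used in the Examples below, this workflow looks as follows (a sketch):

info("kn")  # lists the parameters required by the "kn" smoother and their allowed ranges

freqs <- kgram_freqs("a a b a a b a b a b a b", 2)

model <- language_model(freqs, "kn")           # 'D' not supplied: default used, warning thrown once per session
model <- language_model(freqs, "kn", D = 0.5)  # 'D' specified by exact name through '...'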
The run-time of language_model() may vary substantially for different smoothing methods, depending on whether or not a method requires the computation of additional quantities (that is to say, beyond k-gram counts) in order to operate (this is, for instance, the case for the Kneser-Ney smoother).
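A rough way to observe this is sketched below with system.time(), assuming that the default "ml" smoother needs nothing beyond the raw counts while "kn" precomputes additional quantities; the toy corpus is repeated only to give something non-trivial to time, and the gap is more pronounced on realistic corpora.

txt <- rep("a a b a a b a b a b a b", 1000)
freqs <- kgram_freqs(txt, 3)

system.time(language_model(freqs, "ml"))           # no quantities beyond k-gram counts
system.time(language_model(freqs, "kn", D = 0.5))  # extra quantities computed at construction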
# Create an interpolated Kneser-Ney 2-gram language model
freqs <- kgram_freqs("a a b a a b a b a b a b", 2)
model <- language_model(freqs, "kn", D = 0.5)
model
#> A k-gram language model.
summary(model)
#> A k-gram language model.
#>
#> Smoother:
#> * 'kn'.
#>
#> Parameters:
#> * N: 2
#> * V: 2
#> * D: 0.5
#>
#> Number of words in training corpus:
#> * W: 13
#>
#> Number of distinct k-grams with positive counts:
#> * 1-grams: 4
#> * 2-grams: 5
probability("a" %|% "b", model)
#> [1] 0.815
# For more examples, see ?probability, ?sample_sentences and ?perplexity.