Sample sentences from a language model's probability distribution.

sample_sentences(model, n, max_length, t = 1)

Arguments

model

an object of class language_model.

n

an integer. Number of sentences to sample.

max_length

an integer. Maximum length of sampled sentences.

t

a positive number. Sampling temperature (optional); see Details.

Value

a character vector of length n. Random sentences generated from the language model's distribution.

Details

This function samples sentences according to the probability distribution of the given language model, with an optional temperature parameter. The temperature transform of a probability distribution is defined by p(t) = exp(log(p) / t) / Z(t), where Z(t) is the partition function, fixed by the normalization condition sum(p(t)) = 1.
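
As a minimal illustration (plain R, not part of the package API), the temperature transform of a generic probability vector can be computed as follows:

# Temperature transform of a probability vector 'p' at temperature 't'
# (illustrative sketch, not package code)
temperature_transform <- function(p, t) {
  q <- exp(log(p) / t)  # rescale log-probabilities by 1 / t
  q / sum(q)            # normalize, so that the result sums to one
}

p <- c(0.7, 0.2, 0.1)
temperature_transform(p, t = 100)   # high temperature: close to uniform
temperature_transform(p, t = 0.01)  # low temperature: concentrates on the mode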

Sampling is performed word by word, using the already sampled string as context, starting from the Begin-Of-Sentence context (i.e. N - 1 BOS tokens). Sampling stops either when an End-Of-Sentence token is encountered, or when the string exceeds max_length, in which case a truncated output is returned.

Some language models may give a non-zero probability to the Unknown word token, but this is never produced in text generated by sample_sentences(): when randomly sampled, it is simply ignored.
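
The following R sketch pictures the sampling loop described above; it is purely illustrative and does not reproduce the package internals. next_word_probs() is a toy stand-in for the model's continuation distribution over the dictionary, and "<UNK>" is used here as a placeholder string for the Unknown word token.

# Toy continuation distribution: uniform over a small vocabulary plus
# the End-Of-Sentence and Unknown word tokens.
next_word_probs <- function(context) {
  words <- c("much", "ado", "about", "nothing", "<EOS>", "<UNK>")
  setNames(rep(1 / length(words), length(words)), words)
}

sample_one_sentence <- function(max_length, t = 1) {
  context <- character(0)   # conceptually, the N - 1 BOS tokens
  sentence <- character(0)
  while (length(sentence) < max_length) {
    p <- next_word_probs(context)
    p <- exp(log(p) / t); p <- p / sum(p)   # temperature transform
    word <- sample(names(p), 1, prob = p)   # draw the next word
    if (word == "<UNK>") next               # Unknown token: ignored, never emitted
    if (word == "<EOS>") return(paste(sentence, collapse = " "))
    sentence <- c(sentence, word)
    context <- c(context, word)
  }
  paste(c(sentence, "[...] (truncated output)"), collapse = " ")
}

sample_one_sentence(max_length = 10)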

Finally, a word of caution on two special smoothers. The "sbo" smoother (Stupid Backoff) does not produce normalized continuation probabilities, but rather continuation scores; in this case, sampling is performed by assuming that Stupid Backoff scores are proportional to actual probabilities. The "ml" smoother (Maximum Likelihood) does not assign probabilities when the k-gram count of the context is zero; when this happens, the next word is chosen uniformly at random from the model's dictionary.
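
In R pseudocode, with toy values in place of the model internals (the 'scores' and 'dict_words' objects below are hypothetical, not package code), these two special cases amount to the following:

# "sbo": Stupid Backoff continuation scores are not normalized; they are
# treated as if proportional to probabilities and normalized before sampling.
scores <- c(the = 3.2, a = 1.1, some = 0.4)   # hypothetical continuation scores
word <- sample(names(scores), 1, prob = scores / sum(scores))

# "ml": if the k-gram count of the context is zero, no probabilities are
# defined and the next word is drawn uniformly from the model's dictionary.
dict_words <- c("the", "a", "some")           # hypothetical dictionary
context_count <- 0                            # context unseen in training data
if (context_count == 0) word <- sample(dict_words, 1)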

Author

Valerio Gherardi

Examples

# Sample sentences from 8-gram Kneser-Ney model trained on Shakespeare's
# "Much Ado About Nothing"

# \donttest{

### Prepare the model and set seed
freqs <- kgram_freqs(much_ado, 8, .tknz_sent = tknz_sent)
model <- language_model(freqs, "kn", D = 0.75)
set.seed(840)

sample_sentences(model, n = 3, max_length = 10)
#> [1] "we are now to the prince else by thy side [...] (truncated output)"         
#> [2] "don pedro leonato and his brother <EOS>"                                    
#> [3] "yet stand close as you know my inwardness and love [...] (truncated output)"

### Sampling at high temperature
sample_sentences(model, n = 3, max_length = 10, t = 100)
#> [1] "however villanies i cousins 'father armour signify down remnants ancientry [...] (truncated output)"
#> [2] "title iv hanged wring pestilence large hanged light blazon arrant [...] (truncated output)"         
#> [3] "woo soul afar slept sick decerns beside truant neighbour slanders [...] (truncated output)"         

### Sampling at low temperature
sample_sentences(model, n = 3, max_length = 10, t = 0.01)
#> [1] "i will go before and show him their examination <EOS>"                 
#> [2] "i will not be sworn but love may transform me [...] (truncated output)"
#> [3] "i will not be sworn but love may transform me [...] (truncated output)"

# }