kgrams: Classical k-gram Language Models in R.
Version 0.1.2 of my R package kgrams has just been accepted by CRAN. The package provides tools for training and evaluating k-gram language models in R; it supports several probability smoothing techniques, perplexity computations, random text generation, and more.
library(kgrams)
# Get k-gram frequency counts from Shakespeare's "Much Ado About Nothing"
freqs <- kgram_freqs(kgrams::much_ado, N = 4)
# Build modified Kneser-Ney 4-gram model, with discount parameters D1, D2, D3.
mkn <- language_model(freqs, smoother = "mkn", D1 = 0.25, D2 = 0.5, D3 = 0.75)
# Sample sentences from the language model at different temperatures
set.seed(840)
sample_sentences(model = mkn, n = 3, max_length = 10, t = 1)
[1] "i have studied eight or nine truly by your office [...] (truncated output)"
[2] "ere you go : <EOS>"
[3] "don pedro welcome signior : <EOS>"
sample_sentences(model = mkn, n = 3, max_length = 10, t = 0.1)
[1] "i will not be sworn but love may transform me [...] (truncated output)"
[2] "i will not fail . <EOS>"
[3] "i will go to benedick and counsel him to fight [...] (truncated output)"
sample_sentences(model = mkn, n = 3, max_length = 10, t = 10)
[1] "july cham's incite start ancientry effect torture tore pains endings [...] (truncated output)"
[2] "lastly gallants happiness publish margaret what by spots commodity wake [...] (truncated output)"
[3] "born all's 'fool' nest praise hurt messina build afar dancing [...] (truncated output)"
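The effect of the t argument in the samples above can be illustrated with a minimal base-R sketch. It assumes the usual definition of sampling temperature, in which each probability is raised to the power 1/t and the result is renormalized; it is not the package's internal code, and the function name apply_temperature and the toy distribution p are hypothetical.

```r
# Rescale a probability vector p by temperature t: p_t is proportional to p^(1/t).
apply_temperature <- function(p, t = 1) {
  stopifnot(is.numeric(p), all(p >= 0), t > 0)
  # Work in log space for numerical stability; log(0) = -Inf maps back to 0.
  logp <- log(p) / t
  w <- exp(logp - max(logp))
  w / sum(w)
}

# Hypothetical next-word distribution, for illustration only
p <- c(the = 0.5, a = 0.3, of = 0.15, zebra = 0.05)

apply_temperature(p, t = 1)    # leaves the distribution unchanged
apply_temperature(p, t = 0.1)  # concentrates mass on the most likely word
apply_temperature(p, t = 10)   # flattens the distribution towards uniform

# Draw five words from the low-temperature distribution
set.seed(840)
sample(names(p), size = 5, replace = TRUE, prob = apply_temperature(p, t = 0.1))
```

This matches the qualitative pattern in the output above: low temperatures favour the most probable continuations, while high temperatures make rare words almost as likely as common ones.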
Changes in this release:
- verbose arguments now default to FALSE.
- probability(), perplexity() and sample_sentences() are restricted to accept only language_model class objects as their model argument.
- as_dictionary(NULL) now returns an empty dictionary.
- Fixed a bug causing the .preprocess and .tknz_sent arguments to be ignored in process_sentences().
- [...] max_lines and batch_size arguments in kgram_freqs.connection().
- [...] dictionary.
- [...] dictionary() with batch processing and non-trivial size constraints on vocabulary size.
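For readers unfamiliar with the quantity that perplexity() reports, here is a minimal base-R sketch of the textbook definition: the exponential of the average negative log-probability that a model assigns to the tokens of a text. The function name perplexity_from_probs and the per-token probabilities are hypothetical, and the sketch does not reproduce how the package handles details such as sentence boundaries or unknown words.

```r
# Perplexity from the probabilities a model assigns to each token of a text:
# the exponential of the average negative log-probability.
perplexity_from_probs <- function(token_probs) {
  stopifnot(is.numeric(token_probs), all(token_probs > 0), all(token_probs <= 1))
  exp(-mean(log(token_probs)))
}

# Hypothetical per-token probabilities, for illustration only
probs <- c(0.2, 0.1, 0.05, 0.25)
perplexity_from_probs(probs)

# Sanity check: a model that is uniform over V words has perplexity V
V <- 1000
perplexity_from_probs(rep(1 / V, 20))
```

Lower values mean the model found the text less surprising, which is why perplexity on held-out text is a common way to compare smoothing techniques.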
Text and figures are licensed under Creative Commons Attribution CC BY-SA 4.0. Source code is available at https://github.com/vgherard/vgherard.github.io/, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Gherardi (2021, Nov. 13). vgherard: kgrams v0.1.2 on CRAN. Retrieved from https://vgherard.github.io/posts/2021-11-13-kgrams-v012-released/
BibTeX citation
@misc{gherardi2021kgrams,
  author = {Gherardi, Valerio},
  title = {vgherard: kgrams v0.1.2 on CRAN},
  url = {https://vgherard.github.io/posts/2021-11-13-kgrams-v012-released/},
  year = {2021}
}