kgrams v0.1.2 on CRAN

kgrams: Classical k-gram Language Models in R.

Categories: Natural Language Processing, R
Author: Valerio Gherardi
Published: November 13, 2021

Summary

Version 0.1.2 of my R package kgrams has just been accepted by CRAN. The package provides tools for training and evaluating k-gram language models in R, with support for several probability smoothing techniques, perplexity computations, random text generation and more.

Short demo

library(kgrams)
# Get k-gram frequency counts from Shakespeare's "Much Ado About Nothing"
freqs <- kgram_freqs(kgrams::much_ado, N = 4)

# Build modified Kneser-Ney 4-gram model, with discount parameters D1, D2, D3.
mkn <- language_model(freqs, smoother = "mkn", D1 = 0.25, D2 = 0.5, D3 = 0.75)

# Sample sentences from the language model at different temperatures
set.seed(840)
sample_sentences(model = mkn, n = 3, max_length = 10, t = 1)
#> [1] "i have studied eight or nine truly by your office [...] (truncated output)"
#> [2] "ere you go : <EOS>"
#> [3] "don pedro welcome signior : <EOS>"
sample_sentences(model = mkn, n = 3, max_length = 10, t = 0.1)
#> [1] "i will not be sworn but love may transform me [...] (truncated output)"
#> [2] "i will not fail . <EOS>"
#> [3] "i will go to benedick and counsel him to fight [...] (truncated output)"
sample_sentences(model = mkn, n = 3, max_length = 10, t = 10)
#> [1] "july cham's incite start ancientry effect torture tore pains endings [...] (truncated output)"
#> [2] "lastly gallants happiness publish margaret what by spots commodity wake [...] (truncated output)"
#> [3] "born all's 'fool' nest praise hurt messina build afar dancing [...] (truncated output)"
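
The three calls above differ only in the temperature t, and the samples go from plausible (t = 1) to repetitive (t = 0.1) to nearly random (t = 10). The base R sketch below illustrates one common way temperature reshapes a distribution, assuming the usual transform in which each probability is raised to the power 1/t and the result is renormalized. The helper apply_temperature() and the toy next-word distribution are hypothetical and for illustration only; the package's internal implementation may differ.

```r
# Rescale a probability vector p at temperature t:
# the result is proportional to p^(1/t), i.e. exp(log(p) / t).
apply_temperature <- function(p, t) {
  stopifnot(all(p > 0), t > 0)
  logits <- log(p) / t
  w <- exp(logits - max(logits))  # subtract the max for numerical stability
  w / sum(w)
}

# Hypothetical next-word distribution (not taken from the model above)
p <- c(the = 0.6, a = 0.3, signior = 0.1)

apply_temperature(p, t = 1)    # leaves p unchanged
apply_temperature(p, t = 0.1)  # concentrates almost all mass on "the"
apply_temperature(p, t = 10)   # flattens p towards the uniform distribution
```

Under this assumption, low temperatures make sampling nearly greedy, while high temperatures make every word in the dictionary almost equally likely, which matches the qualitative behavior of the samples shown in the demo.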

NEWS

Overall Software Improvements

  • The package’s test suite has been greatly extended.
  • Improved error/warning conditions for wrong arguments.
  • Re-enabled compiler diagnostics as per CRAN policy (#19).

API Changes

  • verbose arguments now default to FALSE.
  • probability(), perplexity() and sample_sentences() are restricted to accept only language_model class objects as their model argument.
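
As a minimal sketch of these calls under the restricted interface, the example below passes a language_model object as the model argument of probability() and perplexity(). The model is built as in the demo above; the word, context and sentence are arbitrary choices. Perplexity is computed on the training text itself purely for convenience, so the value is optimistic compared with held-out text.

```r
library(kgrams)

# Build the same modified Kneser-Ney model as in the short demo
freqs <- kgram_freqs(kgrams::much_ado, N = 4)
mkn <- language_model(freqs, smoother = "mkn", D1 = 0.25, D2 = 0.5, D3 = 0.75)

# Conditional probability of a word given its context (word %|% context)
p_word <- probability("pedro" %|% "don", model = mkn)

# Probability of a whole sentence
p_sent <- probability("i will not fail .", model = mkn)

# Perplexity of a text under the model (here: the training text itself)
ppl <- perplexity(kgrams::much_ado, model = mkn)

p_word
p_sent
ppl
```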

New features

  • as_dictionary(NULL) now returns an empty dictionary.

Bug Fixes

  • Fixed bug causing .preprocess and .tknz_sent arguments to be ignored in process_sentences().
  • Fixed previously wrong defaults for max_lines and batch_size arguments in kgram_freqs.connection().
  • Added print method for class dictionary.
  • Fixed bug causing invalid results in dictionary() with batch processing and non-trivial size constraints on vocabulary size.

Other

  • Maintainer’s email updated.

Citation

BibTeX citation:
@online{gherardi2021,
  author = {Gherardi, Valerio},
  title = {Kgrams V0.1.2 on {CRAN}},
  date = {2021-11-13},
  url = {https://vgherard.github.io/posts/2021-11-13-kgrams-v012-released/kgrams-v012-released.html},
  langid = {en}
}
For attribution, please cite this work as:
Gherardi, Valerio. 2021. “Kgrams V0.1.2 on CRAN.” November 13, 2021. https://vgherard.github.io/posts/2021-11-13-kgrams-v012-released/kgrams-v012-released.html.