Breaking changes

  • tknz_sent() and preprocess() now have separate implementations for Windows and UNIX OSs (the previous C++ implementation behaved unpredictably on Windows, see #30). This fix also slightly changes the output of tknz_sent() in some corner cases (e.g. tknz_sent("") now returns character(0), whereas it used to return "").

New features

  • perplexity() gets a new argument exp, which makes it possible to return the cross-entropy per word rather than the perplexity (its exponential).
  • perplexity.character() gets a new argument detailed, which makes it possible to return the cross-entropies and word lengths of individual sentences alongside the total perplexity of the input document. Closes #28.
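
The relationship between the two quantities mentioned above can be sketched in base R, without calling the package. The numbers below are made up for illustration, and the length-weighted averaging of sentence-level cross-entropies is an assumption about how a document-level value would be combined, not a description of the package's internals.

```r
# Hypothetical per-sentence values, invented for this illustration.
cross_entropy <- c(2.0, 3.0)  # cross-entropy per word of each sentence (natural log)
n_words <- c(4, 6)            # word lengths of the same sentences

# Assumption: the document-level cross-entropy per word is the
# length-weighted mean of the sentence-level cross-entropies.
doc_cross_entropy <- sum(cross_entropy * n_words) / sum(n_words)

# Perplexity is the exponential of the cross-entropy per word.
doc_perplexity <- exp(doc_cross_entropy)
```

Taking log(doc_perplexity) recovers doc_cross_entropy, which is the quantity the exp argument is described as exposing.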

Improvements

  • Minor documentation improvements.
  • Removed “Tools for…” at the beginning of package DESCRIPTION, as per CRAN’s request.
  • Simplified examples in ?kgram_freqs.
  • Removed dependency on external online sources in the vignette.

Overall Software Improvements

  • The package’s test suite has been greatly extended.
  • Improved error/warning conditions for wrong arguments.
  • Re-enabled compiler diagnostics as per CRAN policy (#19).

API Changes

New features

  • as_dictionary(NULL) now returns an empty dictionary.

Bug Fixes

  • Fixed bug causing .preprocess and .tknz_sent arguments to be ignored in process_sentences().
  • Fixed previously wrong defaults for max_lines and batch_size arguments in kgram_freqs.connection().
  • Added print method for class dictionary.
  • Fixed bug causing invalid results in dictionary() with batch processing and non-trivial size constraints on vocabulary size.

Other

  • Maintainer’s email updated.