Get k-gram frequency tables from a training corpus.

kgram_freqs(corpus, N, dict, .preprocess = identity, EOS = "")

sbo_kgram_freqs(corpus, N, dict, .preprocess = identity, EOS = "")

kgram_freqs_fast(corpus, N, dict, erase = "[^.?!:;'\\w\\s]", lower_case = TRUE, EOS = "")

sbo_kgram_freqs_fast(corpus, N, dict, erase = "[^.?!:;'\\w\\s]", lower_case = TRUE, EOS = "")

Arguments

corpus

a character vector. The training corpus from which to extract k-gram frequencies.

N

a length one integer. The maximum order of k-grams for which frequencies are to be extracted.

dict

either a sbo_dictionary object, a character vector, or a formula (see details). The language model dictionary.

.preprocess

a function to apply before k-gram tokenization.

EOS

a length one character vector listing all (single character) end-of-sentence tokens.

erase

a length one character vector. A regular expression matching the parts of text to be erased from the input. The default removes anything which is not alphanumeric, white space, an apostrophe, or one of the punctuation characters ".?!:;".

lower_case

a length one logical vector. If TRUE, converts the input text to lower case.

Value

A sbo_kgram_freqs object, containing the k-gram frequency tables for k = 1, 2, ..., N.

Details

These functions extract all k-gram frequency tables from a text corpus up to a specified k-gram order N. These are the building blocks to train any N-gram model. The functions sbo_kgram_freqs() and sbo_kgram_freqs_fast() are aliases for kgram_freqs() and kgram_freqs_fast(), respectively.

The optimized version kgram_freqs_fast(erase = x, lower_case = y) is equivalent to kgram_freqs(.preprocess = preprocess(erase = x, lower_case = y)), but more efficient in terms of both speed and memory usage.
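As a concrete sketch of this equivalence (assuming the sbo package's built-in twitter_train corpus and twitter_dict dictionary, also used in the examples below):

```r
library(sbo)
## These two calls should yield the same frequency tables, with
## kgram_freqs_fast() handling the preprocessing more efficiently:
f1 <- kgram_freqs_fast(twitter_train, N = 2, dict = twitter_dict,
                       erase = "[^.?!:;'\\w\\s]", lower_case = TRUE)
f2 <- kgram_freqs(twitter_train, N = 2, dict = twitter_dict,
                  .preprocess = function(x)
                    preprocess(x, erase = "[^.?!:;'\\w\\s]", lower_case = TRUE))
```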

Both kgram_freqs() and kgram_freqs_fast() employ a fixed (user specified) dictionary: any out-of-vocabulary word gets effectively replaced by an "unknown word" token. This is specified through the argument dict, which accepts three types of arguments: a sbo_dictionary object, a character vector (containing the words of the dictionary) or a formula. In this last case, valid formulas can be either max_size ~ V or target ~ f, where V and f represent a dictionary size and a target corpus word coverage fraction, respectively. This usage of the dict argument allows the model dictionary to be built 'on the fly'.
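For concreteness, the three accepted forms of dict can be sketched as follows (again assuming the package's twitter_train corpus and twitter_dict dictionary):

```r
library(sbo)
## 1. An explicit dictionary: a character vector (or sbo_dictionary object)
f1 <- kgram_freqs(twitter_train, N = 2, dict = twitter_dict)
## 2. A formula fixing the dictionary size: the 1000 most frequent words
f2 <- kgram_freqs(twitter_train, N = 2, dict = max_size ~ 1000)
## 3. A formula targeting a 50% word coverage of the corpus
f3 <- kgram_freqs(twitter_train, N = 2, dict = target ~ 0.5)
```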

The return value is a "sbo_kgram_freqs" object, i.e. a list of N tibbles, storing frequency counts for each k-gram observed in the training corpus, for k = 1, 2, ..., N. In these tables, words are represented by integer numbers corresponding to their position in the reference dictionary. The special codes 0, length(dictionary)+1 and length(dictionary)+2 correspond to the "Begin-Of-Sentence", "End-Of-Sentence" and "Unknown word" tokens, respectively.
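As a hedged illustration of this integer coding (the helper below is hypothetical, not part of the package API), with a dictionary of size V:

```r
## Map integer word codes back to tokens, given the reference dictionary.
## Codes 0, V + 1 and V + 2 stand for the Begin-Of-Sentence,
## End-Of-Sentence and Unknown word tokens, respectively.
decode <- function(codes, dict) {
  V <- length(dict)
  vapply(codes, function(w) {
    if (w == 0) "<BOS>"
    else if (w == V + 1) "<EOS>"
    else if (w == V + 2) "<UNK>"
    else dict[[w]]
  }, character(1))
}
decode(c(0, 1, 4, 5), dict = c("the", "a", "dog"))
## "<BOS>" "the" "<EOS>" "<UNK>"
```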

Furthermore, the returned object has the following attributes:

  • N: The highest order of N-grams.

  • dict: The reference dictionary, sorted by word frequency.

  • .preprocess: The function used for text preprocessing.

  • EOS: A length one character vector listing all (single character) end-of-sentence tokens employed in k-gram tokenization.

The .preprocess argument of kgram_freqs() allows the user to apply a custom transformation to the training corpus, before k-gram tokenization takes place.

The algorithm for k-gram tokenization considers anything separated by (any number of) white spaces (i.e. " ") as a single word. Sentences are split according to the end-of-sentence (single character) tokens specified by the EOS argument. Additionally, text belonging to different entries of the preprocessed input vector is understood to belong to different sentences.
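A minimal base-R sketch of these tokenization rules (a simplified illustration, not the package's internal implementation):

```r
## Split each input entry into sentences at EOS characters, then split
## sentences into words at white space; distinct input entries never
## share a sentence.
tokenize <- function(input, EOS = ".?!:;") {
  sentences <- unlist(strsplit(input, sprintf("[%s]+", EOS)))
  lapply(trimws(sentences), function(s) strsplit(s, "\\s+")[[1]])
}
tokenize(c("Hello world! How are you?", "Fine."))
## [[1]] "Hello" "world"   [[2]] "How" "are" "you"   [[3]] "Fine"
```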

Nota Bene: It is useful to keep in mind that the function passed through the .preprocess argument also captures its enclosing environment, which is by default the environment in which it was defined. If, for instance, .preprocess was defined in the global environment, and the latter binds heavy objects, the resulting sbo_kgram_freqs object will contain bindings to the same objects. If the sbo_kgram_freqs object is stored out of memory and recalled in another R session, these objects will also be reloaded in memory. For this reason, in non-interactive use, it is advisable to avoid preprocessing functions defined in the global environment (for instance, base::identity is preferred to function(x) x).
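The base-R sketch below illustrates the underlying mechanism: a closure drags its enclosing environment, and any heavy objects bound there, along when serialized.

```r
make_preprocess <- function() {
  big <- rnorm(1e6)        # heavy object bound in the enclosing environment
  function(x) tolower(x)   # the returned closure keeps a reference to `big`
}
pre <- make_preprocess()
length(serialize(pre, NULL))   # several megabytes: `big` is serialized along
length(serialize(function(x) tolower(x), NULL))  # small: the global
                                                 # environment is only a
                                                 # reference, not serialized
```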

Author

Valerio Gherardi

Examples

# \donttest{
# Obtain k-gram frequency table from corpus

## Get k-gram frequencies, for k <= N = 3.
## The dictionary is built on the fly, using the most frequent 1000 words.
freqs <- kgram_freqs(corpus = twitter_train, N = 3, dict = max_size ~ 1000,
                     .preprocess = preprocess, EOS = ".?!:;")
freqs
#> A k-gram frequency table.
#>
#> See summary() for more details; ?predict.sbo_kgram_freqs for usage help.
#>

## Using a predefined dictionary
freqs <- kgram_freqs_fast(twitter_train, N = 3, dict = twitter_dict,
                          erase = "[^.?!:;'\\w\\s]", lower_case = TRUE,
                          EOS = ".?!:;")
freqs
#> A k-gram frequency table.
#>
#> See summary() for more details; ?predict.sbo_kgram_freqs for usage help.
#>

## 2-grams, no preprocessing, use a dictionary covering 50% of corpus
freqs <- kgram_freqs(corpus = twitter_train, N = 2, dict = target ~ 0.5,
                     EOS = ".?!:;")
freqs
#> A k-gram frequency table.
#>
#> See summary() for more details; ?predict.sbo_kgram_freqs for usage help.
#>
# }

# \donttest{
# Obtain k-gram frequency table from corpus
freqs <- kgram_freqs_fast(twitter_train, N = 3, dict = twitter_dict)

## Print result
freqs
#> A k-gram frequency table.
#>
#> See summary() for more details; ?predict.sbo_kgram_freqs for usage help.
#>
# }