Build dictionary from training corpus.

sbo_dictionary(
  corpus,
  max_size = Inf,
  target = 1,
  .preprocess = identity,
  EOS = ""
)

dictionary(
  corpus,
  max_size = Inf,
  target = 1,
  .preprocess = identity,
  EOS = ""
)

Arguments

corpus

a character vector. The training corpus from which to extract the dictionary.

max_size

a length one numeric. If less than Inf, only the most frequent max_size words are retained in the dictionary.

target

a length one numeric between 0 and 1. If less than one, retains only as many words as needed to cover a fraction target of the training corpus.

.preprocess

a function for corpus preprocessing. Takes a character vector as input and returns a character vector.

EOS

a length one character vector listing all (single character) end-of-sentence tokens.

Value

An sbo_dictionary object.

Details

The function dictionary() is an alias for sbo_dictionary().

This function builds a dictionary using the most frequent words in a training corpus. Two pruning criteria can be applied:

  1. Dictionary size, as implemented by the max_size argument.

  2. Target coverage fraction, as implemented by the target argument.

If both criteria imply non-trivial cuts, the most restrictive criterion applies, as illustrated in the sketch below.
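For example, the following sketch (using the twitter_train corpus from the Examples below) illustrates the rule: when both limits are supplied, the resulting dictionary is expected to be the smaller of the two.

dict_size <- sbo_dictionary(twitter_train, max_size = 1000)
dict_cov  <- sbo_dictionary(twitter_train, target = 0.5)
dict_both <- sbo_dictionary(twitter_train, max_size = 1000, target = 0.5)
# The most restrictive criterion wins: expected to be TRUE
length(dict_both) == min(length(dict_size), length(dict_cov))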

The .preprocess argument allows the user to apply a custom transformation to the training corpus before word tokenization. The EOS argument allows the user to specify a set of characters to be identified as End-Of-Sentence tokens (and thus not part of words).
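As an illustration, a user-supplied lower-casing step and an explicit set of sentence-ending characters might look as follows (a minimal sketch; the punctuation set shown is only an example, not a default asserted here):

# Lower-case the corpus before tokenization and treat .?!:; as
# end-of-sentence characters (illustrative choices)
dict <- sbo_dictionary(twitter_train,
                       max_size = 1000,
                       .preprocess = tolower,
                       EOS = ".?!:;")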

The returned object is an sbo_dictionary object: a character vector containing words sorted by decreasing corpus frequency. In addition, the object stores as attributes the original values of .preprocess and EOS (i.e. the function used in corpus preprocessing and the End-Of-Sentence characters used for sentence tokenization).
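For instance, one might inspect the returned object as follows (a minimal sketch; the attribute layout is as described above):

dict <- sbo_dictionary(twitter_train, max_size = 100)
head(dict)         # most frequent words come first
attributes(dict)   # include the stored preprocessing function and EOS characters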

Author

Valerio Gherardi

Examples

# \donttest{
# Extract dictionary from `twitter_train` corpus (all words)
dict <- sbo_dictionary(twitter_train)

# Extract dictionary from `twitter_train` corpus (top 1000 words)
dict <- sbo_dictionary(twitter_train, max_size = 1000)

# Extract dictionary from `twitter_train` corpus (coverage target = 50%)
dict <- sbo_dictionary(twitter_train, target = 0.5)
# }