Build dictionary from training corpus.
```r
sbo_dictionary(corpus, max_size = Inf, target = 1, .preprocess = identity, EOS = "")

dictionary(corpus, max_size = Inf, target = 1, .preprocess = identity, EOS = "")
```
| Argument | Description |
|---|---|
| `corpus` | a character vector. The training corpus from which to extract the dictionary. |
| `max_size` | a length one numeric. If less than `Inf`, only the `max_size` most frequent words are retained in the dictionary. |
| `target` | a length one numeric between 0 and 1. Target coverage fraction: words are added to the dictionary, in order of decreasing frequency, until this fraction of the corpus word tokens is covered. |
| `.preprocess` | a function for corpus preprocessing. Takes a character vector as input and returns a character vector. |
| `EOS` | a length one character vector listing all (single character) end-of-sentence tokens. |
A `sbo_dictionary` object.
The function `dictionary()` is an alias for `sbo_dictionary()`.
This function builds a dictionary using the most frequent words in a training corpus. Two pruning criteria can be applied:

1. Dictionary size, as implemented by the `max_size` argument.
2. Target coverage fraction, as implemented by the `target` argument.

If both criteria imply non-trivial cuts, the most restrictive criterion applies.
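As a sketch of how the two criteria interact (assuming the `sbo` package is installed and its bundled `twitter_train` corpus is available):

```r
library(sbo)  # assumed: the sbo package providing sbo_dictionary() and twitter_train

# Both criteria supplied: the dictionary stops growing at whichever
# limit is reached first (at most 1000 words, or 50% coverage,
# whichever cut is more restrictive).
dict <- sbo_dictionary(twitter_train, max_size = 1000, target = 0.5)
length(dict)  # never exceeds 1000
```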
The `.preprocess` argument allows the user to apply a custom transformation to the training corpus before word tokenization. The `EOS` argument allows the user to specify a set of characters to be identified as End-Of-Sentence tokens (and thus not part of words).
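A minimal sketch of custom preprocessing and sentence tokenization, assuming the `sbo` package is attached; the particular `.preprocess` function and `EOS` characters below are illustrative choices, not package defaults guaranteed here:

```r
library(sbo)  # assumed: the sbo package and its twitter_train corpus

# Lower-case the corpus before tokenization, and treat the listed
# punctuation characters as End-Of-Sentence tokens (illustrative values).
dict <- sbo_dictionary(twitter_train,
                       target = 0.75,
                       .preprocess = tolower,
                       EOS = ".?!:;")
```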
The returned object is a `sbo_dictionary` object: a character vector containing words sorted by decreasing corpus frequency. Furthermore, the object stores as attributes the original values of `.preprocess` and `EOS` (i.e. the function used in corpus preprocessing and the End-Of-Sentence characters for sentence tokenization).
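Since the dictionary is a character vector with extra attributes, the stored values can be inspected with base R tools; the attribute names used below are an assumption based on the argument names, not confirmed by this page:

```r
library(sbo)  # assumed: the sbo package and its twitter_train corpus

dict <- sbo_dictionary(twitter_train, target = 0.5, EOS = ".?!:;")

head(dict)          # most frequent words first
attributes(dict)    # lists all stored attributes and their names
# If the attributes are named after the arguments (assumption):
# attr(dict, "EOS") would return ".?!:;"
```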
Valerio Gherardi
```r
# \donttest{
# Extract dictionary from `twitter_train` corpus (all words)
dict <- sbo_dictionary(twitter_train)
# Extract dictionary from `twitter_train` corpus (top 1000 words)
dict <- sbo_dictionary(twitter_train, max_size = 1000)
# Extract dictionary from `twitter_train` corpus (coverage target = 50%)
dict <- sbo_dictionary(twitter_train, target = 0.5)
# }
```