Source: R/kgram_freqs.R, R/kgram_freqs_doc.R, R/kgram_freqs_fast.R

kgram_freqs.Rd
Get k-gram frequency tables from a training corpus.
kgram_freqs(corpus, N, dict, .preprocess = identity, EOS = "")

sbo_kgram_freqs(corpus, N, dict, .preprocess = identity, EOS = "")

kgram_freqs_fast(corpus, N, dict, erase = "", lower_case = FALSE, EOS = "")

sbo_kgram_freqs_fast(corpus, N, dict, erase = "", lower_case = FALSE, EOS = "")
corpus: a character vector. The training corpus from which to extract k-gram frequencies.

N: a length one integer. The maximum order of k-grams for which frequencies are to be extracted.

dict: either a sbo_dictionary object, a character vector containing the words of the dictionary, or a formula (see Details).

.preprocess: a function to apply before k-gram tokenization.

EOS: a length one character vector listing all (single character) end-of-sentence tokens.

erase: a length one character vector. A regular expression matching parts of the text to be erased from the input. The default erases anything which is not alphanumeric, white space, an apostrophe, or one of the punctuation characters ".?!:;".

lower_case: a length one logical vector. If TRUE, the text is converted to lower case.
An sbo_kgram_freqs object, containing the k-gram frequency tables for k = 1, 2, ..., N.
These functions extract all k-gram frequency tables from a text corpus, up to a specified k-gram order N. These tables are the building blocks for training any N-gram model. The functions sbo_kgram_freqs() and sbo_kgram_freqs_fast() are aliases for kgram_freqs() and kgram_freqs_fast(), respectively.

The optimized version kgram_freqs_fast(erase = x, lower_case = y) is equivalent to kgram_freqs(.preprocess = preprocess(erase = x, lower_case = y)), but more efficient, both in terms of speed and of memory usage.
Both kgram_freqs() and kgram_freqs_fast() employ a fixed (user specified) dictionary: any out-of-vocabulary word is effectively replaced by an "unknown word" token. The dictionary is specified through the argument dict, which accepts three types of values: a sbo_dictionary object, a character vector (containing the words of the dictionary), or a formula. In the last case, valid formulas are either max_size ~ V or target ~ f, where V is a dictionary size and f a target word coverage fraction of corpus, respectively. This usage of the dict argument allows the model dictionary to be built 'on the fly'.
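For illustration, the three accepted forms of the dict argument might be used as follows. This is a sketch, assuming the sbo package is loaded and that the example objects twitter_train and twitter_dict (used in the Examples below) are available:

```r
library(sbo)

# 1. A character vector: the dictionary words themselves
freqs_chr <- kgram_freqs(twitter_train, N = 2, dict = twitter_dict,
                         EOS = ".?!:;")

# 2. A formula fixing the dictionary size: keep the 100 most frequent words
freqs_size <- kgram_freqs(twitter_train, N = 2, dict = max_size ~ 100,
                          EOS = ".?!:;")

# 3. A formula fixing the coverage: smallest dictionary covering 75%
#    of the word occurrences in the corpus
freqs_cov <- kgram_freqs(twitter_train, N = 2, dict = target ~ 0.75,
                         EOS = ".?!:;")
```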
The return value is a "sbo_kgram_freqs" object, i.e. a list of N tibbles, storing frequency counts for each k-gram observed in the training corpus, for k = 1, 2, ..., N. In these tables, words are represented by integer codes corresponding to their position in the reference dictionary. The special codes 0, length(dictionary) + 1 and length(dictionary) + 2 correspond to the "Begin-Of-Sentence", "End-Of-Sentence" and "Unknown word" tokens, respectively.
Furthermore, the returned object has the following attributes:

N: the highest order of k-grams.

dict: the reference dictionary, sorted by word frequency.

.preprocess: the function used for text preprocessing.

EOS: a length one character vector listing all (single character) end-of-sentence tokens employed in k-gram tokenization.
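These attributes can be inspected with base R's attr(). A minimal sketch, assuming freqs was created as in the Examples below:

```r
freqs <- kgram_freqs(twitter_train, N = 3, dict = max_size ~ 1000,
                     EOS = ".?!:;")

attr(freqs, "N")            # 3, the highest k-gram order
head(attr(freqs, "dict"))   # most frequent words of the reference dictionary
attr(freqs, ".preprocess")  # the preprocessing function (here, identity)
attr(freqs, "EOS")          # ".?!:;"
```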
The .preprocess argument of kgram_freqs() allows the user to apply a custom transformation to the training corpus before k-gram tokenization takes place.

The algorithm for k-gram tokenization considers anything separated by (any number of) white spaces (i.e. " ") as a single word. Sentences are split according to the end-of-sentence (single character) tokens specified by the EOS argument. Additionally, text belonging to different entries of the preprocessed input vector is understood to belong to different sentences.
Nota Bene: it is useful to keep in mind that the function passed through the .preprocess argument also captures its enclosing environment, which by default is the environment in which the function was defined. If, for instance, .preprocess was defined in the global environment, and the global environment binds heavy objects, the resulting sbo_kgram_freqs object will contain bindings to the same objects. If the sbo_kgram_freqs object is stored out of memory and recalled in another R session, these objects will also be reloaded in memory. For this reason, for non-interactive use, it is advisable to avoid preprocessing functions defined in the global environment (for instance, base::identity is preferred to function(x) x).
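The point can be illustrated with plain base R, independently of sbo: serializing a closure also serializes its enclosing environment, heavy bindings included.

```r
# A closure defined inside a function captures that function's
# environment, including any heavy objects bound there.
make_preprocess <- function() {
  big <- rnorm(1e6)        # ~8 MB object in the enclosing environment
  function(x) tolower(x)   # the returned closure captures `big` with it
}

light <- tolower            # a base function: no heavy enclosing environment
heavy <- make_preprocess()

# The serialized size of `heavy` includes the captured numeric vector:
length(serialize(light, NULL))  # a few hundred bytes
length(serialize(heavy, NULL))  # several megabytes
```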
Valerio Gherardi
# \donttest{
# Obtain k-gram frequency tables from corpus

## Get k-gram frequencies, for k <= N = 3.
## The dictionary is built on the fly, using the most frequent 1000 words.
freqs <- kgram_freqs(corpus = twitter_train, N = 3, dict = max_size ~ 1000,
                     .preprocess = preprocess, EOS = ".?!:;")
freqs
#> A k-gram frequency table.
#>
#> See summary() for more details; ?predict.sbo_kgram_freqs for usage help.

## Using a predefined dictionary
freqs <- kgram_freqs_fast(twitter_train, N = 3, dict = twitter_dict,
                          erase = "[^.?!:;'\\w\\s]", lower_case = TRUE,
                          EOS = ".?!:;")
freqs
#> A k-gram frequency table.
#>
#> See summary() for more details; ?predict.sbo_kgram_freqs for usage help.

## 2-grams, no preprocessing, use a dictionary covering 50% of corpus
freqs <- kgram_freqs(corpus = twitter_train, N = 2, dict = target ~ 0.5,
                     EOS = ".?!:;")
freqs
#> A k-gram frequency table.
#>
#> See summary() for more details; ?predict.sbo_kgram_freqs for usage help.
# }

# \donttest{
# Obtain k-gram frequency table from corpus
freqs <- kgram_freqs_fast(twitter_train, N = 3, dict = twitter_dict)

## Print result
freqs
#> A k-gram frequency table.
#>
#> See summary() for more details; ?predict.sbo_kgram_freqs for usage help.
# }