Source: R/sbo_predictions.R, R/sbo_predictor.R, R/sbo_predtable.R
Train a text predictor via Stupid Back-off
Usage

sbo_predictor(object, ...)

predictor(object, ...)

# S3 method for character
sbo_predictor(
  object,
  N,
  dict,
  .preprocess = identity,
  EOS = "",
  lambda = 0.4,
  L = 3L,
  filtered = "<UNK>",
  ...
)

# S3 method for sbo_kgram_freqs
sbo_predictor(object, lambda = 0.4, L = 3L, filtered = "<UNK>", ...)

# S3 method for sbo_predtable
sbo_predictor(object, ...)

sbo_predtable(object, lambda = 0.4, L = 3L, filtered = "<UNK>", ...)

predtable(object, lambda = 0.4, L = 3L, filtered = "<UNK>", ...)

# S3 method for character
sbo_predtable(
  object,
  lambda = 0.4,
  L = 3L,
  filtered = "<UNK>",
  N,
  dict,
  .preprocess = identity,
  EOS = "",
  ...
)

# S3 method for sbo_kgram_freqs
sbo_predtable(object, lambda = 0.4, L = 3L, filtered = "<UNK>", ...)
Arguments

object
    either a character vector or an object inheriting from one of the
    classes sbo_kgram_freqs or sbo_predtable (see the Usage section).
    The class of object determines which training method is dispatched.

...
    further arguments passed to or from other methods.

N
    a length one integer. Order 'N' of the N-gram model.

dict
    the model dictionary, or a specification from which it can be built
    (in the examples below, a formula of the form max_size ~ 1000 is
    used).

.preprocess
    a function for corpus preprocessing. For more details, see
    sbo_kgram_freqs.

EOS
    a length one character vector. String listing End-Of-Sentence
    characters. For more details, see sbo_kgram_freqs.

lambda
    a length one numeric. Penalization in the Stupid Back-Off algorithm.

L
    a length one integer. Maximum number of next-word predictions for a
    given input (top scoring predictions are retained).

filtered
    a character vector. Words to exclude from next-word predictions. The
    strings '<UNK>' and '<EOS>' are reserved keywords referring to the
    Unknown-Word and End-Of-Sentence tokens, respectively.
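As a quick illustration of these arguments, the call below is a sketch
(using the twitter_train corpus and preprocess function from the examples
at the bottom of this page) of a predictor that excludes both reserved
tokens from its predictions:

p <- sbo_predictor(twitter_train, N = 3, dict = max_size ~ 1000,
                   .preprocess = preprocess, EOS = ".?!:;",
                   lambda = 0.4, L = 3L,
                   filtered = c("<UNK>", "<EOS>")) # drop unknown-word and
                                                   # end-of-sentence tokens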
Value

A sbo_predictor object for sbo_predictor(), a sbo_predtable object for
sbo_predtable().
Details

These functions are generics used to train a text predictor with Stupid
Back-Off. The functions predictor() and predtable() are aliases for
sbo_predictor() and sbo_predtable(), respectively.
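For instance, assuming the twitter_freqs object used in the examples below,
the following two calls are equivalent:

p1 <- sbo_predictor(twitter_freqs) # generic constructor
p2 <- predictor(twitter_freqs)     # alias, same result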
The sbo_predictor data structure carries all the information required for
prediction in a compact form that is efficient to query, by directly
storing the top L next-word predictions for each k-gram prefix observed in
the training corpus.
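For example, once a predictor has been trained (a sketch, assuming the
twitter_predtable object from the examples below), the stored predictions
are retrieved with predict():

p <- sbo_predictor(twitter_predtable)
predict(p, "i love") # returns the top L (here 3) candidates,
                     # e.g. "you" "it" "my"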
The sbo_predictor objects are meant for interactive use. If the training
process is computationally heavy, one can instead store a "raw" version of
the text predictor in a sbo_predtable class object, which can be safely
saved out of memory (e.g. with save()). The resulting object can be
restored in another R session, and the corresponding sbo_predictor object
can then be rebuilt rapidly by passing it again to the generic constructor
sbo_predictor() (see the example below).
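Concretely, a two-session workflow might look like the following sketch
(the file name is illustrative; twitter_train and preprocess are those used
in the examples below):

## First R session: train once and save the raw prediction tables
t <- sbo_predtable(twitter_train, N = 3, dict = max_size ~ 1000,
                   .preprocess = preprocess, EOS = ".?!:;")
save(t, file = "predtable.rda")

## Later R session: restore the tables and rebuild the predictor quickly
load("predtable.rda")
p <- sbo_predictor(t)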
The returned objects are a sbo_predictor and a sbo_predtable object,
respectively. The latter contains the Stupid Back-Off prediction tables,
storing the next-word predictions for each k-gram prefix observed in the
text, whereas the former is an external pointer to an equivalent (but
processed) C++ structure. Both objects have the following attributes:
N: The order of the underlying N-gram model.

dict: The model dictionary.

lambda: The penalization used in the Stupid Back-Off algorithm.

L: The maximum number of next-word predictions for a given text input.

.preprocess: The function used for text preprocessing.

EOS: A length one character vector listing all (single character)
end-of-sentence tokens.
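These attributes can be inspected with base R, e.g. (a sketch, assuming a
predictor p trained as in the examples below):

attr(p, "N")      # order of the N-gram model
attr(p, "L")      # maximum number of predictions per input
attr(p, "lambda") # Stupid Back-Off penalization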
Author

Valerio Gherardi
Examples

# \donttest{
# Train a text predictor directly from corpus
p <- sbo_predictor(twitter_train, N = 3, dict = max_size ~ 1000,
                   .preprocess = preprocess, EOS = ".?!:;")
# }

# \donttest{
# Train a text predictor from previously computed 'kgram_freqs' object
p <- sbo_predictor(twitter_freqs)
# }

# \donttest{
# Load a text predictor from a Stupid Back-Off prediction table
p <- sbo_predictor(twitter_predtable)
# }

# \donttest{
# Predict from Stupid Back-Off text predictor
p <- sbo_predictor(twitter_predtable)
predict(p, "i love")
#> [1] "you" "it"  "my"
# }

# \donttest{
# Build Stupid Back-Off prediction tables directly from corpus
t <- sbo_predtable(twitter_train, N = 3, dict = max_size ~ 1000,
                   .preprocess = preprocess, EOS = ".?!:;")
# }

# \donttest{
# Build Stupid Back-Off prediction tables from kgram_freqs object
t <- sbo_predtable(twitter_freqs)
# }

if (FALSE) {
# Save and reload a 'sbo_predtable' object with base::save()
save(t, file = "t.rda")
load("t.rda")
}