Extract k-gram frequency counts from a text or a connection.
kgram_freqs
probability(): compute word continuation and sentence probabilities using Maximum Likelihood estimates. See probability.
language_model(): build a k-gram language model using various probability smoothing techniques. See language_model.
kgram_freqs(object, ...)
# S3 method for class 'numeric'
kgram_freqs(
object,
.preprocess = identity,
.tknz_sent = identity,
dict = NULL,
...
)
# S3 method for class 'kgram_freqs'
kgram_freqs(object, ...)
# S3 method for class 'character'
kgram_freqs(
object,
N,
.preprocess = identity,
.tknz_sent = identity,
dict = NULL,
open_dict = is.null(dict),
verbose = FALSE,
...
)
# S3 method for class 'connection'
kgram_freqs(
object,
N,
.preprocess = identity,
.tknz_sent = identity,
dict = NULL,
open_dict = is.null(dict),
verbose = FALSE,
max_lines = Inf,
batch_size = max_lines,
...
)
process_sentences(
text,
freqs,
.preprocess = attr(freqs, ".preprocess"),
.tknz_sent = attr(freqs, ".tknz_sent"),
open_dict = TRUE,
in_place = TRUE,
verbose = FALSE,
...
)
# S3 method for class 'character'
process_sentences(
text,
freqs,
.preprocess = attr(freqs, ".preprocess"),
.tknz_sent = attr(freqs, ".tknz_sent"),
open_dict = TRUE,
in_place = TRUE,
verbose = FALSE,
...
)
# S3 method for class 'connection'
process_sentences(
text,
freqs,
.preprocess = attr(freqs, ".preprocess"),
.tknz_sent = attr(freqs, ".tknz_sent"),
open_dict = TRUE,
in_place = TRUE,
verbose = FALSE,
max_lines = Inf,
batch_size = max_lines,
...
)
any type allowed by the available methods. The type defines the behaviour of kgram_freqs() as a default constructor, a copy constructor or a constructor of a non-trivial object. See ‘Details’.
further arguments passed to or from other methods.
a function taking a character vector as input and returning a character vector as output. Optional preprocessing transformation applied to text before k-gram tokenization. See ‘Details’.
a function taking a character vector as input and returning a character vector as output. Optional sentence tokenization step applied to text after preprocessing and before k-gram tokenization. See ‘Details’.
anything coercible to class dictionary. Optional pre-specified word dictionary.
a length one integer. Maximum order of k-grams to be considered.
TRUE or FALSE. If TRUE, any new word encountered during processing that does not appear in the original dictionary is added to the dictionary. Otherwise, new words are replaced by an unknown word token. In kgram_freqs(), the default is TRUE if dict is not specified (i.e. is NULL) and FALSE otherwise; in process_sentences(), the default is TRUE.
Print current progress to the console.
a length one positive integer or Inf. Maximum number of lines to be read from the connection. If Inf, keeps reading until the End-Of-File.
a length one positive integer less than or equal to max_lines. Size of text batches when reading text from connection.
a character vector or a connection. Source of text from which k-gram frequencies are to be extracted.
a kgram_freqs object, to which new k-gram counts from text are to be added.
TRUE or FALSE. Should the initial kgram_freqs object be modified in place?
A kgram_freqs class object: k-gram frequency table storing k-gram counts from text. For process_sentences(), the updated kgram_freqs object is returned invisibly if in_place is TRUE, visibly otherwise.
The function kgram_freqs() is a generic constructor for objects of class kgram_freqs, i.e. k-gram frequency tables. The constructor from an integer returns an empty kgram_freqs object of fixed order, with an optional predefined dictionary (which can be empty) and .preprocess and .tknz_sent functions to be used as defaults in other kgram_freqs methods. The constructor from kgram_freqs returns a copy of an existing object, and it is provided because, in general, kgram_freqs objects have reference semantics, as discussed below.
The following discussion focuses on the process_sentences() generic, as well as on the character and connection methods of the constructor kgram_freqs(). These functions extract k-gram frequency counts from a text source, which may be either a character vector or a connection. The second option is useful if one wants to avoid loading the full text corpus into physical memory, and it allows text to be processed from different sources such as files, compressed files or URLs.
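As a minimal sketch of the connection method, the snippet below writes a small made-up corpus to a temporary file and reads it back in batches. The file contents and the names corpus_path and f_con are illustrative assumptions; kgram_freqs(), query() and the max_lines and batch_size arguments are those documented on this page.

```r
library(kgrams)

# Hypothetical toy corpus: one sentence per line, written to a temporary file
corpus_path <- tempfile(fileext = ".txt")
writeLines(c("a b b a a", "b a", "a a b"), corpus_path)

# Read at most two lines from the file connection, one line per batch,
# so that the whole file never has to be held in memory at once
f_con <- kgram_freqs(file(corpus_path), N = 2, max_lines = 2, batch_size = 1)

# Unigram counts collected from the lines that were actually read
counts <- query(f_con, c("a", "b"))
counts
```

With a real corpus one would typically leave max_lines at its default and choose batch_size according to the available memory.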
The returned object is of class kgram_freqs (a thin wrapper around the internal C++ class where all k-gram computations take place). kgram_freqs objects have methods for querying bare k-gram frequencies (see query) and maximum likelihood estimates of sentence probabilities or word continuation probabilities (see probability). More importantly, kgram_freqs objects are used to create language_model objects, which support various probability smoothing techniques.
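The sketch below illustrates these two uses on the toy text from the examples. It assumes that probability() accepts a kgram_freqs object as its model, that the `%|%` operator builds a word-given-context expression, and that "add_k" (with parameter k) is one of the smoothers accepted by language_model(); see the probability and language_model help pages for the authoritative signatures.

```r
library(kgrams)

f <- kgram_freqs("a b b a a", 3)

# Maximum likelihood estimate of the probability that "b" follows "a",
# computed directly from the frequency table
p_ml <- probability("b" %|% "a", f)

# The same continuation probability from a smoothed language model
# (add-k smoothing with k = 1 is an assumption chosen for illustration)
lm_addk <- language_model(f, smoother = "add_k", k = 1)
p_addk <- probability("b" %|% "a", lm_addk)

c(ml = p_ml, add_k = p_addk)
```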
The function kgram_freqs() is used to construct a new kgram_freqs object, initializing it with the k-gram counts from the text input, whereas process_sentences() is used to add k-gram counts from a new text to an existing kgram_freqs object, freqs. In this second case, the initial object freqs can either be modified in place (for in_place == TRUE, the default) or left untouched by working on a copy (in_place == FALSE); see the examples below.
The final object is returned invisibly when modifying in place, visibly in the second case. It is worth mentioning that modifying a kgram_freqs object freqs in place will also affect language_model objects created from freqs with language_model(), which will also be updated with the new information. If one wants to avoid this behaviour, one can make copies using either the kgram_freqs() copy constructor, or the in_place = FALSE argument.
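The reference semantics described above can be sketched as follows. The object names f_copy and f_new are illustrative; the counts in the comments follow from the outputs shown in the examples of this page.

```r
library(kgrams)

f <- kgram_freqs("a b b a a", 3)

# Independent copy made with the copy constructor, before any update
f_copy <- kgram_freqs(f)

# In-place update: 'f' itself gains one more occurrence of "b"
process_sentences("b", f)

# Update on a copy: 'f' is left as it is, 'f_new' holds the new counts
f_new <- process_sentences("b", f, in_place = FALSE)

query(f, "b")      # 3: updated in place
query(f_copy, "b") # 2: the earlier copy is unaffected
query(f_new, "b")  # 4: counts of 'f' plus the new sentence
```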
The dict argument allows one to provide an initial set of known words. Subsequently, one can either work with such a closed dictionary (open_dict == FALSE), or extend the dictionary with all new words encountered during k-gram processing (open_dict == TRUE).
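A minimal sketch of the two dictionary modes, assuming that a plain character vector is accepted as dict and that UNK() returns the unknown word token in the same way BOS() and EOS() return the sentence padding tokens. The object names f_closed and f_open are illustrative.

```r
library(kgrams)

# Closed dictionary: "c" is not a known word, so it is counted as unknown
f_closed <- kgram_freqs("a b c", 2, dict = c("a", "b"), open_dict = FALSE)
unk_count <- query(f_closed, UNK())

# Open dictionary: "c" is added to the dictionary and counted as itself
f_open <- kgram_freqs("a b c", 2, dict = c("a", "b"), open_dict = TRUE)
c_count <- query(f_open, "c")

unk_count
c_count
```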
The .preprocess and .tknz_sent functions are applied before k-gram counting takes place, and are in principle arbitrary transformations of the original text. After preprocessing and sentence tokenization, each line of the transformed input is presented to the k-gram counting algorithm as a separate sentence. These sentences are implicitly padded with N - 1 Begin-Of-Sentence (BOS) tokens at the start and one End-Of-Sentence (EOS) token at the end, as illustrated in the examples. For basic usage, this package offers the utilities preprocess and tknz_sent. Notice that, strictly speaking, there is some redundancy in these two arguments, as the processed input to the k-gram counting algorithm is .tknz_sent(.preprocess(text)).
They appear explicitly as separate arguments for two main reasons:
The presence of .tknz_sent is a reminder of the fact that sentences have to be explicitly separated into different entries of the processed input, in order for kgram_freqs() to append the correct Begin-Of-Sentence and End-Of-Sentence paddings to each sentence.
At prediction time (e.g. with probability), by default only .preprocess is applied when computing conditional probabilities, whereas both .preprocess() and .tknz_sent() are applied when computing absolute sentence probabilities.
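The preprocessing pipeline can be sketched with the package utilities preprocess and tknz_sent, called here with their default arguments. The input string and the names txt, sentences and f_txt are illustrative assumptions.

```r
library(kgrams)

txt <- "Hello world! How are you?"

# What the k-gram counting algorithm receives: one entry per sentence
sentences <- tknz_sent(preprocess(txt))
sentences

# The same transformations supplied to the constructor
f_txt <- kgram_freqs(txt, 2, .preprocess = preprocess, .tknz_sent = tknz_sent)

# Each sentence is padded separately, so sentence-initial and
# sentence-final bigrams can be queried with BOS() and EOS()
padded_counts <- query(f_txt, c("hello", BOS() %+% "how", "you" %+% EOS()))
padded_counts
```

Without .tknz_sent the whole string would be treated as a single sentence, so "how" would never be seen right after a BOS token.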
# Build a k-gram frequency table from a character vector
f <- kgram_freqs("a b b a a", 3)
f
#> A k-gram frequency table.
summary(f)
#> A k-gram frequency table.
#>
#> Parameters:
#> * N: 3
#> * V: 2
#>
#> Number of words in training corpus:
#> * W: 6
#>
#> Number of distinct k-grams with positive counts:
#> * 1-grams:4
#> * 2-grams:7
#> * 3-grams:6
query(f, c("a", "b")) # c(3, 2)
#> [1] 3 2
query(f, c("a b", "a" %+% EOS(), BOS() %+% "a b")) # c(1, 1, 1)
#> [1] 1 1 1
query(f, "a b b a") # NA (counts for k-grams of order k > 3 are not known)
#> [1] NA
process_sentences("b", f)
query(f, c("a", "b")) # c(3, 3): 'f' is updated in place
#> [1] 3 3
f1 <- process_sentences("b", f, in_place = FALSE)
query(f, c("a", "b")) # c(3, 3): 'f' is unchanged, the update was made on a copy
#> [1] 3 3
query(f1, c("a", "b")) # c(3, 4): the new 'f1' stores the updated counts
#> [1] 3 4
# Build a k-gram frequency table from a file connection
if (FALSE) { # \dontrun{
f <- kgram_freqs(file("my_text_file.txt"), 3)
} # }
# Build a k-gram frequency table from a URL connection
if (FALSE) { # \dontrun{
f <- kgram_freqs(url("http://my.website/my_text_file.txt"), 3)
} # }