Construct or coerce to and from a dictionary.
dictionary(object, ...)
# S3 method for class 'kgram_freqs'
dictionary(object, size = NULL, cov = NULL, thresh = NULL, ...)
# S3 method for class 'character'
dictionary(
  object,
  .preprocess = identity,
  size = NULL,
  cov = NULL,
  thresh = NULL,
  ...
)
# S3 method for class 'connection'
dictionary(
  object,
  .preprocess = identity,
  size = NULL,
  cov = NULL,
  thresh = NULL,
  max_lines = Inf,
  batch_size = max_lines,
  ...
)
as_dictionary(object)
# S3 method for class 'kgrams_dictionary'
as_dictionary(object)
# S3 method for class 'character'
as_dictionary(object)
# S3 method for class 'kgrams_dictionary'
as.character(x, ...)
object: object from which to extract a dictionary, or to be coerced to dictionary.
...: further arguments passed to or from other methods.
size: either NULL or a length one positive integer. Predefined size of the required dictionary (the top size most frequent words are retained).
cov: either NULL or a length one numeric between 0 and 1. Predefined text coverage fraction of the dictionary (the most frequent words providing the required coverage are retained).
thresh: either NULL or a length one positive integer. Minimum word count threshold to include a word in the dictionary.
.preprocess: a function taking a character vector as input and returning a character vector as output. Optional preprocessing transformation applied to text before creating the dictionary.
max_lines: a length one positive integer or Inf. Maximum number of lines to be read from the connection. If Inf, keeps reading until the End-Of-File.
batch_size: a length one positive integer less than or equal to max_lines. Size of text batches when reading text from the connection.
x: a dictionary.
A dictionary for dictionary() and as_dictionary(), a character vector for the as.character() method.
These generic functions are used to build dictionary objects, to coerce from other formats to dictionary, and to coerce from a dictionary to a character vector. Currently, the only non-trivial type coercible to dictionary is character, in which case each entry of the input vector is considered as a single word. Coercion from dictionary to character returns the list of words included in the dictionary as a regular character vector.
Dictionaries can be extracted from kgram_freqs objects, or built from text coming either directly from a character vector or a connection.
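To picture how the max_lines and batch_size arguments interact when text comes from a connection, here is a minimal base R sketch. The helper count_words_in_batches() is hypothetical and is not part of the package; it only assumes that lines are read in chunks of batch_size until either max_lines lines have been read or the End-Of-File is reached.

```r
# Hypothetical helper (not the package's implementation): counts
# whitespace-delimited words read from a connection in batches.
count_words_in_batches <- function(con, max_lines = Inf, batch_size = max_lines) {
  open(con, "r")
  on.exit(close(con))
  counts <- integer(0)
  lines_read <- 0
  while (lines_read < max_lines) {
    n <- min(batch_size, max_lines - lines_read)
    # A negative n tells readLines() to read up to the End-Of-File
    batch <- readLines(con, n = if (is.finite(n)) n else -1L)
    if (length(batch) == 0) break
    lines_read <- lines_read + length(batch)
    words <- unlist(strsplit(batch, "[[:space:]]+"))
    words <- words[nzchar(words)]
    for (w in words) {
      counts[w] <- if (is.na(counts[w])) 1L else counts[w] + 1L
    }
  }
  counts
}

# Usage: a three-line temporary file read two lines at a time
path <- tempfile(fileext = ".txt")
writeLines(c("the cat sat", "the dog sat", "the end"), path)
counts_all <- count_words_in_batches(file(path), batch_size = 2)
counts_two <- count_words_in_batches(file(path), max_lines = 2)
```

Only one batch of lines is held in memory at a time, which is the usual reason for reading large files in batches.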
A single preprocessing transformation can be applied before processing the text for unique words. After preprocessing, anything delimited by one or more white space characters in the transformed text input is counted as a word and may be added to the dictionary, subject to additional constraints.
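The word-splitting rule just described can be illustrated with a few lines of base R. This is a sketch of the rule only, not the package's tokenizer; the helper split_words() is a hypothetical name introduced for illustration.

```r
# Hypothetical helper: apply a preprocessing function, then treat anything
# delimited by one or more white space characters as a word.
split_words <- function(text, .preprocess = identity) {
  text <- .preprocess(text)
  words <- unlist(strsplit(text, "[[:space:]]+"))
  words[nzchar(words)]  # drop empty strings produced by leading white space
}

words <- split_words(c("Much Ado  about nothing", "Much more"), .preprocess = tolower)
unique(words)
```

Since only white space delimits words, punctuation that stands on its own after preprocessing counts as a word.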
The possible constraints for including a word in the dictionary can be of three types: (i) fixed size of dictionary, implemented by the size argument; (ii) fixed text covering fraction, as specified by the cov argument; or (iii) minimum word count threshold, set by the thresh argument. Only one of these constraints can be applied at a time, so that specifying more than one of size, cov or thresh results in an error.
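The three constraints can be made concrete with a base R sketch operating on a named vector of word counts. The helper select_words() is hypothetical and not part of the package. In particular, reading cov as "the smallest set of most frequent words whose cumulative share of the text reaches cov" is an assumption made for illustration.

```r
# Hypothetical helper: choose dictionary words from named word counts,
# applying at most one of the three constraints.
select_words <- function(counts, size = NULL, cov = NULL, thresh = NULL) {
  n_given <- sum(!is.null(size), !is.null(cov), !is.null(thresh))
  if (n_given > 1) stop("Specify at most one of 'size', 'cov' and 'thresh'.")
  counts <- sort(counts, decreasing = TRUE)
  if (!is.null(size)) {
    # (i) keep the top `size` most frequent words
    return(names(counts)[seq_len(min(size, length(counts)))])
  }
  if (!is.null(cov)) {
    # (ii) assumed meaning: shortest prefix of most frequent words whose
    # cumulative fraction of all word occurrences reaches `cov`
    covered <- cumsum(counts) / sum(counts)
    k <- which(covered >= cov)[1]
    if (is.na(k)) k <- length(counts)
    return(names(counts)[seq_len(k)])
  }
  if (!is.null(thresh)) {
    # (iii) keep words seen at least `thresh` times
    return(names(counts)[counts >= thresh])
  }
  names(counts)
}

counts <- c(the = 5, of = 3, cat = 1, dog = 1)
by_size <- select_words(counts, size = 1)
by_cov <- select_words(counts, cov = 0.75)
by_thresh <- select_words(counts, thresh = 3)
```

With these counts, "the" and "of" account for 8 of the 10 word occurrences, so a coverage of 0.75 is reached with those two words alone.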
# Building a dictionary from Shakespeare's "Much Ado About Nothing"
dict <- dictionary(much_ado)
length(dict)
#> [1] 3046
query(dict, "leonato") # TRUE
#> [1] TRUE
query(dict, c("thy", "thou")) # c(TRUE, TRUE)
#> [1] TRUE TRUE
query(dict, "smartphones") # FALSE
#> [1] FALSE
# Getting list of words as regular character vector
words <- as.character(dict)
head(words)
#> [1] "much" "ado" "about" "nothing" ":" "entire"
# Building a dictionary from a list of words
dict <- as_dictionary(c("i", "the", "a"))