Construct or coerce to and from a dictionary.

dictionary(object, ...)

# S3 method for kgram_freqs
dictionary(object, size = NULL, cov = NULL, thresh = NULL, ...)

# S3 method for character
dictionary(
  object,
  .preprocess = identity,
  size = NULL,
  cov = NULL,
  thresh = NULL,
  ...
)

# S3 method for connection
dictionary(
  object,
  .preprocess = identity,
  size = NULL,
  cov = NULL,
  thresh = NULL,
  max_lines = Inf,
  batch_size = max_lines,
  ...
)

as_dictionary(object)

# S3 method for kgrams_dictionary
as_dictionary(object)

# S3 method for character
as_dictionary(object)

# S3 method for kgrams_dictionary
as.character(x, ...)

Arguments

object

object from which to extract a dictionary, or to be coerced to dictionary.

...

further arguments passed to or from other methods.

size

either NULL or a length one positive integer. Predefined size of the required dictionary (the top size most frequent words are retained).

cov

either NULL or a length one numeric between 0 and 1. Predefined text coverage fraction of the dictionary (the most frequent words providing the required coverage are retained).

thresh

either NULL or a length one positive integer. Minimum word count threshold to include a word in the dictionary.

.preprocess

a function taking a character vector as input and returning a character vector as output. Optional preprocessing transformation applied to text before creating the dictionary.

max_lines

a length one positive integer or Inf. Maximum number of lines to be read from the connection. If Inf, keeps reading until the End-Of-File.

batch_size

a length one positive integer less than or equal to max_lines. Size of text batches when reading text from the connection.

x

a dictionary.

Value

A dictionary for dictionary() and as_dictionary(), a character vector for the as.character() method.

Details

These generic functions build dictionary objects, coerce other formats to dictionary, and coerce a dictionary to a character vector. Currently, the only non-trivial type coercible to dictionary is character, in which case each entry of the input vector is treated as a single word. Coercion from dictionary to character returns the words included in the dictionary as a regular character vector.

Dictionaries can be extracted from kgram_freqs objects, or built from text coming either directly from a character vector or a connection.

A single preprocessing transformation can be applied before the text is scanned for unique words. After preprocessing, anything delimited by one or more white space characters in the transformed text is counted as a word and may be added to the dictionary, subject to the additional constraints on dictionary membership.
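The word-splitting rule above can be approximated in base R. This is a minimal sketch for illustration only, not the package's actual implementation: the input text, the choice of tolower as a stand-in for .preprocess, and the variable names are all assumptions.

```r
# Illustrative sketch in base R; not how the package tokenizes internally.
text <- c("Much Ado  About Nothing", "much ado")

# Hypothetical preprocessing step, playing the role of `.preprocess`.
preprocess <- tolower

# Split on runs of one or more white space characters.
tokens <- unlist(strsplit(preprocess(text), "\\s+"))

# Drop empty strings, which appear when a line starts with white space.
tokens <- tokens[nzchar(tokens)]

# Count occurrences of each distinct word.
word_counts <- table(tokens)
unique_words <- names(word_counts)
```

Without the preprocessing step, "Much" and "much" would be counted as two distinct words, which is why a transformation such as lower-casing is often applied first.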

The possible constraints for including a word in the dictionary can be of three types: (i) fixed size of dictionary, implemented by the size argument; (ii) fixed text covering fraction, as specified by the cov argument; or (iii) minimum word count threshold, thresh argument. Only one of these constraints can be applied at a time, so that specifying more than one of size, cov or thresh results in an error.
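The three constraint types can be illustrated on a toy table of word counts. The sketch below uses base R only and made-up counts; it assumes that thresh is inclusive and that cov retains the smallest set of top words whose cumulative frequency reaches the requested fraction, which may differ in detail from the package's behavior.

```r
# Hypothetical word counts (total = 100), sorted by decreasing frequency.
counts <- c(the = 50L, of = 30L, and = 15L, zounds = 4L, prithee = 1L)
counts <- sort(counts, decreasing = TRUE)

# (i) size: keep the `size` most frequent words.
size <- 2L
by_size <- names(counts)[seq_len(min(size, length(counts)))]

# (ii) cov: keep the most frequent words until their cumulative share
# of all word occurrences reaches the requested fraction.
cov <- 0.9
coverage <- cumsum(counts) / sum(counts)
by_cov <- names(counts)[seq_len(which(coverage >= cov)[1])]

# (iii) thresh: keep words occurring at least `thresh` times
# (inclusive comparison is an assumption here).
thresh <- 4L
by_thresh <- names(counts)[counts >= thresh]
```

Here by_size holds two words, by_cov three, and by_thresh four, which shows that the three criteria generally select different dictionaries from the same counts.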

Author

Valerio Gherardi

Examples

# Building a dictionary from Shakespeare's "Much Ado About Nothing"

dict <- dictionary(much_ado)
length(dict)
#> [1] 3046
query(dict, "leonato") # TRUE
#> [1] TRUE
query(dict, c("thy", "thou")) # c(TRUE, TRUE)
#> [1] TRUE TRUE
query(dict, "smartphones") # FALSE
#> [1] FALSE

# Getting list of words as regular character vector
words <- as.character(dict)
head(words)
#> [1] "much"    "ado"     "about"   "nothing" ":"       "entire" 

# Building a dictionary from a list of words
dict <- as_dictionary(c("i", "the", "a"))