Return the frequency count of k-grams in a k-gram frequency table, or whether words are contained in a dictionary.
query(object, x)
# S3 method for class 'kgram_freqs'
query(object, x)
# S3 method for class 'kgrams_dictionary'
query(object, x)
An integer vector containing the k-gram counts of x if
object is a kgram_freqs class object; a logical vector if
object is a dictionary. Vectorized over x.
This generic behaves slightly differently when querying a dictionary
for the presence of words and when querying a frequency table
for k-gram counts.
For words, query() looks for exact matches between the input and the
dictionary entries. Queries of the Begin-Of-Sentence (BOS()) and
End-Of-Sentence (EOS()) tokens always return TRUE, and queries
of the Unknown-Word token (UNK()) always return FALSE
(see special_tokens).
On the other hand, queries of k-gram counts first perform a word-level
tokenization, so that anything separated by one or more space characters
in the input is considered as a single word (thus, for instance, queries of
strings such as "i love you", " i love you", or
"i love you " all produce the same outcome). Moreover,
querying for any word outside the underlying dictionary returns the counts
corresponding to the Unknown-Word token (UNK()); e.g., if
the word "prcsrn" is outside the dictionary, querying
"i love prcsrn" is the same as querying
paste("i love", UNK()). Queries of k-grams of order k > N,
where N is the maximum order of the frequency table, return NA.
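The tokenization and order rules described above can be checked with a minimal sketch like the following. It assumes the kgrams package is installed and attached; the toy text and the variable names f, padded and too_long are illustrative only.

```r
library(kgrams)

# Toy bigram frequency table (N = 2), built from the same text used in the
# examples of this page
f <- kgram_freqs("a a b a b b a b", N = 2)

# Leading, trailing and repeated spaces should not affect the result,
# since queries are tokenized on one or more space characters
padded <- query(f, c("a b", " a b", "a b ", "a   b"))
length(unique(padded)) == 1

# A 3-gram query exceeds the table order N = 2, so NA is expected here
too_long <- query(f, "a b a")
is.na(too_long)
```

Comparing the results with each other, rather than against hard-coded counts, keeps the sketch valid for any input text.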
A subsetting equivalent of query(object, x), with syntax object[x], is available
(see the examples). The query of the empty string "" returns the
total count of words, including the EOS and UNK tokens, but not
the BOS token.
See also the examples below.
# Querying a k-gram frequency table
f <- kgram_freqs("a a b a b b a b", N = 2)
query(f, c("a", "b")) # query single words
#> [1] 4 4
query(f, c("a b")) # query a 2-gram
#> [1] 3
identical(query(f, "c"), query(f, "d")) # TRUE, both "c" and "d" are <UNK>
#> [1] TRUE
identical(query(f, UNK()), query(f, "c")) # TRUE
#> [1] TRUE
query(f, EOS()) # 1, since text is a single sentence
#> [1] 1
f[c("b b", "b")] # query with subsetting syntax
#> [1] 1 4
f[""] # 9 (includes the EOS token)
#> [1] 9
# Querying a dictionary
d <- as_dictionary(c("a", "b"))
query(d, c("a", "b", "c")) # query some words
#> [1] TRUE TRUE FALSE
query(d, c(BOS(), EOS(), UNK())) # c(TRUE, TRUE, FALSE)
#> [1] TRUE TRUE FALSE
d["a"] # query with subsetting syntax
#> [1] TRUE