Return the frequency count of k-grams in a k-gram frequency table, or whether words are contained in a dictionary.
query(object, x)
# S3 method for class 'kgram_freqs'
query(object, x)
# S3 method for class 'kgrams_dictionary'
query(object, x)
An integer vector containing the k-gram counts of x if object is a kgram_freqs class object, or a logical vector if object is a dictionary. Vectorized over x.
This generic has slightly different behaviors when querying
for the presence of words in a dictionary and for k-gram counts
in a frequency table respectively.
For words, query() looks for exact matches between the input and the dictionary entries. Queries of the Begin-Of-Sentence (BOS()) and End-Of-Sentence (EOS()) tokens always return TRUE, and queries of the Unknown-Word token return FALSE (see special_tokens).
On the other hand, queries of k-gram counts first perform a word-level tokenization, so that anything separated by one or more space characters in the input is considered a single word (thus, for instance, queries of strings such as "i love you", " i love you", or "i love you " all produce the same outcome). Moreover, querying for any word outside the underlying dictionary returns the counts corresponding to the Unknown-Word token (UNK()); e.g., if the word "prcsrn" is outside the dictionary, querying "i love prcsrn" is the same as querying paste("i love", UNK()). Queries of k-grams of order k > N, where N is the order of the frequency table, return NA.
A subsetting equivalent of query(object, x), with syntax object[x], is available (see the examples). The query of the empty string "" returns the total count of words, including the EOS and UNK tokens, but not the BOS token.
See also the examples below.
# Querying a k-gram frequency table
f <- kgram_freqs("a a b a b b a b", N = 2)
query(f, c("a", "b")) # query single words
#> [1] 4 4
query(f, c("a b")) # query a 2-gram
#> [1] 3
identical(query(f, "c"), query(f, "d")) # TRUE, both "c" and "d" are <UNK>
#> [1] TRUE
identical(query(f, UNK()), query(f, "c")) # TRUE
#> [1] TRUE
query(f, EOS()) # 1, since text is a single sentence
#> [1] 1
f[c("b b", "b")] # query with subsetting syntax
#> [1] 1 4
f[""] # 9 (includes the EOS token)
#> [1] 9
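# A further sketch of the tokenization rules described above. It assumes the
# kgrams package is installed; the object name `f2` and the word "zzz" are
# illustrative only and are not part of the original examples.
library(kgrams)
f2 <- kgram_freqs("a a b a b b a b", N = 2)
# Leading, trailing and repeated spaces should not change the result
identical(query(f2, "a b"), query(f2, "  a   b "))
# An out-of-dictionary word should be counted as the Unknown-Word token
identical(query(f2, "a zzz"), query(f2, paste("a", UNK())))
# A 3-gram exceeds the order N = 2 of this table, so NA is expected
is.na(query(f2, "a b a"))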
# Querying a dictionary
d <- as_dictionary(c("a", "b"))
query(d, c("a", "b", "c")) # query some words
#> [1] TRUE TRUE FALSE
query(d, c(BOS(), EOS(), UNK())) # c(TRUE, TRUE, FALSE)
#> [1] TRUE TRUE FALSE
d["a"] # query with subsetting syntax
#> [1] TRUE