Query k-gram frequency tables or dictionaries

Return the frequency count of k-grams in a k-gram frequency table, or whether words are contained in a dictionary.

query(object, x)

# S3 method for class 'kgram_freqs'
query(object, x)

# S3 method for class 'kgrams_dictionary'
query(object, x)

Arguments

object: a kgram_freqs or dictionary class object.
x: a character vector. A list of k-grams if object is of class kgram_freqs, a list of words if object is a dictionary.

Value

an integer vector containing the k-gram counts of x if object is a kgram_freqs class object; a logical vector if object is a dictionary. Vectorized over x.

Details

This generic behaves slightly differently when querying for the presence of words in a dictionary and when querying for k-gram counts in a frequency table. For words, query() looks for exact matches between the input and the dictionary entries. Queries of the Begin-Of-Sentence (BOS()) and End-Of-Sentence (EOS()) tokens always return TRUE, and queries of the Unknown-Word token (UNK()) always return FALSE (see special_tokens).

On the other hand, queries of k-gram counts first perform a word-level tokenization, so that anything separated by one or more space characters in the input is considered a single word (thus, for instance, queries of the strings "i love you", " i love you", or "i love you " all produce the same outcome). Moreover, querying for any word outside the underlying dictionary returns the counts corresponding to the Unknown-Word token (UNK()) (e.g., if the word "prcsrn" is outside the dictionary, querying "i love prcsrn" is the same as querying paste("i love", UNK())). Queries of k-grams of order k > N return NA, where N is the order of the frequency table.
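The preprocessing described above can be pictured with a small base-R sketch. This is only an illustration of the documented behavior, not the package's implementation: the helper normalize_query() and the "<UNK>" placeholder string are hypothetical, and the sketch does not treat the BOS and EOS tokens specially.

```r
# Illustrative only: NOT the kgrams implementation.
# `normalize_query` and the "<UNK>" placeholder are hypothetical names.
normalize_query <- function(x, dict, unk = "<UNK>") {
  vapply(x, function(s) {
    words <- strsplit(s, " +")[[1]]     # split on runs of one or more spaces
    words <- words[nzchar(words)]       # drop the empty piece left by leading spaces
    words[!(words %in% dict)] <- unk    # out-of-dictionary words become `unk`
    paste(words, collapse = " ")        # rebuild a canonical k-gram string
  }, character(1), USE.NAMES = FALSE)
}

dict <- c("i", "love", "you")

# Leading, trailing, and repeated spaces all collapse to the same k-gram
same <- normalize_query(c("i love you", " i love you", "i  love you "), dict)
all_equal <- length(unique(same)) == 1

# A word outside the dictionary is replaced by the unknown-word placeholder
unk_query <- normalize_query("i love prcsrn", dict)
unk_equal <- identical(unk_query, paste("i love", "<UNK>"))
```

Under these assumptions, both all_equal and unk_equal evaluate to TRUE, mirroring the equivalences stated in the paragraph above.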

A subsetting equivalent of query(), with syntax object[x], is available (see the examples). Querying the empty string "" returns the total count of words, including the EOS and UNK tokens, but not the BOS token.

Author

Valerio Gherardi

Examples

# Querying a k-gram frequency table
f <- kgram_freqs("a a b a b b a b", N = 2)
query(f, c("a", "b")) # query single words
#> [1] 4 4
query(f, c("a b")) # query a 2-gram
#> [1] 3
identical(query(f, "c"), query(f, "d"))  # TRUE, both "c" and "d" are <UNK>
#> [1] TRUE
identical(query(f, UNK()), query(f, "c")) # TRUE
#> [1] TRUE
query(f, EOS()) # 1, since text is a single sentence
#> [1] 1
f[c("b b", "b")] # query with subsetting syntax
#> [1] 1 4
f[""] # 9 (includes the EOS token)
#> [1] 9

# Querying a dictionary
d <- as_dictionary(c("a", "b"))
query(d, c("a", "b", "c")) # query some words
#> [1]  TRUE  TRUE FALSE
query(d, c(BOS(), EOS(), UNK())) # c(TRUE, TRUE, FALSE)
#> [1]  TRUE  TRUE FALSE
d["a"] # query with subsetting syntax
#> [1] TRUE
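
The examples above do not exercise the whitespace and order rules stated in Details. The following sketch checks them on the same toy frequency table. It assumes the kgrams package is installed (the guard skips the checks otherwise), and the variable names are illustrative. The comments state what the Details section leads one to expect rather than recorded output.

```r
# Guarded so the snippet is a no-op when kgrams is not installed
if (requireNamespace("kgrams", quietly = TRUE)) {
  library(kgrams)
  f <- kgram_freqs("a a b a b b a b", N = 2)

  # Extra spaces should not change the result (word-level tokenization)
  same_count <- identical(query(f, " a  b "), query(f, "a b"))

  # A 3-gram exceeds the table order N = 2, so NA is expected
  too_long <- is.na(query(f, "a b a"))

  # The subsetting form should agree with query()
  same_subset <- identical(f["a b"], query(f, "a b"))
}
```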