Compute total and cumulative corpus coverage fraction of a dictionary.

word_coverage(object, corpus, ...)

# S3 method for sbo_dictionary
word_coverage(object, corpus, ...)

# S3 method for character
word_coverage(object, corpus, .preprocess = identity, EOS = "", ...)

# S3 method for sbo_kgram_freqs
word_coverage(object, corpus, ...)

# S3 method for sbo_predictions
word_coverage(object, corpus, ...)

Arguments

object

either a character vector, or an object inheriting from one of the classes sbo_dictionary, sbo_kgram_freqs, sbo_predtable or sbo_predictor. This is the object storing the dictionary for which corpus coverage is to be computed.

corpus

a character vector.

...

further arguments passed to or from other methods.

.preprocess

preprocessing function for training corpus. See kgram_freqs and sbo_dictionary for further details.

EOS

a length one character vector. String containing End-Of-Sentence characters, see kgram_freqs and sbo_dictionary for further details.

Value

a word_coverage object.

Details

This function computes the corpus coverage fraction of a dictionary, that is, the fraction of words appearing in corpus which are contained in the original dictionary.
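The definition above can be illustrated with a minimal base R sketch. It does not reproduce the package's implementation: the toy dictionary and corpus are hypothetical, tokens are assumed to be obtained by splitting on whitespace, and End-Of-Sentence tokens are ignored.

```r
# Hypothetical toy inputs, not taken from the package.
dictionary <- c("the", "cat", "sat")
corpus <- c("the cat sat on the mat", "the dog sat")

# Assumption: words are obtained by splitting each sentence on whitespace.
tokens <- unlist(strsplit(corpus, "\\s+"))

# Coverage fraction: share of corpus tokens that belong to the dictionary.
coverage <- mean(tokens %in% dictionary)
coverage
```

Here 6 of the 9 tokens belong to the dictionary, so the coverage fraction is 6/9.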

This function is generic: its object argument can be any object storing a dictionary, along with a preprocessing function and a list of End-Of-Sentence characters. This includes all sbo main classes: sbo_dictionary, sbo_kgram_freqs, sbo_predtable and sbo_predictor. When object is a character vector, the preprocessing function and the End-Of-Sentence characters must be specified explicitly.
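As a hedged illustration of the character method, the sketch below passes the preprocessing function and the End-Of-Sentence characters explicitly, following the signature shown in the usage section. The toy dictionary and corpus are hypothetical, and the choice of sbo::preprocess and of ".?!:;" as EOS characters is an assumption made for this example. The call is guarded so that the snippet is a no-op when sbo is not installed.

```r
# Hypothetical toy inputs, not taken from the package.
dict <- c("the", "cat", "sat")
corpus <- c("The cat sat on the mat.", "The dog sat!")

if (requireNamespace("sbo", quietly = TRUE)) {
  # With a plain character vector as dictionary, .preprocess and EOS
  # are not stored anywhere, so they are supplied by hand.
  cov <- sbo::word_coverage(dict, corpus,
                            .preprocess = sbo::preprocess,
                            EOS = ".?!:;")
  summary(cov)
}
```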

The coverage fraction is computed cumulatively, and the dependence of coverage on the maximal word rank can be explored through plot() (see the examples below).
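The cumulative computation can be sketched in base R as follows. This is an illustration under stated assumptions rather than the package's code: the toy inputs are hypothetical, the dictionary is assumed to be sorted by rank (most frequent word first), tokens come from whitespace splitting, and End-Of-Sentence tokens are ignored.

```r
# Hypothetical toy inputs; the dictionary is assumed to be ordered by rank.
dictionary <- c("the", "sat", "cat")
corpus <- c("the cat sat on the mat", "the dog sat")

# Assumption: words are obtained by splitting each sentence on whitespace.
tokens <- unlist(strsplit(corpus, "\\s+"))

# Number of corpus occurrences of each dictionary word, in rank order.
counts <- vapply(dictionary, function(w) sum(tokens == w), numeric(1))

# Element k is the coverage obtained by keeping only the top k words.
cumulative <- cumsum(counts) / length(tokens)
cumulative
```

The last element of cumulative is the total coverage fraction of the full dictionary, and the earlier elements show how coverage grows as the maximal rank increases.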

Author

Valerio Gherardi

Examples

# \donttest{
c <- word_coverage(twitter_dict, twitter_train)
print(c)
#> A 'word_coverage' object.
#>
#> See summary() for more details.
summary(c)
#> Word coverage fraction
#>
#> Dictionary length: 1000
#> Coverage fraction (w/ EOS): 78.1 %
#> Coverage fraction (w/o EOS): 74.9 %
# Plot coverage fraction, including the End-Of-Sentence in word counts.
plot(c, include_EOS = TRUE)
# }