Word coverage fraction

Compute total and cumulative corpus coverage fraction of a dictionary.

word_coverage(object, corpus, ...)

# S3 method for sbo_dictionary
word_coverage(object, corpus, ...)

# S3 method for character
word_coverage(object, corpus, .preprocess = identity, EOS = "", ...)

# S3 method for sbo_kgram_freqs
word_coverage(object, corpus, ...)

# S3 method for sbo_predictions
word_coverage(object, corpus, ...)

Arguments

object	either a character vector, or an object inheriting from one of the classes `sbo_dictionary`, `sbo_kgram_freqs`, `sbo_predtable` or `sbo_predictor`. The object storing the dictionary for which corpus coverage is to be computed.
corpus	a character vector.
...	further arguments passed to or from other methods.
.preprocess	preprocessing function for training corpus. See `kgram_freqs` and `sbo_dictionary` for further details.
EOS	a length one character vector. String containing End-Of-Sentence characters, see `kgram_freqs` and `sbo_dictionary` for further details.

Value

a word_coverage object.

Details

This function computes the corpus coverage fraction of a dictionary, that is the fraction of words appearing in corpus which are contained in the original dictionary.

This function is a generic, accepting as object argument any object storing a dictionary, along with a preprocessing function and a list of End-Of-Sentence characters. This includes all sbo main classes: sbo_dictionary, sbo_kgram_freqs, sbo_predtable and sbo_predictor. When object is a character vector, the preprocessing function and the End-Of-Sentence characters must be specified explicitly.

The coverage fraction is computed cumulatively, and the dependence of coverage with respect to maximal rank can be explored through plot() (see examples below)

Author

Valerio Gherardi

Examples

# \donttest{
c <- word_coverage(twitter_dict, twitter_train)
print(c)
#> A 'word_coverage' object.
#> 
#> See summary() for more details.
#> 
summary(c)
#> Word coverage fraction
#> 
#> Dictionary length: 1000 
#> Coverage fraction (w/ EOS): 78.1 %
#> Coverage fraction (w/o EOS): 74.9 %
# Plot coverage fraction, including the End-Of-Sentence in word counts.
plot(c, include_EOS = TRUE)
# }

Arguments

Value

Details

See also

Author

Examples