Sentence tokenizer — tknz

Extract sentences from a batch of text lines.

tknz_sent(input, EOS = "[.?!:;]+", keep_first = FALSE)

Arguments

input: a character vector.
EOS: a regular expression matching an End-Of-Sentence delimiter.
keep_first: TRUE or FALSE. Should the first character of each EOS match be appended to the corresponding returned sentence (preceded by a space)?

Value

a character vector, each entry of which corresponds to a single sentence.

Details

tknz_sent() splits text into sentences, where sentence delimiters are specified by a regular expression through the EOS argument. Specifically, when an EOS token is found, the next sentence begins at the first position in the input string that is neither part of an EOS token nor white space (so that entries like "Hi there!!!" or "Hello . . ." are each recognized as a single sentence).
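
The splitting rule above can be approximated in base R. The sketch below is not the kgrams implementation: split_sentences_sketch() is a hypothetical helper, and it assumes that a delimiter is one EOS match followed by any run of further EOS matches or white space.

```r
# Hypothetical helper (not part of kgrams): approximates the splitting rule
# described above using base R regular expressions only.
split_sentences_sketch <- function(input, EOS = "[.?!:;]+") {
  # One EOS match, then any mix of white space and further EOS matches.
  pattern <- paste0("(", EOS, ")(\\s|", EOS, ")*")
  pieces <- unlist(strsplit(input, pattern, perl = TRUE))
  # Drop surrounding white space and any empty pieces.
  pieces <- trimws(pieces)
  pieces[nzchar(pieces)]
}

split_sentences_sketch(c("Hi there!!!", "Hello . . ."))
#> [1] "Hi there" "Hello"
```

Under these assumptions, repeated or spaced-out delimiters collapse into a single sentence boundary, and entries without any delimiter pass through unchanged.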

If keep_first is FALSE, the delimiters are stripped from the returned sentences. Otherwise, the first character of each substring matching the EOS regular expression is appended to the corresponding sentence, preceded by a white space.
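
As a rough illustration of the keep_first = TRUE behaviour described above, the following base R sketch appends the first character of each delimiter to its sentence. It is not the kgrams implementation: split_keep_first_sketch() is a hypothetical helper, and its handling of edge cases (such as a trailing sentence without a delimiter) is an assumption.

```r
# Hypothetical helper (not part of kgrams): mimics keep_first = TRUE by
# appending the first character of each delimiter, preceded by a space.
split_keep_first_sketch <- function(input, EOS = "[.?!:;]+") {
  pattern <- paste0("(", EOS, ")(\\s|", EOS, ")*")
  unlist(lapply(input, function(line) {
    m <- gregexpr(pattern, line, perl = TRUE)
    delims <- regmatches(line, m)[[1]]
    # invert = TRUE returns the text between matches: one more piece than matches.
    bodies <- trimws(regmatches(line, m, invert = TRUE)[[1]])
    # The last piece has no delimiter after it, so it gets nothing appended.
    firsts <- c(substr(delims, 1, 1), "")
    out <- ifelse(nzchar(firsts), paste(bodies, firsts), bodies)
    out[nzchar(bodies)]
  }))
}

split_keep_first_sketch("Hi there!!! Bye.")
#> [1] "Hi there !" "Bye ."
```

Keeping the first delimiter character preserves the distinction between, for example, questions and statements, while still normalizing runs such as "!!!" to a single token.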

In the absence of any EOS delimiter, tknz_sent() returns the input as is, since parts of text corresponding to different entries of the input vector are understood as parts of separate sentences.

Note. This function, as well as preprocess(), is included in the library for illustrative purposes only, and is not optimized for performance. Furthermore (for performance reasons), the function has separate implementations for Windows and UNIX OS types, so results obtained on the two platforms may differ slightly. In contexts that require full reproducibility, users are encouraged to define their own custom preprocessing and tokenization functions, or to work with externally processed data.

Author

Valerio Gherardi

Examples

tknz_sent("Hi there! I'm using kgrams.")
#> [1] "Hi there"         "I'm using kgrams"