Extract sentences from a batch of text lines.
tknz_sent(input, EOS = "[.?!:;]+", keep_first = FALSE)
a character vector, each entry of which corresponds to a single sentence.
tknz_sent() splits text into sentences, where sentence delimiters are specified by a regular expression through the EOS argument.
Specifically, when an EOS token is found, the next sentence begins at the first position in the input string that contains neither an EOS token nor white space (so that entries like "Hi there!!!" or "Hello . . ." are both recognized as a single sentence).
If keep_first is FALSE, the delimiters are stripped from the returned sentences. Otherwise, the first character of each substring matching the EOS regular expression is appended to the corresponding sentence, preceded by a white space.
In the absence of any EOS delimiter, tknz_sent() returns the input as is, since parts of text corresponding to different entries of the input vector are understood as parts of separate sentences.
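The rule described above can be approximated in base R. The following is only a minimal sketch under stated assumptions, not the kgrams implementation: the helper name split_sentences_sketch is hypothetical, and trimming white space around each piece and dropping empty pieces are assumptions made for this illustration, so results may differ from tknz_sent() in edge cases.

```r
# Hypothetical helper, NOT the kgrams implementation: a base-R approximation
# of the splitting rule described above.
split_sentences_sketch <- function(input, EOS = "[.?!:;]+", keep_first = FALSE) {
  # A delimiter run is one EOS match followed by any mix of further EOS
  # matches and white space, so "!!!" and ". . ." each count as one break.
  run <- paste0("(?:", EOS, ")(?:\\s|(?:", EOS, "))*")
  split_line <- function(line) {
    m <- gregexpr(run, line, perl = TRUE)
    delims <- regmatches(line, m)[[1]]
    # No delimiter found: return the entry unchanged.
    if (length(delims) == 0L) return(line)
    # Text between delimiter runs; there is always one more piece than runs.
    # Assumption: surrounding white space is trimmed from each piece.
    pieces <- trimws(regmatches(line, m, invert = TRUE)[[1]])
    closed <- pieces[seq_along(delims)]  # pieces terminated by a delimiter
    if (keep_first) {
      # Append the first character of each run, preceded by a single space.
      closed <- ifelse(nzchar(closed),
                       paste(closed, substr(delims, 1, 1)),
                       closed)
    }
    out <- c(closed, pieces[length(pieces)])
    out[nzchar(out)]  # assumption: empty pieces are dropped
  }
  as.character(unlist(lapply(input, split_line)))
}

# Delimiters stripped (the default)
split_sentences_sketch("Hi there! I'm using kgrams.")

# First delimiter character kept; repeated delimiters collapse to one break
split_sentences_sketch(c("Hi there!!!", "Hello . . ."), keep_first = TRUE)
```

Each entry of the input vector is processed on its own, so a sentence never spans two entries, in line with the behaviour described above.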
Note. This function, like preprocess, is included in the library for illustrative purposes only and is not optimized for performance. Furthermore (for performance reasons) the function has separate implementations for Windows and UNIX OS types, so results obtained on the two platforms may differ slightly. In contexts that require full reproducibility, users are encouraged to define their own custom preprocessing and tokenization functions, or to work with externally processed data.
tknz_sent("Hi there! I'm using kgrams.")
#> [1] "Hi there" "I'm using kgrams"