A minimal text preprocessing utility.

preprocess(input, erase = "[^.?!:;'[:alnum:][:space:]]", lower_case = TRUE)

Arguments

input

a character vector.

erase

a length one character vector. A regular expression matching the parts of the text to be erased from input. The default removes every character that is not alphanumeric ([:alnum:], i.e. letters and digits), white space ([:space:], i.e. space, tab, vertical tab, newline, form feed, carriage return), an apostrophe, or one of the punctuation characters in "[.?!:;]".

lower_case

a length one logical vector. If TRUE, converts the output to lower case.

Value

a character vector containing the processed output.

Details

The expressions preprocess(x, erase = pattern, lower_case = TRUE) and preprocess(x, erase = pattern, lower_case = FALSE) are roughly equivalent to tolower(gsub(pattern, "", x)) and gsub(pattern, "", x), respectively, provided that the regular expression 'pattern' is correctly recognized by R.
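
The rough equivalence described above can be sketched in base R. The helper name preprocess_sketch is hypothetical and is not part of the library; it only mirrors the gsub/tolower description and makes no claim about how preprocess is actually implemented.

```r
# Hypothetical stand-in for preprocess(), built from base R only.
# It follows the "roughly equivalent" description: erase, then lower-case.
preprocess_sketch <- function(input,
                              erase = "[^.?!:;'[:alnum:][:space:]]",
                              lower_case = TRUE) {
  # Remove every match of the 'erase' regular expression
  out <- gsub(erase, "", input)
  # Optionally convert the result to lower case
  if (lower_case) out <- tolower(out)
  out
}

preprocess_sketch("#This Is An Example@-@!#")
#> [1] "this is an example!"
```

With lower_case = FALSE the sketch reduces to the plain gsub call, so the original capitalization is preserved.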

Note. This function and tknz_sent are included in the library for illustrative purposes only and are not optimized for performance. Furthermore, for performance reasons, the function has separate implementations for Windows and UNIX OS types, so results obtained on the two platforms may differ slightly. In contexts that require full reproducibility, users are encouraged to define their own custom preprocessing and tokenization functions, or to work with externally processed data.
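
As an illustration of the advice in the note above, one possible user-defined preprocessing function is sketched below. The name my_preprocess and the design choices (explicit ASCII ranges instead of POSIX classes, chartr instead of tolower) are assumptions made for this example; they are intended to reduce dependence on locale and platform, not to reproduce the behaviour of preprocess exactly.

```r
# Hypothetical user-defined preprocessing function (not part of the library).
# Assumption: the text of interest is ASCII; all other characters are erased.
my_preprocess <- function(x) {
  # Keep only . ? ! : ; ' ASCII letters, digits and ASCII white space.
  # Explicit ranges avoid the locale dependence of [:alnum:] and [:space:].
  x <- gsub("[^.?!:;'A-Za-z0-9 \t\n\r\f\v]", "", x, perl = TRUE)
  # Map ASCII upper case to lower case without consulting the locale
  chartr(paste(LETTERS, collapse = ""), paste(letters, collapse = ""), x)
}

my_preprocess("Hello, World! #1")
#> [1] "hello world! 1"
```

Because the character set is spelled out explicitly, accented letters are dropped by this sketch, whereas the default erase pattern of preprocess may keep them depending on the locale.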

Author

Valerio Gherardi

Examples

preprocess("#This Is An Example@-@!#")
#> [1] "this is an example!"