Here is my sample text:
text = "First sentence. This is a second sentence. I like pets e.g. cats or birds."
I have a function that splits text into sentences:
library(stringi)
split_by_sentence <- function(text) {
  # split on a period followed by whitespace, or on any question or exclamation mark
  result <- unlist(strsplit(text, "\\.\\s|\\?|!"))
  # trim surrounding whitespace and drop empty pieces
  result <- stri_trim_both(result)
  result <- result[nchar(result) > 0]
  if (length(result) == 0)
    result <- ""
  return(result)
}
In effect it splits on punctuation characters. This is the output:
> split_by_sentence(text)
[1] "First sentence" "This is a second sentence" "I like pets e.g" "cats or birds."
Is it possible to exclude particular patterns, such as "e.g."?
Answer 0 (score: 4)
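In your pattern, you can specify that you only want to split at a punctuation mark followed by a space if it is preceded by at least 3 alphanumeric characters (using a lookbehind). This gives: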
unlist(strsplit(text, "(?<=[[:alnum:]]{3})[?!.]\\s", perl=TRUE))
#[1] "First sentence" "This is a second sentence" "I like pets e.g. cats or birds."
If you want to keep the punctuation marks, you can move them into the lookbehind and split only on the whitespace:
unlist(strsplit(text, "(?<=[[:alnum:]]{3}[?!.])\\s", perl=TRUE))
# [1] "First sentence." "This is a second sentence." "I like pets e.g. cats or birds."
text2 <- "I like pets (cats and birds) and horses. I have 1.8 bn. horses."
unlist(strsplit(text2, "(?<=[[:alnum:]]{3}[?!.])\\s", perl=TRUE))
#[1] "I like pets (cats and birds) and horses." "I have 1.8 bn. horses."
N.B.: If there may be several spaces after the punctuation mark, you can use \\s+ instead of \\s in the pattern.
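The \\s+ variant can be wrapped in a small helper, as a minimal sketch. The function name split_sentences, its min_chars argument, and the sample string text3 are illustrative assumptions, not part of the answer above; the pattern itself is the one from the answer with \\s replaced by \\s+.

```r
# Hypothetical helper: name and min_chars argument are illustrative only
split_sentences <- function(text, min_chars = 3L) {
  # Split on whitespace that follows ., ? or ! when that mark is preceded by
  # min_chars alphanumeric characters; \\s+ absorbs runs of several spaces
  pattern <- sprintf("(?<=[[:alnum:]]{%d}[?!.])\\s+", as.integer(min_chars))
  unlist(strsplit(text, pattern, perl = TRUE))
}

# Made-up input with two and three spaces after the sentence-ending periods
text3 <- "First sentence.  This is a second sentence.   I like pets e.g. cats or birds."
parts <- split_sentences(text3)
print(parts)
```

Under these assumptions the call should give the same three sentences as before, with no leading spaces. Note that the rule also skips real sentence ends after words shorter than three characters (for example "Let it go. Now."), so min_chars is a trade-off rather than a general solution.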
Answer 1 (score: 3)
Hope this helps!
library(tokenizers)
text = "First sentence. This is a second sentence. I like pets e.g. cats or birds."
tokenize_sentences(text)
The output is:
[[1]]
[1] "First sentence." "This is a second sentence." "I like pets e.g. cats or birds."
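Because the result shown above is a list with one element per input string, a plain character vector for a single string can be obtained by indexing the first element. A minimal sketch, assuming the tokenizers package is installed (the requireNamespace guard is an addition for safety, not part of the answer):

```r
text <- "First sentence. This is a second sentence. I like pets e.g. cats or birds."

# Only run if the tokenizers package is available (assumption: it may not be installed)
if (requireNamespace("tokenizers", quietly = TRUE)) {
  # tokenize_sentences() returns a list; [[1]] extracts the sentences for the first input
  sentences <- tokenizers::tokenize_sentences(text)[[1]]
  print(sentences)
}
```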