Question

I have the following character vector:

"This is a very long character vector. Why is it so long? I want to split this vector into senteces by using e.g. strssplit. Can someone help me? That would be nice?"

I want to split it into sentences using the following pattern (i.e. period - space - uppercase letter):

"This is a very long character vector."
"Why is it so long? I want to split this vector into senteces by using e.g. strssplit."
"Can someone help me?"
"That would be nice?"

So a period after an abbreviation should not start a new sentence. I want to do this with a regular expression in R.

Can someone help me?

Answer 1

A solution using strsplit:

string <- "This is a very long character vector. Why is it so long? I think lng. is short for long. I want to split this vector into senteces by using e.g. strssplit. Can someone help me? That would be nice?"
unlist(strsplit(string, "(?<=[[:punct:]])\\s(?=[A-Z])", perl = TRUE))

Result:

[1] "This is a very long character vector."                             
[2] "Why is it so long?"                                                
[3] "I think lng. is short for long."                                   
[4] "I want to split this vector into senteces by using e.g. strssplit."
[5] "Can someone help me?"                                              
[6] "That would be nice?"

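This matches any punctuation character followed by a whitespace character and an uppercase letter. (?<=[[:punct:]]) is a lookbehind, so the punctuation mark stays in the piece before the delimiter, and (?=[A-Z]) is a lookahead, so the uppercase letter stays in the piece after the delimiter. Only the whitespace between them is consumed.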
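To make the effect of the lookarounds concrete, here is a minimal sketch that compares the pattern above with a variant that has no lookarounds. The short `demo` string and the variable names are made up for this illustration.

```r
# Hypothetical short input, made up for this illustration
demo <- "One. Two? Three."

# Lookarounds are zero-width: only the whitespace is consumed as the delimiter
with_lookaround <- unlist(strsplit(demo, "(?<=[[:punct:]])\\s(?=[A-Z])", perl = TRUE))
print(with_lookaround)
# [1] "One."   "Two?"   "Three."

# Without lookarounds, the punctuation mark and the capital letter are part of
# the delimiter, so they are removed from the pieces
without_lookaround <- unlist(strsplit(demo, "[[:punct:]]\\s[A-Z]", perl = TRUE))
print(without_lookaround)
# [1] "One"   "wo"    "hree."
```

The second call shows what would be lost if the punctuation and the capital letter were matched directly instead of only being asserted.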

Edit: I just saw that your desired output does not split after the question mark. If you only want to split after a ".", you can use this:

unlist(strsplit(string, "(?<=\\.)\\s(?=[A-Z])", perl = TRUE))

which gives

[1] "This is a very long character vector."
[2] "Why is it so long? I think lng. is short for long."
[3] "I want to split this vector into senteces by using e.g. strssplit."
[4] "Can someone help me? That would be nice?"
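One caveat worth checking against your own data: the period-only pattern leaves "e.g. strssplit" alone only because the next word starts with a lowercase letter. The sketch below uses two made-up sentences (not from the question) to show that an abbreviation followed by a capitalized word would still be split.

```r
# Period-only pattern from the edit above
pattern <- "(?<=\\.)\\s(?=[A-Z])"

# "e.g." is followed by a lowercase word, so the lookahead fails there
lower_case_next <- unlist(strsplit("Use e.g. strsplit. Then stop.", pattern, perl = TRUE))
print(lower_case_next)
# [1] "Use e.g. strsplit." "Then stop."

# Made-up counterexample: "e.g." followed by a capitalized word is split
upper_case_next <- unlist(strsplit("Use e.g. Perl regex. Then stop.", pattern, perl = TRUE))
print(upper_case_next)
# [1] "Use e.g."    "Perl regex." "Then stop."
```

If your text contains abbreviations before capitalized words, a plain regular expression of this shape cannot tell them apart from sentence ends.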

Answer 2

You can use the tokenizers package:

library(tokenizers)
tokenize_sentences(x)

where x is your character vector. This results in

[[1]]
[1] "This is a very long character vector."

[[2]]
[1] "Why is it so long?"                                                
[2] "I want to split this vector into senteces by using e.g. strssplit."

[[3]]
[1] "Can someone help me?"

[[4]]
[1] "That would be nice?"

You can then use unlist to remove the list structure.
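As a rough sketch of that last step, assuming the result is a list of character vectors like the output shown above: the `sentences` list here is typed by hand as a stand-in, so the snippet runs without the tokenizers package installed.

```r
# Hand-written stand-in for a list of character vectors
# (not actual tokenizers output)
sentences <- list(
  "First sentence.",
  c("Second sentence?", "Third sentence."),
  "Fourth sentence?"
)

# unlist() flattens the list into a single character vector
flat <- unlist(sentences)
print(flat)
print(length(flat))
```

After flattening, each sentence is one element of an ordinary character vector, which is the same shape the strsplit solution returns.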

Splitting a character vector into sentences
