I have the following character vector:
"This is a very long character vector. Why is it so long? I want to split this vector into senteces by using e.g. strssplit. Can someone help me? That would be nice?"
I would like to split it into sentences using the following pattern (i.e. period, space, uppercase letter):
"This is a very long character vector."
"Why is it so long? I want to split this vector into senteces by using e.g. strssplit."
"Can someone help me?"
"That would be nice?"
So a period after an abbreviation should not start a new sentence. I would like to do this with a regular expression in R. Can anyone help me?
Answer 0: (score: 3)
A solution using strsplit:
string <- "This is a very long character vector. Why is it so long? I think lng. is short for long. I want to split this vector into senteces by using e.g. strssplit. Can someone help me? That would be nice?"
unlist(strsplit(string, "(?<=[[:punct:]])\\s(?=[A-Z])", perl = TRUE))
Result:
[1] "This is a very long character vector."
[2] "Why is it so long?"
[3] "I think lng. is short for long."
[4] "I want to split this vector into senteces by using e.g. strssplit."
[5] "Can someone help me?"
[6] "That would be nice?"
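This splits at any whitespace character that is preceded by a punctuation character and followed by an uppercase letter. The lookbehind (?<=[[:punct:]]) requires the punctuation but does not consume it, so it stays at the end of the preceding piece. The lookahead (?=[A-Z]) likewise leaves the uppercase letter at the start of the following piece. Only the whitespace itself is removed.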
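To see why the lookarounds matter, here is a small sketch that compares the pattern with a variant that matches the same characters without lookarounds. The short string `x` is made up for illustration and is not from the question.

```r
# Hypothetical short input, for illustration only.
x <- "One. Two? Three."

# Lookarounds are zero-width: only the whitespace is consumed by the split.
with_lookarounds <- unlist(
  strsplit(x, "(?<=[[:punct:]])\\s(?=[A-Z])", perl = TRUE)
)

# Without lookarounds the punctuation and the capital letter become part of
# the delimiter, so strsplit() drops them from the resulting pieces.
without_lookarounds <- unlist(
  strsplit(x, "[[:punct:]]\\s[A-Z]", perl = TRUE)
)

print(with_lookarounds)
print(without_lookarounds)
```

The second call loses the trailing punctuation of each piece and the first letter of the next one, which is what the lookbehind and lookahead avoid.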
Edit: I just saw that your desired output does not split after the question mark. If you only want to split after a ".", you can use this:
unlist(strsplit(string, "(?<=\\.)\\s(?=[A-Z])", perl = TRUE))
which gives
[1] "This is a very long character vector."
[2] "Why is it so long? I think lng. is short for long."
[3] "I want to split this vector into senteces by using e.g. strssplit."
[4] "Can someone help me? That would be nice?"
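Note that both patterns would still split after an abbreviation that happens to be followed by a capital letter, such as a title before a name. One possible way to guard against that, sketched here under the assumption that the abbreviations to protect are known in advance, is to add one fixed-length negative lookbehind per abbreviation. The input string and the choice of "Dr." and "Mr." are made up for illustration.

```r
# Hypothetical input containing abbreviations followed by a capital letter.
text <- "I met Dr. Smith today. He was late. Mr. Jones left."

# Split on whitespace that follows a period and precedes a capital letter,
# unless the period belongs to one of the listed abbreviations.
# Each negative lookbehind has a fixed length, as PCRE requires.
pattern <- "(?<=\\.)(?<!Dr\\.)(?<!Mr\\.)\\s(?=[A-Z])"

sentences <- unlist(strsplit(text, pattern, perl = TRUE))
print(sentences)
```

This only covers the abbreviations that are listed explicitly. For arbitrary text, a dedicated sentence tokenizer is likely to be more robust than extending the pattern.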
Answer 1: (score: 1)
You can use the package tokenizers:
library(tokenizers)
tokenize_sentences(x)
where x is your character vector. It results in
[[1]]
[1] "This is a very long character vector."
[[2]]
[1] "Why is it so long?"
[2] "I want to split this vector into senteces by using e.g. strssplit."
[[3]]
[1] "Can someone help me?"
[[4]]
[1] "That would be nice?"
You can then use unlist to remove the list structure.
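As a minimal sketch of that last step, the list below is a hand-written stand-in for a tokenizer result (one list element per input string), not actual output from the tokenizers package:

```r
# Hand-written stand-in for a tokenizer result, for illustration only.
sentence_list <- list(
  "This is a very long character vector.",
  c("Why is it so long?", "I want to split this vector."),
  "Can someone help me?"
)

# unlist() flattens the list into a single character vector.
flat <- unlist(sentence_list)
print(flat)
print(length(flat))
```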