I am using the strsplit function to do this.
I found a number of regular expressions for this, including:
(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s
First, when I use it in R, I get an error:
sl <- unlist(strsplit(txt1,"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s"))
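Error: '\w' is an unrecognized escape in character string starting ""(?<!\w"
Second, when I test it in a regex tester, it still does not solve my problem. My paragraph is: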
As of Feb. 9, the Ministry of Agriculture, Fisheries and Food
said that 9,998 cattle have been destroyed after being diagnosed
with BSE. The government has paid $6.1 million in compensation, and is
budgeting $16 million for 1990.
I want 2 sentences:
As of Feb. 9, the Ministry of Agriculture, Fisheries and Food
said that 9,998 cattle have been destroyed after being diagnosed
with BSE.
The government has paid $6.1 million in compensation, and is
budgeting $16 million for 1990.
But the regex above splits it into 3 sentences:
As of Feb.
9, the Ministry of Agriculture, Fisheries and Food said that 9,998 cattle have been destroyed after being diagnosed
with BSE.
The government has paid $6.1 million in compensation, and is
budgeting $16 million for 1990.
Answer 0 (score: 2)
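I don't see what the two negative lookbehinds, (?<!\w\.\w.) and (?<![A-Z][a-z]\.), are meant to handle. You really only need a lookbehind for a period or question mark, (?<=\\.|\\?) (possibly adding an exclamation mark as well), followed by a whitespace character \\s, and then a positive lookahead for an uppercase letter: (?=[A-Z]). Because the 9 in "Feb. 9" is not an uppercase letter, the regex does not split there.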
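And yes, in an R string literal every regex backslash has to be written as two backslashes (\\), and if you use a lookahead or lookbehind in strsplit you need to specify perl = TRUE, because the default regex engine does not support them.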
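To make the escaping rule concrete, here is a minimal sketch. The sample string x is made up for illustration, and the raw-string form r"(...)" assumes R 4.0.0 or later.

```r
# Two backslashes in the R source produce a single backslash in the string.
pattern <- "\\s"
nchar(pattern)   # 2: one backslash plus the letter "s"

# Hypothetical sample text, used only for this illustration.
x <- "One. Two? Three"

# Lookbehind is a Perl-style (PCRE) feature, so perl = TRUE is required.
parts <- strsplit(x, "(?<=\\.|\\?)\\s", perl = TRUE)[[1]]
parts            # "One."  "Two?"  "Three"

# Assuming R >= 4.0.0: a raw string lets you write the regex without doubling.
raw_pattern <- r"((?<=\.|\?)\s(?=[A-Z]))"
identical(raw_pattern, "(?<=\\.|\\?)\\s(?=[A-Z])")   # TRUE
```

The raw-string form is only a convenience; both spellings hand the same pattern to the regex engine.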
All in all, what you really need is
strsplit(txt1, "(?<=\\.|\\?)\\s(?=[A-Z])", perl = TRUE)
which gives you
[[1]]
[1] "As of Feb. 9, the Ministry of Agriculture, Fisheries and Food said that 9,998 cattle have been destroyed after being diagnosed with BSE."
[2] "The government has paid $6.1 million in compensation, and is budgeting $16 million for 1990."
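If you also want to split after exclamation marks, as suggested above, one possible variant is sketched below. The helper name split_sentences is hypothetical (it is not a base R function), and using \\s+ instead of \\s is my own assumption, so that a line break or a double space between sentences is consumed as well.

```r
# Hypothetical helper, not part of base R: split on whitespace that follows
# ".", "?" or "!" and is followed by an uppercase letter.
split_sentences <- function(text) {
  strsplit(text, "(?<=[.?!])\\s+(?=[A-Z])", perl = TRUE)[[1]]
}

txt1 <- paste(
  "As of Feb. 9, the Ministry of Agriculture, Fisheries and Food",
  "said that 9,998 cattle have been destroyed after being diagnosed",
  "with BSE. The government has paid $6.1 million in compensation, and is",
  "budgeting $16 million for 1990."
)

sentences <- split_sentences(txt1)
length(sentences)   # 2

# Known limitation: an abbreviation followed by a capitalized word still splits.
split_sentences("Mr. Smith arrived! He was late.")
# "Mr."  "Smith arrived!"  "He was late."
```

This stays a heuristic: it works for the paragraph in the question only because "Feb." happens to be followed by a digit, so text with titles such as "Mr." or "Dr." would need extra handling.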