How to remove sentences containing a conjunction in R

Time: 2017-10-06 11:00:50

Tags: r regex

I have some text; an example is shown below.

Input:

c(",At the end of the study everything was great\n,There is an funny looking thing somewhere but I didn't look at it too hard\nSome other sentence\n The test ended.",",Not sure how to get this regex sorted\nI don't know how to get rid of sentences between the two nearest carriage returns but without my head spinning\nHow do I do this")

Expected output:

,At the end of the study everything was great\n,Some other sentence\nThe test ended.
,Not sure how to get this regex sorted\n\nHow do I do this

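My attempt removed the whole sentence instead. How do I remove only the phrases that contain "but" and keep the remaining phrases in each sentence?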

2 answers:

Answer 0: (score: 1)

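Note that you mixed up "\n" and "/n"; I made the same mistake at first.

My idea for a solution:

1) Simply capture all characters before and after "but" that are not a newline ([^\n]).

2) (Edit) To fix the problem Wiktor found, we also have to check that no letter comes directly before or after "but", which is what [^a-zA-Z] does.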

x <- c(",At the end of the study everything was great\n,There is an funny looking thing somewhere but I didn't look at it too hard\nSome other sentence\n The test ended.",
       ",Not sure how to get this regex sorted\nI don't know how to get rid of sentences between the two nearest carriage returns but without my head spinning\nHow do I do this")

> gsub("[^\n]*[^a-zA-Z]but[^a-zA-Z][^\n]*", "", x)
[1] ",At the end of the study everything was great\n\nSome other sentence\n The test ended."
[2] ",Not sure how to get this regex sorted\n\nHow do I do this" 
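To see why step 2 matters, here is a minimal sketch on a made-up string (the sample text `y` and the names `loose` and `strict` are assumptions, not part of the original answer). It compares the pattern without the letter check against the one used above:

```r
# Hypothetical sample: "butter" contains "but" but is not the conjunction.
y <- "I like butter\nI ran but fell\nDone"

# Step 1 only: any line containing the letters "but" is removed,
# so the "butter" line disappears as well.
loose <- gsub("[^\n]*but[^\n]*", "", y)

# Steps 1 and 2: "but" must have a non-letter on both sides,
# so only the line with the standalone word is removed.
strict <- gsub("[^\n]*[^a-zA-Z]but[^a-zA-Z][^\n]*", "", y)

print(loose)   # "\n\nDone"
print(strict)  # "I like butter\n\nDone"
```

One caveat with this pattern: [^a-zA-Z] has to consume a character, and that character may itself be a newline, so a "but" sitting at the very start or end of a line may not behave as expected.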

Answer 1: (score: 1)

You can use

x <- c(",At the end of the study everything was great\n,There is an funny looking thing somewhere but I didn't look at it too hard\nSome other sentence\n The test ended.", ",Not sure how to get this regex sorted\nI don't know how to get rid of sentences between the two nearest carriage returns but without my head spinning\nHow do I do this")
gsub(".*\\bbut\\b.*[\r\n]*", "", x, ignore.case=TRUE, perl=TRUE)
gsub("(?n).*\\bbut\\b.*[\r\n]*", "", x, ignore.case=TRUE)

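See the R demo online.

The PCRE pattern matches:

  • .* - zero or more characters other than line break characters, as many as possible
  • \\bbut\\b - the whole word but (\b is a word boundary)
  • .* - zero or more characters other than line break characters, as many as possible
  • [\r\n]* - zero or more line break characters.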
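The word-boundary part of the pattern can be checked on its own. The following is a small sketch with made-up test words (the vector `words` and the name `hits` are assumptions for illustration):

```r
# Hypothetical test words: only a standalone "but" should match.
words <- c("but", "butter", "rebut", "But,")

# \\b requires a word boundary on each side of "but";
# ignore.case = TRUE lets the capitalised "But," match too.
hits <- grepl("\\bbut\\b", words, ignore.case = TRUE, perl = TRUE)

print(hits)  # TRUE FALSE FALSE TRUE
```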

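Note that the first gsub has the perl=TRUE argument, which makes R parse the pattern with the PCRE regex engine, where . does not match line break characters. The second gsub uses the TRE (default) regex engine and needs the (?n) inline modifier to stop . from matching line break characters.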
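A short sketch of both calls on a made-up three-line string (the sample `s` and the names `pcre` and `tre` are assumptions). It is meant only to show the line containing "but" being dropped together with its trailing line break, so no empty line is left behind:

```r
# Hypothetical sample, not taken from the question.
s <- "First line\nSecond but bad\nThird line"

# PCRE engine: "." stops at "\n", so only the middle line is matched,
# and [\r\n]* also consumes the line break that follows it.
pcre <- gsub(".*\\bbut\\b.*[\r\n]*", "", s, ignore.case = TRUE, perl = TRUE)

# TRE engine (R's default): (?n) is used, as in the answer,
# to keep "." from matching line breaks.
tre <- gsub("(?n).*\\bbut\\b.*[\r\n]*", "", s, ignore.case = TRUE)

print(pcre)  # "First line\nThird line"
print(identical(pcre, tre))
```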