Negation in R: how can I replace the words that follow a negation?

Time: 2017-12-12 14:05:44

Tags: r regex nlp pcre

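I'm following up on a question asked here about how to add the prefix "not_" to the word that follows a negation.

In the comments, MrFlick proposed a solution using the regex gsub("(?<=(?:\\bnot|n't) )(\\w+)\\b", "not_\\1", x, perl=T).

I would like to edit this regex so that the not_ prefix is added to every word following "not" or "n't", up to the next punctuation mark.

Editing cptn's example, I would like: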

x <- "They didn't sell the company, and it went bankrupt" 

to be turned into:

"They didn't not_sell not_the not_company, and it went bankrupt"

Can backreferences still do the trick here? If so, an example would be much appreciated. Thanks!

3 Answers:

Answer 0: (score: 1)

You can use

(?:\bnot|n't|\G(?!\A))\s+\K(\w+)\b

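and replace with not_\1. See the regex demo.

Details

  • (?:\bnot|n't|\G(?!\A)) - any one of three alternatives:
    • \bnot - the word not, preceded by a word boundary
    • n't - the literal n't
    • \G(?!\A) - the end of the previous successful match, so matching continues word by word and stops at the first punctuation mark, which \s+ cannot match
  • \s+ - 1+ whitespace characters
  • \K - the match reset operator, which discards the text matched so far
  • (\w+) - Group 1 (referenced with \1 in the replacement pattern): 1+ word characters (letters, digits or _)
  • \b - a word boundary.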

R demo

x <- "They didn't sell the company, and it went bankrupt"
gsub("(?:\\bnot|n't|\\G(?!\\A))\\s+\\K(\\w+)\\b", "not_\\1", x, perl=TRUE)
## => [1] "They didn't not_sell not_the not_company, and it went bankrupt"
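
As a further check, the same pattern can be run on a sentence that contains more than one negation. This is only a sketch: the helper name prefix_negated and the second test sentence are illustrative assumptions, not part of the answer above.

```r
# Hypothetical helper wrapping the gsub() call shown above.
prefix_negated <- function(text) {
  gsub("(?:\\bnot|n't|\\G(?!\\A))\\s+\\K(\\w+)\\b", "not_\\1", text, perl = TRUE)
}

sentences <- c(
  "They didn't sell the company, and it went bankrupt",
  "They didn't sell the company, and it did not go bankrupt. That's it"
)

# gsub() is vectorised, so each sentence is processed independently.
prefix_negated(sentences)
## [1] "They didn't not_sell not_the not_company, and it went bankrupt"
## [2] "They didn't not_sell not_the not_company, and it did not not_go not_bankrupt. That's it"
```

Each chain of prefixed words ends at the comma or full stop and only restarts at the next "not" or "n't".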

Answer 1: (score: 0)

First, split the string on the punctuation marks you care about. For example:

x <- "They didn't sell the company, and it went bankrupt. Then something else"
x_split <- strsplit(x, split = "[,.]")
x_split
## [[1]]
## [1] "They didn't sell the company" " and it went bankrupt"        " Then something else"

Then apply the regex to each element of the list x_split. Finally, merge all the pieces back together (if needed).
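
The split, apply and merge steps could be sketched as follows. Two things here are assumptions rather than part of this answer: the clauses are extracted with regmatches() instead of strsplit(), so that the punctuation survives the merge, and negate_clause is a hypothetical helper that prefixes every word after the first negation in a clause.

```r
x <- "They didn't sell the company, and it went bankrupt. Then something else"

# Extract the clauses, keeping each trailing "," or "." so that nothing is
# lost when the pieces are pasted back together (strsplit() drops delimiters).
clauses <- regmatches(x, gregexpr("[^,.]+[,.]?", x))[[1]]

# Hypothetical helper: prefix every word after the first "not"/"n't" in a clause.
negate_clause <- function(clause) {
  m <- regexpr("(\\bnot|n't)\\s+", clause, perl = TRUE)
  if (m[1] == -1L) return(clause)  # no negation in this clause
  tail_start <- as.integer(m) + attr(m, "match.length")  # first char after the negation
  head <- substr(clause, 1, tail_start - 1)
  tail <- substr(clause, tail_start, nchar(clause))
  paste0(head, gsub("(\\w+)", "not_\\1", tail, perl = TRUE))
}

# Apply the helper to each clause and merge the results.
result <- paste(vapply(clauses, negate_clause, character(1), USE.NAMES = FALSE),
                collapse = "")
result
## [1] "They didn't not_sell not_the not_company, and it went bankrupt. Then something else"
```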

Answer 2: (score: 0)

This is not ideal, but it gets the job done:

x <- "They didn't sell the company, and it did not go bankrupt. That's it" 

gsub("((^|[[:punct:]]).*?(not|n't)|[[:punct:]].*?((?<=\\s)[[:punct:]]|$))(*SKIP)(*FAIL)|\\s", 
     " not_", x, 
     perl = TRUE)

# [1] "They didn't not_sell not_the not_company, and it did not not_go not_bankrupt. That's it"

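Notes:

This uses the (*SKIP)(*FAIL) trick to exclude any text you do not want the regex to match. It essentially replaces every whitespace character with " not_", except those that fall between:

  1. The start of the string or a punctuation mark, and the next "not" or "n't"

  2. A punctuation mark and either the next punctuation mark preceded by whitespace (the (?<=\\s) lookbehind in the pattern) or the end of the string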
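
For readers new to the (*SKIP)(*FAIL) idiom, here is a minimal sketch on an unrelated toy string (the string and the quoting rule are made up for illustration): whatever the left-hand alternative matches is skipped, and only the right-hand alternative is actually replaced.

```r
# Replace whitespace with "_" everywhere except inside double-quoted spans.
s <- 'keep "this part" as is'

# Left alternative: a quoted span, which (*SKIP)(*FAIL) discards and jumps past.
# Right alternative: any whitespace character that is left over.
out <- gsub('"[^"]*"(*SKIP)(*FAIL)|\\s', "_", s, perl = TRUE)
out
## [1] "keep_\"this part\"_as_is"
```

The pattern in this answer follows the same shape: the long left-hand alternative describes the stretches to leave alone, and \\s is what gets replaced.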