Negative lookbehind in R with multi-word separation

Time: 2017-06-30 21:31:37

Tags: r regex grep lookbehind

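I am doing some string processing in R and want to identify strings that contain a word with a certain root that is not preceded, anywhere earlier in the string, by another word with a certain root.

Here is a simple toy example. Say I want to identify strings in which the word "cat/s" is not preceded by "dog/s" anywhere in the string.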

 tests = c(
   "dog cat",
   "dogs and cats",
   "dog and cat", 
   "dog and fluffy cats",
   "cats and dogs", 
   "cat and dog",  
   "fluffy cats and fluffy dogs")  

Using this pattern, I can pull out the strings that have dog before cat:

 pattern = "(dog(s|).*)(cat(s|))"
 grep(pattern, tests, perl = TRUE, value = TRUE)

[1] "dog cat"  "dogs and cats"   "dog and cat"   "dog and fluffy cats"

My negative lookbehind, however, fails:

 neg_pattern = "(?<!dog(s|).*)(cat(s|))"
 grep(neg_pattern, tests, perl = TRUE, value = TRUE)
  

Error in grep(neg_pattern, tests, perl = TRUE, value = TRUE) : invalid regular expression

In addition: Warning message: In grep(neg_pattern, tests, perl = TRUE, value = TRUE) : PCRE pattern compilation error 'lookbehind assertion is not fixed length' at ')(cat(s|))'

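I understand that .* is not fixed length, so how can I reject strings that have "dog" before "cat" when the two are separated by any number of other words?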
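To illustrate the limitation: a lookbehind whose body has a fixed length does compile, but it only inspects the characters immediately before the match, so it cannot see a "dog" that is several words back. A minimal sketch on a subset of the toy strings (the name `fixed_pattern` is made up for this illustration):

```r
# Subset of the toy strings from the question
tests <- c("dog cat", "dogs and cats", "cats and dogs")

# Fixed-length lookbehind: only rejects "cat" directly preceded by "dog "
fixed_pattern <- "(?<!dog )cats?"

res <- grep(fixed_pattern, tests, perl = TRUE, value = TRUE)
print(res)
```

Here "dog cat" is rejected, but "dogs and cats" still gets through because the text right before "cats" is "and ", not "dog ".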

1 Answer:

Answer 0: (score: 0)

I hope this helps:

tests = c(
  "dog cat",
  "dogs and cats",
  "dog and cat", 
  "dog and fluffy cats",
  "cats and dogs", 
  "cat and dog",  
  "fluffy cats and fluffy dogs"
)

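# remove strings that have cats after dogs
# (!grepl() is used instead of -grep(): if nothing matched, tests[-grep(...)]
# would index with integer(0) and return an empty vector)
tests = tests[!grepl(pattern = "dog(?:s|).*cat(?:s|)", x = tests)]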

# select only strings that contain cats
tests = tests[grep(pattern = "cat(?:s|)", x = tests)]

tests

[1] "cats and dogs"               "cat and dog"                
[3] "fluffy cats and fluffy dogs"

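I'm not sure whether you wanted to do this with a single expression, but regular expressions can still be very useful when applied iteratively.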
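If a single expression is preferred, one possible approach (not part of the original answer) is to swap the lookbehind for a negative lookahead anchored at the start of the string, since lookaheads are allowed to be variable length in PCRE. The sketch below assumes that "dog" and "cat" may be matched as plain substrings, as in the patterns above; the names `single_pattern` and `kept` are made up for this illustration.

```r
tests <- c(
  "dog cat",
  "dogs and cats",
  "dog and cat",
  "dog and fluffy cats",
  "cats and dogs",
  "cat and dog",
  "fluffy cats and fluffy dogs"
)

# ^(?!.*dogs?.*cats?)  : from the start, fail if a "dog" is followed later by a "cat"
# .*cats?              : then require that the string contains "cat" at all
single_pattern <- "^(?!.*dogs?.*cats?).*cats?"

kept <- grep(single_pattern, tests, perl = TRUE, value = TRUE)
print(kept)
```

On the toy data this keeps "cats and dogs", "cat and dog", and "fluffy cats and fluffy dogs". If whole-word matching matters (so that, say, "dogma" does not count as "dog"), adding `\\b` word boundaries around each word would be a reasonable refinement.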