Question

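I'm using R to do some string processing, and I would like to identify strings that contain a word with a certain root that is not preceded by another word with a certain root.

Here is a simple toy example. Say I would like to identify the strings that contain the word "cat/s" not preceded by "dog/s" anywhere in the string.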

 tests = c(
   "dog cat",
   "dogs and cats",
   "dog and cat", 
   "dog and fluffy cats",
   "cats and dogs", 
   "cat and dog",  
   "fluffy cats and fluffy dogs")

With this pattern, I can pull out the strings that do have "dog" before "cat":

 pattern = "(dog(s|).*)(cat(s|))"
 grep(pattern, tests, perl = TRUE, value = TRUE)

[1] "dog cat"  "dogs and cats"   "dog and cat"   "dog and fluffy cats"

My negative lookbehind, however, does not work:

 neg_pattern = "(?<!dog(s|).*)(cat(s|))"
 grep(neg_pattern, tests, perl = TRUE, value = TRUE)

Error in grep(neg_pattern, tests, perl = TRUE, value = TRUE) :   invalid regular expression

In addition: Warning message:   In grep(neg_pattern, tests, perl = TRUE, value = TRUE) :    PCRE pattern compilation error     'lookbehind assertion is not fixed length'     at ')(cat(s|))'

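I understand that .* is not fixed length, so how can I reject strings that have "dog" before "cat" when the two are separated by any number of other words?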

Answer 1

I hope this helps:

tests = c(
  "dog cat",
  "dogs and cats",
  "dog and cat", 
  "dog and fluffy cats",
  "cats and dogs", 
  "cat and dog",  
  "fluffy cats and fluffy dogs"
)

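# remove strings that have cats after dogs
# (logical indexing with grepl() stays correct even when nothing matches,
# whereas tests[-grep(...)] would return an empty vector in that case)
tests = tests[!grepl(pattern = "dogs?.*cats?", x = tests)]

# select only strings that contain cats
tests = tests[grepl(pattern = "cats?", x = tests)]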

tests

[1] "cats and dogs"               "cat and dog"                
[3] "fluffy cats and fluffy dogs"

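I'm not sure whether you want to do this with a single expression, but regular expressions can still be very useful when applied iteratively.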
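If a single expression is preferred, one possible sketch is to replace the variable-length lookbehind with a negative lookahead anchored at the start of the string, since PCRE lookaheads are not restricted to fixed length. The names `single_pattern` and `result` are illustrative and not from the original post. Like the two-step filter above, this rejects any string in which "dog(s)" appears somewhere before "cat(s)".

```r
tests <- c(
  "dog cat",
  "dogs and cats",
  "dog and cat",
  "dog and fluffy cats",
  "cats and dogs",
  "cat and dog",
  "fluffy cats and fluffy dogs"
)

# ^(?!.*dogs?.*cats?) : from the start of the string, fail if "dog(s)"
#                       is followed anywhere later by "cat(s)"
# .*cats?             : then require that the string contains "cat(s)"
single_pattern <- "^(?!.*dogs?.*cats?).*cats?"

# perl = TRUE is needed because lookaheads are a PCRE feature
result <- grep(single_pattern, tests, perl = TRUE, value = TRUE)
print(result)
```

On the toy data this should select the same three strings as the iterative approach.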

Negative lookbehind in R with multi-word separation
