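I'm doing some string processing in R and want to identify strings that contain a word with a certain root that is not preceded by another word with a certain root.
Here is a simple toy example. Say I want to identify the strings that contain the word "cat/s" anywhere, where it is not preceded by "dog/s".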
tests = c(
"dog cat",
"dogs and cats",
"dog and cat",
"dog and fluffy cats",
"cats and dogs",
"cat and dog",
"fluffy cats and fluffy dogs")
Using this pattern, I can pull out the strings that do have dog before cat:
pattern = "(dog(s|).*)(cat(s|))"
grep(pattern, tests, perl = TRUE, value = TRUE)
[1] "dog cat" "dogs and cats" "dog and cat" "dog and fluffy cats"
My negative lookbehind, however, fails:
neg_pattern = "(?<!dog(s|).*)(cat(s|))"
grep(neg_pattern, tests, perl = TRUE, value = TRUE)
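Error in grep(neg_pattern, tests, perl = TRUE, value = TRUE) : invalid regular expression
In addition: Warning message: In grep(neg_pattern, tests, perl = TRUE, value = TRUE) : PCRE pattern compilation error 'lookbehind assertion is not fixed length' at ')(cat(s|))'
I understand that .* is not fixed length, so how can I reject strings that have "dog" before "cat" when the two are separated by any number of other words?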
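To illustrate the limitation, here is a minimal sketch of what a fixed-length lookbehind can and cannot do. The variable name `adjacent_only` is made up for this example. The pattern `(?<!dog )cats?` compiles because `dog ` is exactly four characters long, but it only rules out a "dog" that sits directly in front of "cat":

```r
tests <- c(
  "dog cat",
  "dogs and cats",
  "dog and cat",
  "dog and fluffy cats",
  "cats and dogs",
  "cat and dog",
  "fluffy cats and fluffy dogs")

# Fixed-length lookbehind: PCRE accepts it, but it only checks the
# four characters immediately before "cat".
adjacent_only <- grep("(?<!dog )cats?", tests, perl = TRUE, value = TRUE)

# Only "dog cat" is dropped; strings such as "dogs and cats" still match.
print(adjacent_only)
```

So a lookbehind of this kind is not enough once other words can appear between the two.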
Answer 0 (score: 0)
I hope this helps:
tests = c(
"dog cat",
"dogs and cats",
"dog and cat",
"dog and fluffy cats",
"cats and dogs",
"cat and dog",
"fluffy cats and fluffy dogs"
)
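# remove strings that have cats after dogs
# (grepl() gives a logical mask, so this also works when nothing matches;
#  tests[-grep(...)] would return an empty vector in that case)
tests = tests[!grepl(pattern = "dog(?:s|).*cat(?:s|)", x = tests)]
# select only strings that contain cats
tests = tests[grepl(pattern = "cat(?:s|)", x = tests)]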
tests
[1] "cats and dogs" "cat and dog"
[3] "fluffy cats and fluffy dogs"
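I'm not sure whether you wanted to do this with a single expression, but regular expressions can still be very useful when applied iteratively.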
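If a single expression is preferred, one possible approach is to replace the lookbehind with a negative lookahead anchored at the start of the string, since PCRE allows variable-length patterns such as `.*` inside a lookahead. This is only a sketch and not part of the two-step method above. The name `single_pattern` is made up for illustration, and the pattern assumes the same interpretation as the two-step filter: a string is rejected whenever "dog/s" occurs anywhere before a "cat/s".

```r
tests <- c(
  "dog cat",
  "dogs and cats",
  "dog and cat",
  "dog and fluffy cats",
  "cats and dogs",
  "cat and dog",
  "fluffy cats and fluffy dogs")

# ^(?!.*dogs?.*cats?)  from the start, refuse strings where dog(s)
#                      appears somewhere before cat(s)
# .*cats?              then require that cat(s) occurs at all
single_pattern <- "^(?!.*dogs?.*cats?).*cats?"

result <- grep(single_pattern, tests, perl = TRUE, value = TRUE)
print(result)
# [1] "cats and dogs"               "cat and dog"
# [3] "fluffy cats and fluffy dogs"
```

Note that `dogs?` and `cats?` also match inside longer words such as "dogma" or "catalog". If that matters, wrapping them in word boundaries (`\\bdogs?\\b`, `\\bcats?\\b`) is one way to tighten the match.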