Question

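I am trying to get the full regex match back in R, but I only seem to get the first part of the string.

Using http://regexpal.com/ I can confirm that my regex is fine and matches what I expect. In my data, the "error type" is found between the number that starts with an asterisk and the next comma. So I expect "*20508436572 access forbidden by rule" to be returned in the first instance and "*20508436572 some_error" in the second.

Example: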

library(stringr)

regex.errortype<-'\\*\\d+\\s[^,\\n]+'
test_string1<-'2014/08/07 08:28:56 [error] 21278#0: *20508436572 access forbidden by rule, client: 111.222.111.222'
test_string2<-'2014/08/07 08:28:56 [error] 21278#0: *20508436572 some_error, client: 111.222.111.222'

str_extract(test_string1, regex.errortype)
str_extract_all(test_string1, regex.errortype)
regmatches(test_string1, regexpr(regex.errortype, test_string1))

str_extract(test_string2, regex.errortype)
str_extract_all(test_string2, regex.errortype)
regmatches(test_string2, regexpr(regex.errortype, test_string2))

Results:

> str_extract(test_string1, regex.errortype)
[1] "*20508436572 access forbidde"
> str_extract_all(test_string1, regex.errortype)
[[1]]
[1] "*20508436572 access forbidde"

> regmatches(test_string1, regexpr(regex.errortype, test_string1))
[1] "*20508436572 access forbidde"

> str_extract(test_string2, regex.errortype)
[1] "*20508436572 some_error"
> str_extract_all(test_string2, regex.errortype)
[[1]]
[1] "*20508436572 some_error"

> regmatches(test_string2, regexpr(regex.errortype, test_string2))
[1] "*20508436572 some_error"

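As you can see, the longer match is truncated, but the shorter one is parsed correctly.

Am I missing something here, or is there some other way to get the full match back?

Cheers,

Andy.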

Answer 1

str_extract_all(test_string1, perl("(?<=\\#[0-9]\\: )\\*\\d+\\s[^,\\n]+"))[[1]]
#[1] "*20508436572 access forbidden by rule"

str_extract_all(test_string2, perl("(?<=\\#[0-9]\\: )\\*\\d+\\s[^,\\n]+"))[[1]]
#[1] "*20508436572 some_error"

This uses a lookbehind:

(?<=\\# looks for a #

[0-9] followed by a single digit

\\: followed by a : and a space

After that comes your original pattern.
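
As a possible complement to the lookbehind version, here is a minimal base R sketch that keeps the pattern from the question unchanged and only switches the engine with perl = TRUE. It rests on an assumption: that the truncation comes from the default engine reading \\n inside a bracket expression as a literal backslash plus a literal n, which would be consistent with the match stopping just before the n in "forbidden". With perl = TRUE the same escape is read as a newline. The helper name extract_error is hypothetical and not part of any package.

```r
# Sample log lines from the question.
test_string1 <- '2014/08/07 08:28:56 [error] 21278#0: *20508436572 access forbidden by rule, client: 111.222.111.222'
test_string2 <- '2014/08/07 08:28:56 [error] 21278#0: *20508436572 some_error, client: 111.222.111.222'

# Original pattern from the question, unchanged.
regex.errortype <- '\\*\\d+\\s[^,\\n]+'

# Hypothetical helper: return the first match in each string using the
# PCRE engine (perl = TRUE). Strings without a match are dropped by regmatches().
extract_error <- function(x, pattern = regex.errortype) {
  regmatches(x, regexpr(pattern, x, perl = TRUE))
}

extract_error(test_string1)
# [1] "*20508436572 access forbidden by rule"

extract_error(test_string2)
# [1] "*20508436572 some_error"
```

If every log entry sits on a single line, dropping \\n from the bracket expression (so that it reads [^,]+) should sidestep the issue regardless of engine, at the cost of letting a match run across line breaks.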

Answer 2

Here is a gsub approach that removes everything except the desired string, without reworking the original regex.

> gsub("((.*)[*])|([,](.*))", "", c(test_string1, test_string2))
# [1] "20508436572 access forbidden by rule" 
# [2] "20508436572 some_error"

In the regex ((.*)[*])|([,](.*)),

((.*)[*]) removes everything up to and including the * character.
| means "or".
([,](.*)) removes the comma and everything after it.
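
The breakdown above can be sketched as two separate passes, one per branch of the alternation, which makes it easier to see what each part removes. Note that this approach also drops the leading asterisk. The last line is a hypothetical variant, not part of the original answer, that keeps the asterisk by capturing the wanted text instead of deleting the rest. The variable names are illustrative only.

```r
# Sample log lines from the question.
test_string1 <- '2014/08/07 08:28:56 [error] 21278#0: *20508436572 access forbidden by rule, client: 111.222.111.222'
test_string2 <- '2014/08/07 08:28:56 [error] 21278#0: *20508436572 some_error, client: 111.222.111.222'
x <- c(test_string1, test_string2)

# The one-call version from the answer.
one_call <- gsub("((.*)[*])|([,](.*))", "", x)

# The same result in two passes, one per branch of the alternation.
after_star   <- sub(".*[*]", "", x)           # drop everything up to and including "*"
before_comma <- sub("[,].*", "", after_star)  # drop the first comma and everything after it

identical(one_call, before_comma)
# [1] TRUE

# Hypothetical variant: capture from "*" up to the next comma, keeping the asterisk.
# If a string does not match the pattern, sub() returns it unchanged.
with_star <- sub(".*([*][^,]+),.*", "\\1", x)
with_star
# [1] "*20508436572 access forbidden by rule"
# [2] "*20508436572 some_error"
```

Both forms rely on each line containing exactly one asterisk before the error text, as in the sample data; lines with additional asterisks or no comma would need a stricter pattern.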

R regex: how to get the full matched string
