R regex to extract complex strings

Time: 2015-09-26 07:44:56

Tags: regex r

I have a set of messy strings, shown below.

string <- c("GRP-14994/", "GRP-7056 GRP-7036/", "grp-24263(24263)/IRGC 28588", "GRP-15916 /IRGC-42176",
            "GRP-614-250B/", "( GRP 11432)/IRGC-14570", "Tourn", "GRPP256", "Purse", "GRP-14956 Origin:", "GRP 10537", "GRP-10096 Origin: ",
            "SGRP123", "GRP1234", "AC-30009 (GRPHANA)/", "AC-3060 GRP 536-143/Old AC", "RGRPfaa/23", "/-",
            "MGR:7251/", "1216-GR-567/", "X:1 Well KGRPh", "WabGRPvea(II)", "HR33(BGRP)", "Tensor",
            "Wald", "grp12312")

I am trying to extract every instance of GRP followed by digits, where the digits may be separated by a space or "-".

My current attempt gives me the following result.

gsub("(.*)(\\b)(GRP)(-|\\s|)(\\d+)(\\/|\\b)(.*)","\\3\\5", string, ignore.case = T)
 [1] "GRP14994"            "GRP7056"             "grp24263"            "GRP15916"           
 [5] "GRP614"              "GRP11432"            "Tourn"               "GRPP256"            
 [9] "Purse"               "GRP14956"            "GRP10537"            "GRP10096"           
[13] "SGRP123"             "GRP1234"             "AC-30009 (GRPHANA)/" "GRP536"             
[17] "RGRPfaa/23"          "/-"                  "MGR:7251/"           "1216-GR-567/"       
[21] "X:1 Well KGRPh"      "WabGRPvea(II)"       "HR33(BGRP)"          "Tensor"             
[25] "Wald"                "grp12312"      

But the desired output is

out <-  c("GRP14994", "GRP7056 GRP7036", "grp24263", "GRP15916", "GRP614250", 
"GRP11432", "", "", "", "GRP14956", "GRP10537", "GRP10096", "", 
"GRP1234", "", "GRP536143", "", "", "", "", "", "", "", "", "", 
"grp12312")

out
 [1] "GRP14994"        "GRP7056 GRP7036" "grp24263"        "GRP15916"        "GRP614250"       "GRP11432"       
 [7] ""                ""                ""                "GRP14956"        "GRP10537"        "GRP10096"       
[13] ""                "GRP1234"         ""                "GRP536143"       ""                ""               
[19] ""                ""                ""                ""                ""                ""               
[25] ""                "grp12312"    

How can I modify the regex to get the desired result?

2 answers:

Answer 0: (score: 1)

library(stringr)  # provides str_extract_all()
unlist(lapply(str_extract_all(string, "[Gg][rR][pP][-\\s]?\\d+"), function(x) { gsub("[-\\s]+(\\d)", "\\1", paste(x, collapse = " "), perl = TRUE) }))
 [1] "GRP14994"        "GRP7056 GRP7036" "grp24263"       
 [4] "GRP15916"        "GRP614"          "GRP11432"       
 [7] ""                ""                ""               
[10] "GRP14956"        "GRP10537"        "GRP10096"       
[13] "GRP123"          "GRP1234"         ""               
[16] "GRP536"          ""                ""               
[19] ""                ""                ""               
[22] ""                ""                ""               
[25] ""                "grp12312"  
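
Compared with the desired output in the question, the result above differs at elements 5 ("GRP614" instead of "GRP614250"), 13 ("GRP123" instead of "") and 16 ("GRP536" instead of "GRP536143"). The following is a minimal sketch in base R, not part of the original answer, that tries to close those gaps. It assumes a word boundary before GRP and assumes that any further digit runs joined by a single "-" or space belong to the same ID. The names `pattern`, `matches` and `result` are illustrative.

```r
string <- c("GRP-14994/", "GRP-7056 GRP-7036/", "grp-24263(24263)/IRGC 28588", "GRP-15916 /IRGC-42176",
            "GRP-614-250B/", "( GRP 11432)/IRGC-14570", "Tourn", "GRPP256", "Purse", "GRP-14956 Origin:", "GRP 10537", "GRP-10096 Origin: ",
            "SGRP123", "GRP1234", "AC-30009 (GRPHANA)/", "AC-3060 GRP 536-143/Old AC", "RGRPfaa/23", "/-",
            "MGR:7251/", "1216-GR-567/", "X:1 Well KGRPh", "WabGRPvea(II)", "HR33(BGRP)", "Tensor",
            "Wald", "grp12312")

# (?i)              case-insensitive match
# \\b               GRP must start a word, so "SGRP123" is skipped
# [-\\s]?\\d+       optional separator, then the first run of digits
# (?:[-\\s]\\d+)*   any further digit runs, each preceded by one separator
pattern <- "(?i)\\bGRP[-\\s]?\\d+(?:[-\\s]\\d+)*"

# All matches per element; elements without a match yield character(0).
matches <- regmatches(string, gregexpr(pattern, string, perl = TRUE))

# Drop the separators inside each match, then join multiple IDs with a space.
result <- vapply(
  matches,
  function(m) paste(gsub("[-\\s]", "", m, perl = TRUE), collapse = " "),
  character(1)
)
```

On the sample vector this should reproduce `out` from the question, which `identical(result, out)` can confirm. One caveat of the assumption: an input such as "GRP-7056 7036" would be merged into a single ID, which may or may not be what is wanted.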

Answer 1: (score: 1)

Your pattern

(.*)(\\b)(GRP)(-|\\s|)(\\d+)(\\/|\\b)(.*)","\\3\\5

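Errors found in your pattern:

1. You want to capture things like GRP-668-888, but your pattern only allows a hyphen followed by a single run of digits, i.e. GRP-668.

2. Since you are not using the rest of the string, the greedy expressions (.*) before and after the pattern are unnecessary. You can simply use ", since it always comes before GRP.

3. Also, the boundary \\b before (GRP) is not needed in your pattern.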

These are the important things I can detect for now.
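
Points 1 and 3 can be checked on two strings from the question. This is only a sketch: the pattern names are illustrative, and `perl = TRUE` is assumed so that `\\s` is honoured inside the bracket expression. Note that point 3 may depend on the data, since the question's desired output maps "SGRP123" to "", which a pattern without `\\b` would not give.

```r
x <- "GRP-614-250B/"

# Point 1: a single digit run versus repeated digit runs.
single   <- "(?i)GRP[-\\s]?\\d+"
repeated <- "(?i)GRP[-\\s]?\\d+(?:[-\\s]\\d+)*"

first_single   <- regmatches(x, regexpr(single, x, perl = TRUE))    # "GRP-614"
first_repeated <- regmatches(x, regexpr(repeated, x, perl = TRUE))  # "GRP-614-250"

# Point 3: effect of the word boundary on "SGRP123".
no_boundary   <- grepl("(?i)GRP\\d+", "SGRP123", perl = TRUE)     # TRUE
with_boundary <- grepl("(?i)\\bGRP\\d+", "SGRP123", perl = TRUE)  # FALSE
```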

You can also try the pattern below

gsub("(grp)[-\\s]?(\\d+)[-\\s]?(\\d+)", "\\1\\2\\3", string, ignore.case = TRUE, perl = TRUE)

grp: matches grp in the string

[-\s]?: matches an optional hyphen - or whitespace \s

(\d+): captures one or more digits

See the DEMO
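
To see what the pieces of this pattern do in R, here is a small sketch on three strings from the question. It assumes the backslashes are doubled, as R string literals require, and that `perl = TRUE` is used. The variable names are illustrative.

```r
x <- c("AC-3060 GRP 536-143/Old AC", "GRP-14994/", "Tourn")
pattern <- "(grp)[-\\s]?(\\d+)[-\\s]?(\\d+)"

# The text covered by the whole pattern (non-matching elements are dropped).
matched <- regmatches(x, regexpr(pattern, x, ignore.case = TRUE, perl = TRUE))
# "GRP 536-143" "GRP-14994"

# gsub() rewrites only the matched part and leaves the rest of each string alone.
replaced <- gsub(pattern, "\\1\\2\\3", x, ignore.case = TRUE, perl = TRUE)
# "AC-3060 GRP536143/Old AC" "GRP14994/" "Tourn"
```

Because the substitution keeps the surrounding text and leaves non-matching strings unchanged, getting bare IDs (and "" for strings without one) would still need a separate extraction step.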