I have a messy set of strings, shown below.
string <- c("GRP-14994/", "GRP-7056 GRP-7036/", "grp-24263(24263)/IRGC 28588", "GRP-15916 /IRGC-42176",
"GRP-614-250B/", "( GRP 11432)/IRGC-14570", "Tourn", "GRPP256", "Purse", "GRP-14956 Origin:", "GRP 10537", "GRP-10096 Origin: ",
"SGRP123", "GRP1234", "AC-30009 (GRPHANA)/", "AC-3060 GRP 536-143/Old AC", "RGRPfaa/23", "/-",
"MGR:7251/", "1216-GR-567/", "X:1 Well KGRPh", "WabGRPvea(II)", "HR33(BGRP)", "Tensor",
"Wald", "grp12312")
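I am trying to extract every instance of GRP followed by digits, where the digits may be separated by a space or "-".
My current attempt gives me the following result.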
gsub("(.*)(\\b)(GRP)(-|\\s|)(\\d+)(\\/|\\b)(.*)","\\3\\5", string, ignore.case = T)
[1] "GRP14994" "GRP7056" "grp24263" "GRP15916"
[5] "GRP614" "GRP11432" "Tourn" "GRPP256"
[9] "Purse" "GRP14956" "GRP10537" "GRP10096"
[13] "SGRP123" "GRP1234" "AC-30009 (GRPHANA)/" "GRP536"
[17] "RGRPfaa/23" "/-" "MGR:7251/" "1216-GR-567/"
[21] "X:1 Well KGRPh" "WabGRPvea(II)" "HR33(BGRP)" "Tensor"
[25] "Wald" "grp12312"
But the desired output is
out <- c("GRP14994", "GRP7056 GRP7036", "grp24263", "GRP15916", "GRP614250",
"GRP11432", "", "", "", "GRP14956", "GRP10537", "GRP10096", "",
"GRP1234", "", "GRP536143", "", "", "", "", "", "", "", "", "",
"grp12312")
out
[1] "GRP14994" "GRP7056 GRP7036" "grp24263" "GRP15916" "GRP614250" "GRP11432"
[7] "" "" "" "GRP14956" "GRP10537" "GRP10096"
[13] "" "GRP1234" "" "GRP536143" "" ""
[19] "" "" "" "" "" ""
[25] "" "grp12312"
How can I modify the regular expression to get the desired result?
Answer 0 (score: 1)
library(stringr)  # provides str_extract_all()
unlist(lapply(str_extract_all(string,"[Gg][rR][pP][-\\s]?\\d+"), function (x) { gsub("[-\\s]+(\\d)", "\\1", paste(x, collapse= " "),perl=T) }))
[1] "GRP14994" "GRP7056 GRP7036" "grp24263"
[4] "GRP15916" "GRP614" "GRP11432"
[7] "" "" ""
[10] "GRP14956" "GRP10537" "GRP10096"
[13] "GRP123" "GRP1234" ""
[16] "GRP536" "" ""
[19] "" "" ""
[22] "" "" ""
[25] "" "grp12312"
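The output above still differs from the desired vector in three places (GRP614, GRP123, GRP536). A minimal sketch of one way to close that gap, assuming a word boundary before GRP and a repeated digit group are acceptable; it swaps stringr for base R's regmatches() and gregexpr(), and the helper name extract_grp is made up for illustration:

```r
# A subset of the inputs from the question
string <- c("GRP-14994/", "GRP-7056 GRP-7036/", "GRP-614-250B/",
            "SGRP123", "AC-3060 GRP 536-143/Old AC", "Tourn")

# \\b       : GRP must start a word, which rejects "SGRP123"
# (?:...)+  : one or more digit runs, each optionally preceded by "-" or whitespace
pattern <- "\\bGRP(?:[-\\s]?\\d+)+"

# One character vector of matches per input string
matches <- regmatches(string,
                      gregexpr(pattern, string, ignore.case = TRUE, perl = TRUE))

# Hypothetical helper: strip separators inside each match,
# then join several matches from the same string with a space
extract_grp <- function(m) {
  paste(gsub("[-\\s]", "", m, perl = TRUE), collapse = " ")
}

result <- vapply(matches, extract_grp, character(1))
print(result)
# [1] "GRP14994"        "GRP7056 GRP7036" "GRP614250"       ""
# [5] "GRP536143"       ""
```

Strings without a match yield an empty character vector, which paste(..., collapse = " ") turns into "", so no separate step is needed to blank them.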
Answer 1 (score: 1)
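Your pattern:
(.*)(\\b)(GRP)(-|\\s|)(\\d+)(\\/|\\b)(.*)","\\3\\5
Errors found in your pattern:
1. You want to capture something like GRP-668-888, but your pattern only allows a hyphen followed by a single number, i.e. GRP-668.
2. Since you are not using the other words, there is no need for the greedy expression (.*) before and after the pattern. You can simply use ", since it is always at GRP.
3. Also, the boundary \\b before (GRP) is not needed in your pattern.
These are the important ones I can detect for now.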
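Points 1 and 3 can be checked on small inputs. The following is a rough sketch rather than part of the original answer; the variable names are made up, and perl = TRUE is assumed so that \\s inside a bracket expression and the (?:...) group behave as in PCRE. Whether the boundary can be dropped depends on the data: the question's desired output lists "" for "SGRP123".

```r
x <- c("GRP-668-888", "SGRP123")

# One optional separator plus one digit run: stops after the first number
one_group <- "GRP[-\\s]?\\d+"
# Repeated group: keeps consuming "-digits" or " digits" runs
many_groups <- "GRP(?:[-\\s]?\\d+)+"

first_only <- regmatches(x[1], regexpr(one_group, x[1], perl = TRUE))
all_groups <- regmatches(x[1], regexpr(many_groups, x[1], perl = TRUE))
print(first_only)  # "GRP-668"
print(all_groups)  # "GRP-668-888"

# Effect of the leading word boundary on "SGRP123"
without_b <- grepl("GRP\\d+", x[2], perl = TRUE)
with_b    <- grepl("\\bGRP\\d+", x[2], perl = TRUE)
print(without_b)   # TRUE: GRP123 is found inside SGRP123
print(with_b)      # FALSE: no word boundary between S and G
```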
You can also try the pattern below:
gsub("(grp)[-\\s]?(\\d+)[-\\s]?(\\d+)","\\1\\2\\3", string, ignore.case = T, perl = T)
grp: matches the literal text grp in the string
[-\s]?: matches a hyphen - or a whitespace character \s, optionally
(\d+): captures one or more digits
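To see how these pieces work together, here is a small sketch on two of the question's inputs (the variable names are illustrative, and perl = TRUE is an assumption so that \\s inside the brackets is read as whitespace). Note that gsub() rewrites only the matched portion and leaves non-matching strings unchanged, so extra handling would still be needed to reach the desired output.

```r
x <- c("GRP-614-250B/", "Tourn")

# Group 1 = "GRP", group 2 = "614", group 3 = "250";
# the separators fall outside the groups, so the replacement drops them
res <- gsub("(grp)[-\\s]?(\\d+)[-\\s]?(\\d+)", "\\1\\2\\3", x,
            ignore.case = TRUE, perl = TRUE)
print(res)
# [1] "GRP614250B/" "Tourn"
```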
See the DEMO.