Need to find a pattern and extract it

Time: 2019-02-14 15:41:23

Tags: r regex

My data frame has these rows

"110231 validation 108871 validation 85933"
"21102 validation 93442 21232 validation 73769 26402 validation 127221 26402"
"99763 99763 validation 99763 validation 99763"
"validation 199022 validation 122099 validation 12209 validation 199022 validation 199022 validation 122099"

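Each comma-separated string is a new row. I need to extract the first "validation" and the number that follows it from each row. How can I do this?

The expected output for each row should be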

"validation 108871"
"validation 93442"
"validation 99763"
"validation 199022"

1 answer:

Answer 0: (score: 1)

I'll offer two approaches.

First, I'll work with a character vector. If your data is in a data frame, substitute myframe$mycolumn.

v <- c("110231 validation 108871 validation 85933",
"21102 validation 93442 21232 validation 73769 26402 validation 127221 26402",
"99763 99763 validation 99763 validation 99763",
"validation 199022 validation 122099 validation 12209 validation 199022 validation 199022 validation 122099")

Locate the "validation" matches:

re <- gregexpr("validation [0-9]+", v)
re
# [[1]]
# [1]  8 26
# attr(,"match.length")
# [1] 17 16
# attr(,"index.type")
# [1] "chars"
# attr(,"useBytes")
# [1] TRUE
# [[2]] ...

We can use regmatches to extract the matched substrings:

regmatches(v, re)
# [[1]]
# [1] "validation 108871" "validation 85933" 
# [[2]]
# [1] "validation 93442"  "validation 73769"  "validation 127221"
# [[3]]
# [1] "validation 99763" "validation 99763"
# [[4]]
# [1] "validation 199022" "validation 122099" "validation 12209" 
# [4] "validation 199022" "validation 199022" "validation 122099"

Now we have a list in which each of your strings yields one or more matched substrings. We can walk the list and take only the first element of each.

sapply(regmatches(v, re), `[`, 1)
# [1] "validation 108871" "validation 93442"  "validation 99763" 
# [4] "validation 199022"
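If the strings live in a data frame column, as mentioned at the start, the same steps can be wrapped in a small helper. This is only a sketch: myframe and mycolumn are the placeholder names used above, and first_validation and first_match are hypothetical names introduced here for illustration.

```r
# Hypothetical data frame; `myframe` and `mycolumn` are placeholder names.
myframe <- data.frame(
  mycolumn = c("110231 validation 108871 validation 85933",
               "99763 99763 validation 99763 validation 99763"),
  stringsAsFactors = FALSE
)

# Return the first "validation <digits>" substring of each string.
# Indexing an empty match vector with [1] gives NA, so non-matching
# strings come back as NA rather than causing an error.
first_validation <- function(x) {
  re <- gregexpr("validation [0-9]+", x)
  vapply(regmatches(x, re), function(m) m[1], character(1))
}

myframe$first_match <- first_validation(myframe$mycolumn)
myframe$first_match
# [1] "validation 108871" "validation 99763"
```

vapply is used here instead of sapply only so that the result is guaranteed to be a character vector; sapply gives the same values for this input.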

This should not fail even if a string does not contain the pattern:

v <- c(v, "nothing here")
re <- gregexpr("validation [0-9]+", v)
sapply(regmatches(v, re), `[`, 1)
# [1] "validation 108871" "validation 93442"  "validation 99763" 
# [4] "validation 199022" NA                 

The NA indicates that there was no match, while still keeping that string's position in the result vector.
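Since only the first match per string is needed, regexpr (which reports just the first match) may be a lighter alternative to gregexpr. The sketch below is not part of the approach above; it assumes a shortened example vector, and hit and out are names made up for illustration. Note that regmatches drops non-matching elements when given regexpr output, so the result is filled in by position to keep the NA placeholders.

```r
v <- c("110231 validation 108871 validation 85933",
       "validation 199022 validation 122099",
       "nothing here")

# regexpr returns the start of the first match per string, or -1 if none.
m <- regexpr("validation [0-9]+", v)

# Logical index of the strings that matched (as.integer drops attributes).
hit <- as.integer(m) != -1L

# Start with all NA, then fill in the matches at their original positions.
out <- rep(NA_character_, length(v))
out[hit] <- regmatches(v, m)
out
# [1] "validation 108871" "validation 199022" NA
```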

gsub

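First, strip the leading digits/spaces up to, but not including, the first "validation" (output shown for the original four strings):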

gsub("^[0-9 ]*(?=validation)", "", v, perl=TRUE)
# [1] "validation 108871 validation 85933"                                                                        
# [2] "validation 93442 21232 validation 73769 26402 validation 127221 26402"                                     
# [3] "validation 99763 validation 99763"                                                                         
# [4] "validation 199022 validation 122099 validation 12209 validation 199022 validation 199022 validation 122099"

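Now remove everything after the first number. The last digit of that number is captured and put back with "\\1"; replacing with "" would drop it:

gsub("([0-9])\\b.*", "\\1", gsub("^[0-9 ]*(?=validation)", "", v, perl=TRUE))
# [1] "validation 108871" "validation 93442"  "validation 99763"  "validation 199022"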
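The two substitutions can also be collapsed into a single sub call with a capture group. This is a sketch of an alternative, not the method above: the lazy ".*?" (which requires perl = TRUE) skips as little as possible before the first "validation", and pat and one_step are hypothetical names. Unlike the regmatches approach, sub returns non-matching strings unchanged, so they are set to NA explicitly here.

```r
v <- c("110231 validation 108871 validation 85933",
       "21102 validation 93442 21232 validation 73769 26402 validation 127221 26402",
       "nothing here")

# Lazy prefix, captured first "validation <digits>", then the rest of the string.
pat <- "^.*?(validation [0-9]+).*$"

# Keep only the captured group.
one_step <- sub(pat, "\\1", v, perl = TRUE)

# sub() leaves strings without a match untouched; mark those as NA.
one_step[!grepl("validation [0-9]+", v)] <- NA_character_
one_step
# [1] "validation 108871" "validation 93442"  NA
```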