R: Extracting subexpression matches from a regular expression

Asked: 2013-01-24 04:47:21

Tags: regex r

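I want to use a single regular expression to extract a few pieces of data from a string. I wrote a pattern that includes those pieces as parenthesized subexpressions. In a Perl-like environment I would simply assign the subexpressions to variables with code like myvar1=$1; myvar2=$2; and so on, but how do I do this in R? So far the only way I have found to access these matches is through regexec. That is not very convenient, because regexec does not support Perl syntax, among other reasons. This is what I do now: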

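# Return the text of the n-th parenthesized subexpression.
# rex is the result of regexec(); its first element holds the start
# positions (full match first, then each group) and a 'match.length' attribute.
getoccurence <- function(text, rex, n) {
  occstart <- rex[[1]][n + 1]
  occstop  <- occstart + attr(rex[[1]], 'match.length')[n + 1] - 1
  occtext  <- substr(text, occstart, occstop)
  return(occtext)
}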
mytext <- "junk text, 12.3456, -01.234, valuable text before comma, all the rest"
mypattern <- "([0-9]+\\.[0-9]+), (-?[0-9]+\\.[0-9]+), (.*),"
rez <- regexec(mypattern, mytext)
var1 <- getoccurence(mytext, rez, 1)  
var2 <- getoccurence(mytext, rez, 2)  
var3 <- getoccurence(mytext, rez, 3)  
Obviously this is a very clumsy solution, and there should be something better. I would appreciate any suggestions.

3 answers:

Answer 0 (score: 2)

Have you looked at regmatches?

> regmatches(mytext, rez)
[[1]]
[1] "12.3456, -01.234, valuable text before comma," "12.3456"                                      
[3] "-01.234"                     "valuable text before comma"                   

> sapply(regmatches(mytext, rez), function(x) x[4])
[1] "valuable text before comma"
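
As a minimal sketch of how the regmatches result could be assigned to separate variables, much like $1, $2 and $3 in Perl: the code below reuses mytext and mypattern from the question, while the names m, var1 to var3, pm and groups_perl are only illustrative. The second half is an alternative that was not part of this answer: it assumes that regexpr called with perl = TRUE attaches 'capture.start' and 'capture.length' attributes, which would allow Perl syntax in the pattern.

```r
mytext <- "junk text, 12.3456, -01.234, valuable text before comma, all the rest"
mypattern <- "([0-9]+\\.[0-9]+), (-?[0-9]+\\.[0-9]+), (.*),"

# regmatches() returns one character vector per input string:
# element 1 is the full match, elements 2..4 are the three groups.
m <- regmatches(mytext, regexec(mypattern, mytext))[[1]]
var1 <- m[2]
var2 <- m[3]
var3 <- m[4]

# Alternative (assumption, not from the answer): a Perl-compatible match.
# With perl = TRUE, regexpr() records where each capture group starts
# and how long it is, so the groups can be cut out with substring().
pm <- regexpr(mypattern, mytext, perl = TRUE)
starts <- as.vector(attr(pm, "capture.start"))
lens   <- as.vector(attr(pm, "capture.length"))
groups_perl <- substring(mytext, starts, starts + lens - 1)
```

Both routes should give the same three strings for this input; the regexpr route handles only the first match in each string.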

Answer 1 (score: 1)

In stringr, this is str_match, or str_match_all if you want every match of the pattern in the string. str_match returns a matrix; str_match_all returns a list of matrices.
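
The code that originally accompanied this answer did not survive, so the following is only a sketch of how str_match and str_match_all might be applied to the question's data, assuming the stringr package is installed. The variable names m, all_m and var1 to var3 are illustrative.

```r
library(stringr)

mytext <- "junk text, 12.3456, -01.234, valuable text before comma, all the rest"
mypattern <- "([0-9]+\\.[0-9]+), (-?[0-9]+\\.[0-9]+), (.*),"

# str_match() returns a character matrix with one row per input string:
# column 1 is the full match, columns 2..4 are the capture groups.
m <- str_match(mytext, mypattern)
var1 <- m[1, 2]
var2 <- m[1, 3]
var3 <- m[1, 4]

# str_match_all() returns a list with one such matrix per input string,
# holding one row for every match found in that string.
all_m <- str_match_all(mytext, mypattern)
```

Because the pattern matches only once in mytext, all_m[[1]] has a single row here.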

Answer 2 (score: 1)

strapply and strapplyc in the gsubfn package can do this in one step:

> strapplyc(mytext, mypattern)
[[1]]
[1] "12.3456"                    "-01.234"                   
[3] "valuable text before comma"

> # with simplify = c argument
> strapplyc(mytext, mypattern, simplify = c)
[1] "12.3456"                    "-01.234"                   
[3] "valuable text before comma"

> # extract second element only 
> strapply(mytext, mypattern, ... ~ ..2)
[[1]]
[1] "-01.234"

> # specify function slightly differently and use simplify = c
> strapply(mytext, mypattern, ... ~ list(...)[2], simplify = c)
[1] "-01.234"

> # same
> strapply(mytext, mypattern, x + y + z ~ y, simplify = c)
[1] "-01.234"

> # same but also convert to numeric - also can use with other variations above
> strapply(mytext, mypattern, ... ~ as.numeric(..2), simplify = c)
[1] -1.234

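In the examples above, the third argument can be a function or, as in the examples, a formula that is converted to a function (the LHS gives the arguments and the RHS is the body).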
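
To illustrate the function form, here is a sketch of what the formula x + y + z ~ y might look like written as an ordinary R function. It assumes the gsubfn package is installed and that, for a pattern with capture groups, strapply passes each group to the function as a separate argument; the name second is illustrative.

```r
library(gsubfn)

mytext <- "junk text, 12.3456, -01.234, valuable text before comma, all the rest"
mypattern <- "([0-9]+\\.[0-9]+), (-?[0-9]+\\.[0-9]+), (.*),"

# Explicit function in place of the formula x + y + z ~ y:
# x, y and z receive the three capture groups, and the second is returned.
second <- strapply(mytext, mypattern, function(x, y, z) y, simplify = c)
```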