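Extracting strings between braces in R

Time: 2013-12-19 21:20:42

Tags: regex r

I need to extract the values that sit between a very specific pair of delimiters in R. For example: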

 a <- "{1:0987617820}{2:q312132498s7yd09f8sydf987s6df8797yds9f87098}
{3:{112:123123214321}}{4:20:asdasd3214213}"

This is my sample string. I want to extract the text between {[0-9]: and }, so that the output for the string above looks like this:

## output should be 
"0987617820" "q312132498s7yd09f8sydf987s6df8797yds9f87098" "{112:123123214321}" "20:asdasd3214213"

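3 answers:

Answer 0: (score: 3)

This is a horrible hack and may well break on your real data. Ideally you would use a parser, but if you are stuck with regular expressions... well... it isn't pretty.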

a <- "{1:0987617820}{2:q312132498s7yd09f8sydf987s6df8797yds9f87098}
{3:{112:123123214321}}{4:20:asdasd3214213}"

# split based on }{ allowing for newlines and spaces
out <- strsplit(a, "\\}[[:space:]]*\\{")
# Make a single vector
out <- unlist(out)
# Have an excess open bracket in first
out[1] <- substring(out[1], 2)
# Have an excess closing bracket in last
n <- length(out)
out[n] <- substring(out[n], 1, nchar(out[n]) - 1)
# Remove the number colon at the beginning of the string
answer <- gsub("^[0-9]*\\:", "", out)

which gives

> answer
[1] "0987617820"                                 
[2] "q312132498s7yd09f8sydf987s6df8797yds9f87098"
[3] "{112:123123214321}"                         
[4] "20:asdasd3214213"

If you need to do this for several strings, you can wrap something like this in a function.
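A minimal sketch of such a wrapper follows. The name `extract_fields` is hypothetical (it is not part of the answer), and the function assumes each input is a single non-empty string that starts with `{` and ends with `}`, as in the sample data.

```r
# Hypothetical wrapper around the steps above; assumes x is one string
# that begins with "{" and ends with "}".
extract_fields <- function(x) {
  # Split on "}{", allowing whitespace or newlines between the braces
  parts <- unlist(strsplit(x, "\\}[[:space:]]*\\{"))
  n <- length(parts)
  # Drop the excess opening brace from the first piece
  parts[1] <- substring(parts[1], 2)
  # Drop the excess closing brace from the last piece
  parts[n] <- substring(parts[n], 1, nchar(parts[n]) - 1)
  # Remove the leading "<digits>:" tag from every piece
  sub("^[0-9]*:", "", parts)
}

a <- "{1:0987617820}{2:q312132498s7yd09f8sydf987s6df8797yds9f87098}
{3:{112:123123214321}}{4:20:asdasd3214213}"
extract_fields(a)
```

For a character vector of several such strings, `lapply(strings, extract_fields)` would return one result per string. The same caveat applies as for the original steps: this is still a split-based hack and may break on data shaped differently from the sample.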

Answer 1: (score: 1)

Using Perl-compatible regular expressions (perl=TRUE). This approach is a bit more robust.

a = "{1:0987617820}{2:q312132498s7yd09f8sydf987s6df8797yds9f87098}{3:{112:123123214321}}{4:20:asdasd3214213}"

foohacky = function(str){
    # replace each opening "{<digit>:" tag (plus any extra leading braces) with the marker "@@"
    pt1 = gsub('\\{+[0-9]:', '@@',str)
    # remove a closing brace that is preceded by an alphanumeric character
    pt2 = gsub('([0-9a-zA-Z])(\\})', '\\1',pt1, perl=TRUE)
    # split on the marker and drop the empty first element
    pt3 = strsplit(pt2, "@@")[[1]][-1]
    pt3
}

For example:

> foohacky(a)
[1] "0987617820"                                 
[2] "q312132498s7yd09f8sydf987s6df8797yds9f87098"
[3] "{112:123123214321}"                         
[4] "20:asdasd3214213"

It also works with nested braces:

> a = "{1:0987617820}{{3:{112:123123214321}}{4:{20:asdasd3214213}}"
> foohacky(a)
[1] "0987617820"         "{112:123123214321}" "{20:asdasd3214213}"

Answer 2: (score: 1)

Here is a more general approach. It returns whatever lies between {[0-9]: and }, and it allows one nested {} inside the match.

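# Lookbehind for "{<digit>:", then one nested {...} group or the shortest run of text, then a lookahead for "}"
regPattern <- gregexpr("(?<=\\{[0-9]\\:)(\\{.*\\}|.*?)(?=\\})", a, perl=TRUE)
# Pull out the matched substrings (a list with one element per input string)
a_parse <- regmatches(a, regPattern)
# Flatten to a character vector (note that this overwrites a)
a <- unlist(a_parse)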
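For comparison with the regex approaches, here is a rough sketch of the parser idea mentioned in the first answer. It is not taken from any of the answers. The name `extract_top_level` is hypothetical, and the code assumes the braces in the input are balanced. It tracks brace depth one character at a time, collects each top-level {...} group, and then strips the leading numeric tag.

```r
# Hypothetical helper, not from the answers above: collect every top-level
# {...} group by tracking brace depth. Assumes the braces are balanced.
extract_top_level <- function(x) {
  chars <- strsplit(x, "")[[1]]
  depth <- 0L
  start <- NA_integer_
  fields <- character(0)
  for (i in seq_along(chars)) {
    if (chars[i] == "{") {
      depth <- depth + 1L
      # Remember where the content of a top-level group begins
      if (depth == 1L) start <- i + 1L
    } else if (chars[i] == "}") {
      # A closing brace at depth 1 ends a top-level group
      if (depth == 1L) fields <- c(fields, substr(x, start, i - 1L))
      depth <- max(depth - 1L, 0L)
    }
  }
  # Remove the leading "<digits>:" tag from each group
  sub("^[0-9]+:", "", fields)
}

s <- "{1:0987617820}{2:q312132498s7yd09f8sydf987s6df8797yds9f87098}
{3:{112:123123214321}}{4:20:asdasd3214213}"
extract_top_level(s)
```

Because it counts depth instead of pattern matching, this sketch does not depend on how many levels of braces are nested inside a group, as long as the braces are balanced. Characters outside any group, such as the newline in the sample, are ignored.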