strsplit inconsistent with gregexpr

Time: 2014-05-31 11:16:08

Tags: regex r pcre strsplit

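A comment on my answer to this question points out that the answer, which should give the desired result using strsplit, does not, even though the regex seems to correctly match only the first and last commas in the character vector. This can be demonstrated using gregexpr and regmatches.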

So why does strsplit split on every comma in this example, even though regmatches returns only two matches for the same regular expression?

#  We would like to split on the first comma and
#  the last comma (positions 4 and 13 in this string)
x <- "123,34,56,78,90"

#  Splits on every comma. Must be wrong.
strsplit( x , '^\\w+\\K,|,(?=\\w+$)' , perl = TRUE )[[1]]
#[1] "123" "34"  "56"  "78"  "90" 


#  Ok. Let's check the positions of matches for this regex
m <- gregexpr( '^\\w+\\K,|,(?=\\w+$)' , x , perl = TRUE )

#  Matching positions are at
unlist(m)
#[1]  4 13

#  And extracting them...
regmatches( x , m )
#[[1]]
#[1] "," ","

Huh? What is going on?

1 answer:

Answer 0 (score: 10)

@Aprillion's theory is accurate. From the R documentation:

  

The algorithm applied to each input string is

repeat {
    if the string is empty
        break.
    if there is a match
        add the string to the left of the match to the output.
        remove the match and all to the left of it.
    else
        add the string to the output.
        break.
}

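In other words, each iteration removes the match and everything to its left, so ^ matches the beginning of the remaining string (without the previous items). The regex is therefore never applied to the original string as a whole, and the first alternative, ^\\w+\\K, matches again at every comma.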
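The documented loop can be sketched directly in R to check this reading. This is only a rough approximation, not the actual implementation of strsplit: the helper name split_sketch is made up for illustration, it uses regexpr() to find the first match in whatever is left of the string, and it deliberately refuses zero-length matches, which the real function treats specially.

```r
# Hypothetical helper (not part of base R): a sketch of the documented loop.
split_sketch <- function(x, pattern) {
  out <- character(0)
  repeat {
    # "if the string is empty, break"
    if (!nzchar(x)) break
    # Look for the first match in what is left of the string
    m <- regexpr(pattern, x, perl = TRUE)
    if (m[1] == -1L) {
      # No match: add the remaining string to the output and stop
      out <- c(out, x)
      break
    }
    len <- attr(m, "match.length")
    # Assumption of this sketch: zero-length matches are simply not handled
    if (len == 0L) stop("zero-length matches are not handled in this sketch")
    # Add the string to the left of the match to the output
    out <- c(out, substr(x, 1L, m[1] - 1L))
    # Remove the match and everything to the left of it
    x <- substr(x, m[1] + len, nchar(x))
  }
  out
}

res <- split_sketch("123,34,56,78,90", '^\\w+\\K,|,(?=\\w+$)')
print(res)
```

After the first pass the remaining string is "34,56,78,90", so ^\\w+\\K, matches again at its first comma, and so on until only "90" is left. The sketch therefore returns the same five pieces as strsplit does in the question.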

To illustrate this behaviour simply:

> x <- "12345"
> strsplit( x , "^." , perl = TRUE )
[[1]]
[1] "" "" "" "" ""

Here you can see the consequences of this behaviour when a lookahead assertion is used as the delimiter (thanks to @JoshO'Brien for the link).
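For the original goal of splitting only at the first and last comma, one possible workaround (not part of the answer above, so treat it as a suggestion) is to keep the gregexpr match data from the question and ask regmatches for the text between the matches via its invert argument. The variable name parts is just for illustration.

```r
x <- "123,34,56,78,90"

# Same regex as in the question; gregexpr() scans the whole string once,
# so only the first and the last comma are reported as matches
m <- gregexpr('^\\w+\\K,|,(?=\\w+$)', x, perl = TRUE)

# invert = TRUE returns the pieces between the matches
# instead of the matches themselves
parts <- regmatches(x, m, invert = TRUE)[[1]]
print(parts)
# [1] "123"      "34,56,78" "90"
```

This sidesteps the issue because the string is never shortened between matches, so ^ can only match at the true start of x.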