Question

I'm trying to get some regular expressions that I wrote for Python to work in R as well.

Here is what I use in Python (with the excellent re module), along with the 3 matches I expect:

import re
line = 'VARIABLES = "First [T]" "Second [L]" "Third [1/T]"'
re.findall('"(.*?)"', line)
# ['First [T]', 'Second [L]', 'Third [1/T]']

Now with R, this is my best attempt:

line <- 'VARIABLES = "First [T]" "Second [L]" "Third [1/T]"'
m <- gregexpr('"(.*?)"', line)
regmatches(line, m)[[1]]
# [1] "\"First [T]\""   "\"Second [L]\""  "\"Third [1/T]\""

Why does R match the whole pattern instead of just what is inside the parentheses? I was expecting:

[1] "First [T]"   "Second [L]"  "Third [1/T]"

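Also, perl=TRUE makes no difference. Is it safe to assume that R's regex functions cannot return only what is matched inside the parentheses, or is there some trick I'm missing?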

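Solution summary: thanks to @flodel. It seems to work for other patterns too, so it looks like a good general solution. A compact form of the solution, using the input string line and the regex pattern pat, is: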

pat <- '"(.*?)"'
sub(pat, "\\1", regmatches(line, gregexpr(pat, line))[[1]])

Also, perl=TRUE should be added to gregexpr if PCRE features are used in pat.
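The compact form can be wrapped in a small function. This is only a sketch: extract_capture is a hypothetical name, not a base R function, and it assumes a single input string and a pattern whose first capture group is the part of interest. It passes the same perl flag to both gregexpr and sub, on the assumption that a pattern needing PCRE needs it in both calls.

```r
# Hypothetical helper (not part of base R): returns the first capture group
# of every match of `pat` in the single string `x`.
extract_capture <- function(x, pat, perl = FALSE) {
  # Step 1: pull out the full matches (for the example, quotes included).
  matches <- regmatches(x, gregexpr(pat, x, perl = perl))[[1]]
  # Step 2: re-apply the pattern to each match and keep only group 1.
  sub(pat, "\\1", matches, perl = perl)
}

line <- 'VARIABLES = "First [T]" "Second [L]" "Third [1/T]"'
res <- extract_capture(line, '"(.*?)"')
print(res)
```

Because the pattern is applied a second time to each isolated match, patterns that rely on surrounding context (lookahead or lookbehind, for instance) may not behave the same way in step 2.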

Answer 1

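If you print m, you will see that gregexpr(..., perl = TRUE) gives you the positions and lengths of the matches for a) your full pattern, including the leading and closing quotes, and b) the captured group (.*?).

Unfortunately, when m is used by regmatches, it uses the positions and lengths of the former.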
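To see the two sets of positions side by side, the attributes of m can be pulled out directly. This is a small illustrative sketch; the variable names (full_start, cap_start and so on) are made up for the example.

```r
line <- 'VARIABLES = "First [T]" "Second [L]" "Third [1/T]"'
m <- gregexpr('"(.*?)"', line, perl = TRUE)

# a) Positions and lengths of the full matches (quotes included).
#    These are what regmatches() uses.
full_start  <- as.integer(m[[1]])
full_length <- as.integer(attr(m[[1]], "match.length"))

# b) Positions and lengths of the captured group (quotes excluded).
#    These attributes are only present when perl = TRUE.
cap_start  <- as.integer(attr(m[[1]], "capture.start"))
cap_length <- as.integer(attr(m[[1]], "capture.length"))

# Each capture starts one character after its full match and is two
# characters shorter, because the two quotes are left out.
print(rbind(full_start, full_length, cap_start, cap_length))
```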

I can think of two solutions.

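Pass your final output through sub:

line <- 'VARIABLES = "First [T]" "Second [L]" "Third [1/T]"'
m <- gregexpr('"(.*?)"', line, perl = TRUE)
z <- regmatches(line, m)[[1]]
sub('"(.*?)"', "\\1", z)

Or use substring with the positions and lengths of the captured expressions:

start.pos <- attr(m[[1]], "capture.start")
end.pos <- start.pos + attr(m[[1]], "capture.length") - 1L
substring(line, start.pos, end.pos)

To further your understanding, see what happens if your pattern tries to capture more than one thing. Also note that you can give your capture groups names (what the documentation calls Python-style named captures), for example "capture1" and "capture2".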
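A minimal sketch of such a pattern follows. The pattern itself is an assumption made for illustration: it splits each quoted entry into the text before the square brackets ("capture1") and the text inside them ("capture2"). With more than one group, the capture.start and capture.length attributes are matrices with one column per group, and the columns can be selected by group name.

```r
line <- 'VARIABLES = "First [T]" "Second [L]" "Third [1/T]"'

# Two named groups: the label before the brackets and the bracket contents.
pat <- '"(?P<capture1>.*?) \\[(?P<capture2>.*?)\\]"'
m <- gregexpr(pat, line, perl = TRUE)

# One row per match, one column per capture group.
start.pos <- attr(m[[1]], "capture.start")
end.pos <- start.pos + attr(m[[1]], "capture.length") - 1L

first_group  <- substring(line, start.pos[, "capture1"], end.pos[, "capture1"])
second_group <- substring(line, start.pos[, "capture2"], end.pos[, "capture2"])

print(first_group)
print(second_group)
```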

Answer 2

1) strapplyc in the gsubfn package behaves the way you expect:

> library(gsubfn)
> strapplyc(line, '"(.*?)"')[[1]]
[1] "First [T]"   "Second [L]"  "Third [1/T]"

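2) Although it involves delving into the attributes of m, regmatches can be made to work by reconstructing m so that it refers to the captures rather than the whole match (m must have been created with perl = TRUE so that the capture attributes exist):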

at <- attributes( m[[1]] )
m2 <- list( structure( c(at$capture.start), match.length = at$capture.length ) )

regmatches( line, m2 )[[1]]

3) If we know that the strings always end in ] and are willing to modify the regular expression, then this would work:

> m3 <- gregexpr('[^"]*]', line)
> regmatches( line, m3 )[[1]]
[1] "First [T]"   "Second [L]"  "Third [1/T]"

Regex matching only what is inside the parentheses
