Why does 'hallo\nworld' match both \n and \\n in R?

Time: 2013-12-07 07:32:13

Tags: regex r escaping

Why does grep treat \n and \\n the same way?

For example, both match hallo\nworld.

grep("hallo\nworld", pattern="\n")
[1] 1
grep("hallo\nworld", pattern="\\n")
[1] 1

I can see that hallo\nworld is parsed as

hallo  
world

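That is, hallo is on one line and world is on the next.

So in grep("hallo\nworld", pattern="\n"), is pattern="\n" a newline or a literal \n?

Also note that other escapes behave the same way: \a \f \n \t \r and \\a \\f \\n \\t \\r are all treated identically. But \d \w \s cannot be used! Why not?

I picked different strings to test with, and I found the secret in the concept of regular expressions.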

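There are two notions of escaping. One is escaping inside a string, which is easy to understand; the other is escaping inside a regular-expression pattern string. In R, in a pattern such as grep(x, pattern=" some string here "), \n and \\n both mean a newline. In an ordinary string, however, \\n != \n: the former is a literal \n, the latter is a newline. We can show this with:

cat("\n")
cat("\\n")

How can this be demonstrated? I will try other characters, not just \n, to see whether they are matched in the same way: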

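# The same escapes written with one backslash (R string escape)
# and with two backslashes (left for the regex engine to interpret)
special1 <- c( "\a", "\f", "\n", "\t", "\r")
special2 <- c("\\a","\\f","\\n","\\t","\\r")
target <- paste("hallo", special1, "world", sep="")
for (i in seq_along(target)){
    cat("i=", i, "\n")
    # grepl() returns TRUE/FALSE, so if() stays valid even when nothing matches
    if (grepl(target[i], pattern=special1[i]))
        print(paste(target[i], "match", special1[i], "succeed"))
    if (grepl(target[i], pattern=special2[i]))
        print(paste(target[i], "match", special2[i], "succeed"))
}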

Output:

i= 1
[1] "hallo\aworld match \a succeed"
[1] "hallo\aworld match `\\a` succeed"
i= 2
[1] "hallo\fworld match \f succeed"
[1] "hallo\fworld match `\\f` succeed"
i= 3
[1] "hallo\nworld match \n succeed"
[1] "hallo\nworld match `\\n` succeed"
i= 4
[1] "hallo\tworld match \t succeed"
[1] "hallo\tworld match `\\t` succeed"
i= 5
[1] "hallo\rworld match \r succeed"
[1] "hallo\rworld match `\\r` succeed"

Notice that \a \f \n \t \r and \\a \\f \\n \\t \\r are treated exactly the same in an R regular-expression pattern string!

Not only that, you cannot write \d \w \s in an R regular-expression pattern string!
You can write any of these:

pattern="\a" pattern="\f" pattern="\n" pattern="\t" pattern="\r"

But you cannot write any of these in grep:

pattern="\d" pattern="\w" pattern="\s"

I think this is a bug too, because \d \w \s are not treated the same as \a \f \n \t \r.

5 answers:

Answer 0: (score: 10)

The reason \n, \\n and \\\n all match is that the search pattern is evaluated twice. I observed this by running a few examples:

> grep("hello\nworld", pattern="\n")
[1] 1
> grep("hello\nworld", pattern="\\n")
[1] 1
> grep("hello\nworld", pattern="\\\n")
[1] 1
> grep("hello\nworld", pattern="\\\\n")
integer(0)
> grep("hello\\nworld", pattern="\\\\n")
[1] 1

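Keep in mind the rules for evaluating backslash escape sequences:

\\ is replaced by \
\n is replaced by the NEWLINE character
\ + NEWLINE is replaced by the NEWLINE character
(see the documentation in ?regex for details)

With that in mind, if you evaluate the pattern twice, you get:

\n => NEWLINE => NEWLINE
\\n => \n => NEWLINE
\\\n => \ + NEWLINE => NEWLINE
\\\\n => \\n => \n
\\\\\n => \\ + NEWLINE => \ + NEWLINE
\\\\\\n => \\\n => \ + NEWLINE
\\\\\\\n => \\\ + NEWLINE => \ + NEWLINE
\\\\\\\\n => \\\\n => \\n

And so on. Examples 1-3 all evaluate to a single NEWLINE, which is why those patterns match. (The string you are matching the pattern against, by contrast, is evaluated only once.)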
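
The first four rows of that table can be checked from the R prompt. This is only a sketch, assuming the default regex engine (neither perl = TRUE nor fixed = TRUE); the variable names patterns, x and matches are made up for illustration:

```r
# Patterns as typed in R source: 1, 2, 3 and 4 backslashes before the n.
patterns <- c("\n", "\\n", "\\\n", "\\\\n")

# Pass 1 (R parser): the number of characters each string really holds.
print(nchar(patterns))

# Pass 2 (regex engine): does each pattern match a string with a real newline?
x <- "hello\nworld"
matches <- vapply(patterns, function(p) grepl(p, x), logical(1),
                  USE.NAMES = FALSE)
print(matches)

# The 4-backslash pattern matches a literal backslash followed by n instead.
print(grepl("\\\\n", "hello\\nworld"))
```

The character counts show what the parser leaves behind, and the grepl results should agree with the grep calls above: the first three patterns match, the fourth does not.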

A discussion on the R mailing list

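@Aaron explains this double evaluation as follows:

There are two levels of evaluation because the backslash is the escape character for both R strings and regular expressions.

Note that other languages do not evaluate patterns this way. Take Python, for example: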

>>> import re
>>> re.search(r'\n', 'hello\nworld') is not None
True
>>> re.search(r'\\n', 'hello\nworld') is not None
False

Or Perl:

$ perl -e 'print "hello\nworld" =~ /\n/ || 0, "\n"'
1
$ perl -e 'print "hello\nworld" =~ /\\n/ || 0, "\n"'
0

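We could go on. So the double evaluation in R seems unusual. Why is it implemented this way? I think the definitive answer lies with R-devel.

Acknowledgements

I thank @Aaron, whose critical comments helped improve this answer.

Answer 1: (score: 4)

Note that the backslash itself is special, so you have to escape a backslash with another backslash.

\\n means "I really want to match a newline character, not a literal \n"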

grep("hallo\nworld", pattern = "\\n")
[1] 1

grep("hallo\\nworld", pattern = "\\\\n")
[1] 1
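
One way to separate the two layers is grepl's fixed = TRUE argument, which turns off regex interpretation so that only R's string escapes apply. A small sketch, with illustrative variable names:

```r
# Subject strings: one holds a real newline, the other a backslash followed by n.
with_newline   <- "hallo\nworld"
with_backslash <- "hallo\\nworld"

# Regex mode: the two-character pattern \n is read by the regex engine as a newline.
regex_hits <- c(grepl("\\n", with_newline),
                grepl("\\n", with_backslash))

# Fixed mode: the same two characters are matched literally.
fixed_hits <- c(grepl("\\n", with_newline, fixed = TRUE),
                grepl("\\n", with_backslash, fixed = TRUE))

print(regex_hits)
print(fixed_hits)
```

In regex mode only the string with the real newline should match; in fixed mode only the string with the literal backslash should.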

Answer 2: (score: 4)

Following up on hwnd's answer, take a look at the following:

cat("x\ny")
## x
## y
cat("x\\ny")
## x\ny
grep("hallo\nworld", pattern="[\n]")
## [1] 1
grep("hallo\nworld", pattern="[\\n]")
## integer(0)

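So: "\n" is a literal newline, and "\\n" is backslash + n, which grep interprets as a newline. That is why a match is found in my first example (search for any character in the set { newline }) and no match is found in my second example (search for any character in the set { \ n }).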
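
The difference between the two strings can also be inspected character by character. This is a sketch rather than part of the original answer; lf and bs_n are illustrative names, and the default regex engine is assumed:

```r
# "\n" is a single character: line feed.
lf <- utf8ToInt("\n")
# "\\n" is two characters: a backslash and the letter n.
bs_n <- utf8ToInt("\\n")
print(lf)
print(bs_n)

# Inside [...] the sequence \n is not turned into a newline,
# so the string with a real newline (and no letter n) is not matched...
print(grepl("[\\n]", "hallo\nworld"))
# ...while a string containing the letter n is.
print(grepl("[\\n]", "banana"))
```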

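This is not a bug; it is entirely expected behavior. On that note, to be really and absolutely sure, why not ask on R-help or R-devel?

Answer 3: (score: 3)

This is indeed due to the "double escaping" that lebatsnok mentions. As Peter Dalgaard wrote on R-help, "backslash is the escape character for both R strings and regular expressions." See https://stat.ethz.ch/pipermail/r-help/2003-August/037524.html. See also the note on doubling backslashes in ?regex, although to me it is not as clear as Dalgaard's comment.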

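So \n becomes a newline in the first pass and stays that way in the second.

\\n becomes \n in the first pass (\\ -> \) and then becomes a newline in the second pass.

\\\n becomes a \ followed by a newline in the first pass, which is evidently just a newline in the second pass, since it matches as well.

As for the question of why a, f, n, t and r are allowed after a backslash but d, w and s are not, note that these are specific metacharacters with special meaning, as described in the documentation:

The current implementation interprets '\a' as 'BEL', '\e' as 'ESC', '\f' as 'FF', '\n' as 'LF', '\r' as 'CR' and '\t' as 'TAB'.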
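
The asymmetry can be seen at the parser level. The sketch below assumes that R's parser rejects string literals with unrecognized escapes such as "\d"; parses() is a hypothetical helper written for this example, not an R built-in:

```r
# Source text containing the string literal "\d" (one backslash).
bad_src  <- 'grepl("\\d", "a1")'
# Source text containing the string literal "\\d" (two backslashes).
good_src <- 'grepl("\\\\d", "a1")'

# Hypothetical helper: TRUE if R can parse the source text, FALSE otherwise.
parses <- function(src) {
  tryCatch({
    parse(text = src)
    TRUE
  }, error = function(e) FALSE)
}

print(parses(bad_src))    # the parser stops before grepl is ever called
print(parses(good_src))

# With the doubled backslash the regex engine receives \d and can match a digit.
print(eval(parse(text = good_src)))
```

So pattern="\d" fails in the R parser, before the regex engine sees anything, whereas "\n" is a valid string escape and passes through both layers.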

Answer 4: (score: -1)

Because the string hallo\nworld, when parsed, contains the literal text \n as well as the line feed character.

If your string were actually:

hallo
world

then it would match only \n and not \\n.