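I have two Perl programs that use the same library to process documents. They are installed on two different servers, one running Perl 5.12 and the other running Perl 5.18.
For now I am feeding the same files to both so that I can diff the outputs and make sure they match. I have had hundreds of identical results. They usually process UTF-8 files, and I have been careful to handle that encoding correctly.
Today they both received a binary file, and for the first time I saw a difference. I determined that one program (the one running Perl 5.18) strips vertical tabs from the file contents before output, while the other does not.
I could write this off as binary files not being supported, but it still bothers me that the two differ. I looked at the library doing the processing, and it contains this line (every line of the file is processed this way):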
$line =~ s/\s//g;
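Is it possible that one of the Perls considers the vertical tab to be whitespace and the other does not? How can I check? Is there anything else you think I should look into?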
Answer 0 (score: 7)
As of 5.18, vertical tabs are considered whitespace. From perl5180delta:
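No one could recall why \s didn't match \cK, the vertical tab. Now it does. Given the extreme rarity of that character, very little breakage is expected. That said, here's what it means:
\s in a regex now matches a vertical tab in all circumstances.
Literal vertical tabs in a regex literal are ignored when the /x modifier is used.
Leading vertical tabs, alone or mixed with other whitespace, are now ignored when interpreting a string as a number. For example: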
$dec = " \cK \t 123";
$hex = " \cK \t 0xF";
say 0 + $dec;   # was 0 with warning, now 123
say int $dec;   # was 0, now 123
say oct $hex;   # was 0, now 15
This brings Perl into line with Unicode, which considers U+000B LINE TABULATION (also known as VERTICAL TABULATION, or VT) a White_Space character.
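To answer the "how can I check?" part, here is a minimal sketch you could run on each server. The helper name s_matches_vt is made up for illustration; it only tests whether \s matches a single vertical tab on whichever perl executes it.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if \s matches a vertical tab (U+000B)
# on the perl that is running this script, 0 otherwise.
sub s_matches_vt {
    return "\x0B" =~ /\s/ ? 1 : 0;
}

# $^V is the running perl's version; %vd prints it as e.g. 5.18.0.
printf "perl %vd: \\s %s the vertical tab\n",
    $^V, s_matches_vt() ? "matches" : "does not match";
```

Going by the perldelta entry above, this should report "does not match" on the 5.12 server and "matches" on the 5.18 one.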
You can restore the old behaviour by replacing \s with [^\S\x0B].
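As a sketch of that replacement applied to the line from the question (the wrapper name strip_ws_keep_vt is hypothetical, not something from your library): the class [^\S\x0B] matches any character that is whitespace but is not the vertical tab, so it should behave the same on both versions.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the library's substitution. It removes
# everything \s matches except the vertical tab (\x0B), which is what
# \s did before 5.18.
sub strip_ws_keep_vt {
    my ($line) = @_;
    $line =~ s/[^\S\x0B]//g;
    return $line;
}

my $out = strip_ws_keep_vt("a \t b\x0Bc\n");

# Make the vertical tab visible before printing.
(my $shown = $out) =~ s/\x0B/\\x0B/g;
print "$shown\n";   # prints: ab\x0Bc
```

On 5.12 the \x0B inside the class is redundant, because \S already matches the vertical tab there, so the same source can be deployed to both servers.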
Also worth considering is \h, which matches only horizontal whitespace characters.
U+0009 CHARACTER TABULATION Matched by \s & \h
U+000A LINE FEED Matched by \s & \v
U+000B LINE TABULATION Matched by \s & \v
U+000C FORM FEED Matched by \s & \v
U+000D CARRIAGE RETURN Matched by \s & \v
U+0020 SPACE Matched by \s & \h
U+0085 NEXT LINE Matched by \s & \v
U+00A0 NO-BREAK SPACE Matched by \s & \h
U+1680 OGHAM SPACE MARK Matched by \s & \h
U+2000 EN QUAD Matched by \s & \h
U+2001 EM QUAD Matched by \s & \h
U+2002 EN SPACE Matched by \s & \h
U+2003 EM SPACE Matched by \s & \h
U+2004 THREE-PER-EM SPACE Matched by \s & \h
U+2005 FOUR-PER-EM SPACE Matched by \s & \h
U+2006 SIX-PER-EM SPACE Matched by \s & \h
U+2007 FIGURE SPACE Matched by \s & \h
U+2008 PUNCTUATION SPACE Matched by \s & \h
U+2009 THIN SPACE Matched by \s & \h
U+200A HAIR SPACE Matched by \s & \h
U+2028 LINE SEPARATOR Matched by \s & \v
U+2029 PARAGRAPH SEPARATOR Matched by \s & \v
U+202F NARROW NO-BREAK SPACE Matched by \s & \h
U+205F MEDIUM MATHEMATICAL SPACE Matched by \s & \h
U+3000 IDEOGRAPHIC SPACE Matched by \s & \h
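If you want to reproduce rows of this table on a particular perl, something along these lines should work. The helper classes_for is made up for illustration. U+0085 and U+00A0 are deliberately left out of the loop, since whether \s matches them can depend on whether the string is being treated with Unicode rules.

```perl
use strict;
use warnings;

# Hypothetical helper: lists which of \s, \h and \v match a code point
# on the running perl.
sub classes_for {
    my ($cp) = @_;
    my $ch = chr $cp;
    my @hits;
    push @hits, '\s' if $ch =~ /\s/;
    push @hits, '\h' if $ch =~ /\h/;
    push @hits, '\v' if $ch =~ /\v/;
    return join ' & ', @hits;
}

for my $cp (0x09, 0x0A, 0x0B, 0x20, 0x2028, 0x3000) {
    printf "U+%04X  %s\n", $cp, classes_for($cp);
}
```

The row for U+000B is the one that should differ between your two servers: \v alone before 5.18, \s & \v from 5.18 on.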