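I just lost an hour to a Perl one-liner because the file had CRLF line endings. The regex had a capture group at the end of the line, so the CR was included in the match, and substituting with the backreference produced bad output.
I ended up specifying CRLF manually in the regex, but is there a way to get Perl to handle line endings automatically, whatever they are?
The original command was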
perl -pe 's/foo bar(.*)$/foo $1 bar/g' file.txt
The "correct" command is
perl -pe 's/foo bar(.*)\r\n/foo $1 bar\r\n/g' file.txt
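I know I could also convert the line endings before processing; I'm interested in how to get Perl to handle this case gracefully.
Sample file (save it with CRLF line endings!)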
[19:06:57.033] foo barmy
[19:06:57.033] foo baryour
Expected output
[19:06:57.033] foo my bar
[19:06:57.033] foo your bar
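Output with the original command (bar ends up at the start of the line because the capture includes the carriage return, which moves the cursor back to column 0 before " bar" is printed):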
bar:06:57.033] foo my
bar:06:57.033] foo your
Answer 0: (score: 6)
First, let's remember that
perl -ple's/foo bar(.*)\z/foo $1 bar/g' file.txt
is close to being shorthand for
perl -e'
while (<>) {
chomp;
s/foo bar(.*)\z/foo $1 bar/g;
print $_, $/;
}
' file.txt
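Perl makes it possible for code to read and write local text files in a platform-independent manner.
In a comment, you asked how to read and write both local and foreign text files in a platform-independent manner.
First, you have to disable Perl's normal line-ending handling.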
binmode STDIN;
binmode STDOUT;
Then you have to handle the multiple possible line endings yourself.
sub mychomp { (@_ ? $_[0] : $_) =~ s/(\s*)\z//; $1 }
while (<STDIN>) {
my $le = mychomp($_);
s/foo bar(.*)\z/foo $1 bar/g;
print($_, $le);
}
So instead of
perl -ple's/foo bar(.*)\z/foo $1 bar/g' file.txt
you'd have
perl -e'
sub mychomp { (@_ ? $_[0] : $_) =~ s/(\s*)\z//; $1 }
binmode STDIN;
binmode STDOUT;
while (<STDIN>) {
my $le = mychomp($_);
s/foo bar(.*)\z/foo $1 bar/g;
print($_, $le);
}
' <file
Answer 1: (score: 4)
In newer perls (5.10 and later), you can use \R in your regex to match, and so strip, any line ending (including \n, \r, and \r\n). See perldoc perlre.
Answer 2: (score: 1)
You could say:
perl -pe 's/foo bar([^\015]*)(\015?\012)/foo $1 bar$2/g' *.txt
This preserves the line endings, i.e. they stay the same as in the input file.
You might also want to refer to perldoc perlport.
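Answer 3: (score: 1)
Is there a way to make Perl handle platform-specific line endings automatically?
Yes. That's actually the default.
The problem is that you're trying to process Windows line endings on a Unix platform.
This will definitely do it: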
perl -pe'
BEGIN {
binmode STDIN, ":crlf";
binmode STDOUT, ":crlf";
}
s/foo bar(.*)$/foo $1 bar/g;
' <file.txt
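Might I suggest that you simply keep handling it manually?
Alternatively, you could convert the file to a native text file, process it, and convert it back.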
<file.orig dos2unix | perl -pe'...' | unix2dos >file.new
Answer 4: (score: 1)
The \R escape sequence (Perl v5.10+; see perldoc perlrebackslash or the documentation online), which matches "generic newlines" platform-agnostically, can be made to work here (the example uses Bash to create the multi-line input string):
$ printf 'foo barmy\r\nfoo baryour\r\n' | perl -pe 's/foo bar(.*?)\R/foo $1 bar\n/gm'
foo my bar
foo your bar
Note that the only difference from Ether's answer is the use of a non-greedy construct ((.*?) rather than just (.*)), which makes all the difference here.
Read on if you want to know more.
Background:
It is an example of a pitfall associated with \R, which stems from the fact that it can match one or two characters: either \r\n or, typically, \n.[1]
With the greedy (.*) construct, "my\r" (including the \r) is captured: . matches any character except \n, so .* consumes the \r as well, and the remaining \n by itself also satisfies \R.
By contrast, using the non-greedy (.*?) construct causes \R to match the \r\n sequence, as intended.
[1] \R matches MORE than just \r\n and \n: besides the \r\n sequence, it matches any single character that is classified as vertical whitespace in Unicode terms, which also includes \x0B (vertical tab), \f (form feed), \r (by itself), and the following Unicode chars: U+0085 (NEXT LINE), U+2028 (LINE SEPARATOR), and U+2029 (PARAGRAPH SEPARATOR).