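How to make a perl one-liner "line ending agnostic"

Time: 2013-10-30 12:34:55

Tags: regex perl newline

I was stuck on a perl one-liner for an hour because the file had CRLF line endings. It had a regex with a capture group at the end of the line, and the CR was included in the match, so using the backreference in the replacement did bad things.

I ended up specifying the CRLF manually in the regex, but is there a way to make perl handle line endings automatically, whatever they are?

The original command was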

perl -pe  's/foo bar(.*)$/foo $1 bar/g' file.txt

The "correct" command is

perl -pe  's/foo bar(.*)\r\n/foo $1 bar\r\n/g' file.txt

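I know that I could also convert the line endings before processing; I am interested in how to get Perl to handle this case gracefully.

Example file (save it with CRLF line endings!)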

[19:06:57.033] foo barmy
[19:06:57.033] foo baryour

Expected output

[19:06:57.033] foo my bar
[19:06:57.033] foo your bar

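Output using the original command (bar appears at the start of the line because the carriage return was captured along with the match):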

bar:06:57.033] foo my
bar:06:57.033] foo your

5 answers:

Answer 0: (score: 6)

First, let's keep in mind that

perl -ple's/foo bar(.*)\z/foo $1 bar/g' file.txt

is short for something close to
perl -e'
   while (<>) {
      chomp;
      s/foo bar(.*)\z/foo $1 bar/g;
      print $_, $/;
   }
' file.txt

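Perl makes it possible to write code that reads and writes local text files in a platform-independent way.

In the comments, you asked how to read and write both local text files and foreign text files in a platform-independent way.

First, you have to disable Perl's normal line-ending handling.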

binmode STDIN;
binmode STDOUT;

Then you have to handle the various line endings yourself.

sub mychomp { (@_ ? $_[0] : $_) =~ s/(\s*)\z//; $1 }

while (<STDIN>) {
   my $le = mychomp($_);
   s/foo bar(.*)\z/foo $1 bar/g;
   print($_, $le);
}

So instead of

perl -ple's/foo bar(.*)\z/foo $1 bar/g' file.txt

you would have

perl -e'
   sub mychomp { (@_ ? $_[0] : $_) =~ s/(\s*)\z//; $1 }

   binmode STDIN;
   binmode STDOUT;
   while (<STDIN>) {
      my $le = mychomp($_);
      s/foo bar(.*)\z/foo $1 bar/g;
      print($_, $le);
   }
' <file

Answer 1: (score: 4)

In newer perls (5.10 and later), you can use \R in your regex to match any line ending (it covers \n, \r and \r\n). See perldoc perlre

Answer 2: (score: 1)

You could say:

perl -pe 's/foo bar([^\015]*)(\015?\012)/foo $1 bar$2/g' *.txt

This preserves the line endings, i.e. the output uses the same endings as the input file.


You might also want to refer to perldoc perlport

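Answer 3: (score: 1)

Is there a way to make perl handle platform-specific line endings automatically?

Yes. That is actually the default.

The problem is that you are trying to process Windows line endings on a unix platform.

This will definitely do it: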

perl -pe'
    BEGIN {
       binmode STDIN,  ":crlf";
       binmode STDOUT, ":crlf";
    }
    s/foo bar(.*)$/foo $1 bar/g;
' <file.txt

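May I suggest that you simply keep doing it by hand?

Alternatively, you could convert the file to a local text file, process it, and convert it back.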

<file.orig dos2unix | perl -pe'...' | unix2dos >file.new

Answer 4: (score: 1)

The \R escape sequence (Perl v5.10+; see perldoc perlrebackslash or the documentation online), which matches "generic newlines" platform-agnostically, can be made to work here (the example uses Bash to create the multi-line input string):

$ printf 'foo barmy\r\nfoo baryour\r\n' | perl -pe 's/foo bar(.*?)\R/foo $1 bar\n/gm'
foo my bar
foo your bar

Note that the only difference to Ether's answer is use of a non-greedy construct (.*? rather than just .*), which makes all the difference here.

Read on, if you want to know more.


Background:

It is an example of a pitfall associated with \R, which stems from the fact that it can match one or two characters - either \r\n or, typically, \n:[1]

With the greedy (.*) construct, "my\r" (including the \r) is captured: . matches \r but not \n, so .* consumes everything up to the \n, and the remaining \n by itself also satisfies \R.

By contrast, using the non-greedy (.*?) construct causes \R to match the \r\n sequence, as intended.

[1] \R matches MORE than just \r\n and \n: it also matches any single character that is classified as vertical whitespace in Unicode terms (the set Perl matches with \v), which includes vertical tab (\x0B), form feed (\f), \r (by itself), and the following Unicode chars: U+0085 (NEXT LINE), U+2028 (LINE SEPARATOR) and U+2029 (PARAGRAPH SEPARATOR)