Regex to remove newlines that are not followed by a specific string

Time: 2019-01-18 19:19:36

Tags: regex shell perl regex-negation regex-lookarounds

I have a delimited data file in which I need to clean up user input. Specifically:

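  1. The free-text fields contain embedded newlines that I want to remove.
  2. The number of columns can vary from one line to the next.
  3. The first field of every line should always start with the pattern "INC\d{12} (the leading double quote is part of the pattern).
  4. Every \n that is not immediately followed by the pattern "INC\d{12} should be replaced with a single space.
  5. I am currently using Perl in Cygwin (preferred), but awk or sed answers are fine too.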

Here is some mock input data (I saved it to a file named test_input_so.txt):

"INC000111111111", "field2", "field3"

"INC000222222222", "field2", "field3","INC000123456789 blahblah"



"INC000444444444", "fie"""ld2", "field3"
"INC000123

456789", "field2", "field3",
"INC000333333333", "INC000123456789", "field3""
"INC000555555555", "field2", "fiel
d3","field4"

Here is the desired output for the data above:

"INC000111111111", "field2", "field3"    
"INC000222222222", "field2", "field3","INC000123456789 blahblah"
"INC000444444444", "fie"""ld2", "field3"
"INC000123456789", "field2", "field3",
"INC000333333333", "INC000123456789", "field3""
"INC000555555555", "field2", "field3","field4"

I have tried several negative lookahead/lookbehind combinations, but I am not sure why they do not work.

Here is one example:

perl -pe 's/\n(?!"INC\d{12})/ /g;' test_input_so.txt 

It strips every \n, including, incorrectly, the ones followed by the pattern, which should be left in place.

3 answers:

Answer 0: (score: 2)

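perl -pe ... processes the input one line at a time, so a regex that is meant to span lines cannot help: when the substitution runs, the \n is the last character of the string and the lookahead never sees the start of the next line.

Perl's -0 switch changes the input record separator (Perl's notion of what a line is). With -0777, Perl reads the entire input as a single string, so the substitution can operate on all of it at once.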

perl -0777 -pe 's/\n(?!"INC\d{12})/ /g;' test_input_so.txt
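
The difference between the two invocations can be reproduced outside Perl. The following is a minimal Python sketch, not part of the original answer: Python's re module accepts the same negative-lookahead syntax, and the two-record sample is made up for illustration.

```python
import re

# Same pattern as in the question: a newline NOT followed by the record key.
STRAY_NEWLINE = re.compile(r'\n(?!"INC\d{12})')

# Illustrative two-record sample; the second record has an embedded newline.
data = (
    '"INC000111111111", "field2", "field3"\n'
    '"INC000555555555", "field2", "fiel\n'
    'd3","field4"\n'
)

# Line at a time, like `perl -pe`: each chunk ends with its own "\n" and
# nothing follows it, so the negative lookahead always succeeds.
per_line = "".join(
    STRAY_NEWLINE.sub(" ", line) for line in data.splitlines(keepends=True)
)

# Whole input at once, like `perl -0777 -pe`: the lookahead can see the
# beginning of the next record and leaves that newline alone.
slurped = STRAY_NEWLINE.sub(" ", data)

print(repr(per_line))
print(repr(slurped))
```

Two side effects show up in the whole-input result: the final newline is also replaced by a space, and the joined field reads `fiel d3` rather than `field3` as in the desired output (substituting an empty string instead of a space would give the latter). On the full sample, the line beginning "INC000123 does not match "INC\d{12} because its key is itself split by newlines, so the newline in front of it would be replaced as well.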

Answer 1: (score: 2)

First, you need to fix some stray quotes so that your data is valid CSV:

  • Line 7: "fie"""ld2" must be "fie""ld2"
  • Line 11: ends with two double quotes; it should end with one

Second, do not put a space after the commas between fields: write a,b rather than a, b.

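I think what you really want to do is remove the newlines inside quoted fields. Once the issues above are fixed, you can do that with the Text::CSV module. The structure of this code is taken from the Text::CSV perldoc: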

perl -MText::CSV -E '
    # binary => 1 allows embedded newlines inside quoted fields;
    # always_quote => 1 quotes every field on output.
    my $csv = Text::CSV->new ({ binary => 1, always_quote => 1 })
                   or die "Cannot use CSV: ".Text::CSV->error_diag ();

    my $file = shift @ARGV;
    open my $fh, "<:encoding(utf8)", $file or die "$file: $!";
    while ( my $row = $csv->getline( $fh ) ) {
        my @row = map {s/\n//g; $_} @$row;   # drop newlines inside each field
        $csv->combine(@row);
        my $line = $csv->string();
        say $line if $line ne q{""};         # skip blank input lines
    }
    $csv->eof or $csv->error_diag();
    close $fh;
' test_input_so.txt
"INC000111111111","field2","field3"
"INC000222222222","field2","field3","INC000123456789 blahblah"
"INC000444444444","fie""ld2","field3"
"INC000123456789","field2","field3",""
"INC000333333333","INC000123456789","field3"
"INC000555555555","field2","field3","field4"

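For comparison, the same parse-then-clean idea can be sketched with Python's standard csv module. This is an illustration under assumptions, not the answer's code: the sample is a shortened, already-repaired version of the data (doubled quotes inside the field, no space after the commas), and QUOTE_ALL is used as a stand-in for always_quote.

```python
import csv
import io

# Assumed, already-repaired sample: valid CSV with embedded newlines.
fixed = (
    '"INC000444444444","fie""ld2","field3"\n'
    '"INC000123\n'
    '\n'
    '456789","field2","field3",\n'
    '\n'
    '"INC000555555555","field2","fiel\n'
    'd3","field4"\n'
)

out = io.StringIO()
writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")

# newline="" keeps the embedded newlines intact for the csv reader.
for row in csv.reader(io.StringIO(fixed, newline="")):
    if not row:  # a blank line between records parses as an empty row
        continue
    # Remove the newlines that were embedded in quoted fields.
    writer.writerow([field.replace("\n", "") for field in row])

print(out.getvalue(), end="")
```

As in the Perl output above, the trailing comma on the second record becomes an explicit empty field.
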
Answer 2: (score: 0)

Another Perl solution:

$  perl -0777 -ne ' while( /(^"INC00.+?)(\n"INC.*|\Z)/msg ) { $x=$1;$_=$2; $x=~s/\n//g; print "$x\n" } ' test_input_so.txt
"INC000111111111", "field2", "field3"
"INC000222222222", "field2", "field3","INC000123456789 blahblah"
"INC000444444444", "fie"""ld2", "field3"
"INC000123456789", "field2", "field3",
"INC000333333333", "INC000123456789", "field3""
"INC000555555555", "field2", "field3","field4"
$

Input:

$ cat test_input_so.txt
"INC000111111111", "field2", "field3"

"INC000222222222", "field2", "field3","INC000123456789 blahblah"



"INC000444444444", "fie"""ld2", "field3"
"INC000123

456789", "field2", "field3",
"INC000333333333", "INC000123456789", "field3""
"INC000555555555", "field2", "fiel
d3","field4"

$
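
The record-by-record idea behind the one-liner above can also be sketched in Python. This is an illustration rather than a translation of the Perl; `join_records` and `RECORD_START` are hypothetical names. It assumes that any line beginning with "INC starts a new record, which is looser than "INC\d{12} so that a key split by a newline still counts. A free-text line that happens to begin with "INC would therefore be split off wrongly.

```python
import re

# Assumption: a newline followed by "INC marks the start of a new record.
RECORD_START = re.compile(r'\n(?="INC)')

def join_records(text):
    """Return one string per record, with embedded newlines removed."""
    records = RECORD_START.split(text)
    # Drop chunks that are empty or whitespace only, then remove the
    # newlines left inside each record (no space is inserted).
    return [rec.replace("\n", "") for rec in records if rec.strip()]

# Shortened version of the question's sample data.
sample = "\n".join([
    '"INC000111111111", "field2", "field3"',
    '',
    '"INC000444444444", "fie"""ld2", "field3"',
    '"INC000123',
    '',
    '456789", "field2", "field3",',
    '"INC000555555555", "field2", "fiel',
    'd3","field4"',
    '',
])

for record in join_records(sample):
    print(record)
```

Unlike the CSV-based approach, this leaves the stray quotes and the spaces after commas exactly as they were in the input.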