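Remove extra commas from a comma-delimited file

Time: 2013-11-25 16:49:09

Tags: regex perl csv sed awk

I have a comma-delimited file with 12 columns.

Columns 5 and 6 are the problem: the text in both columns is identical, but it may contain extra commas.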

 2011,123456,1234567,12345678,Hey There,How are you,Hey There,How are you,882864309037,ABC   ABCD,LABACD,1.00000000,80.2500000,One Two

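So in the example above, "Hey There,How are you" should not contain a comma.

I need to remove the extra commas in columns 5 and 6.

3 answers:

Answer 0 (score: 4)

If you want to remove the 5th comma, try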

sed 's/,//5' input.txt

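But you say there may be additional commas. You have to provide some logic to determine whether a line contains extra commas, and how many.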
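One possible rule, sketched here in Python purely for illustration: the file should have 12 columns, so any fields beyond 12 come from extra commas, and because columns 5 and 6 hold the same text, each of them should account for half of the surplus. The function name extra_commas_per_column and the expected_fields parameter are assumptions made for this sketch, not something from the question.

```python
def extra_commas_per_column(line, expected_fields=12):
    """Return how many surplus commas each of columns 5 and 6 contains.

    Assumes every surplus comma sits in column 5 or 6 and that both
    columns hold identical text, so the surplus splits evenly.
    """
    surplus = len(line.rstrip("\n").split(",")) - expected_fields
    if surplus < 0 or surplus % 2:
        raise ValueError("line does not match the assumed layout")
    return surplus // 2


sample = ("2011,123456,1234567,12345678,Hey There,How are you,"
          "Hey There,How are you,882864309037,ABC   ABCD,LABACD,"
          "1.00000000,80.2500000,One Two")
print(extra_commas_per_column(sample))  # 1
```

A line that already has exactly 12 fields returns 0; a line with an odd surplus is rejected because it cannot be split evenly between the two columns.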

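If you know how many commas there should be, you can use awk. This turned out to be quite a nice exercise. I'm sure others will come up with a more elegant solution, but I'll share mine anyway: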

awk -f script.awk input.txt

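with script.awk:

BEGIN {
    FS = ","
}
# Lines with at most 12 fields need no repair.
NF <= 12 {
    print $0
}
# More than 12 fields: columns 5 and 6 each gained (NF-12)/2 commas.
NF > 12 {
    extra = (NF - 12) / 2
    for (i = 1; i <= 4; i++) printf "%s", $i FS
    # Print the pieces of column 5 twice, joined with "_" instead of ",".
    for (j = 0; j < 2; j++) {
        for (i = 0; i <= extra; i++) {
            printf "%s", $(i + 5)
            if (i < extra) printf "_"
            else printf FS
        }
    }
    # Print the last six fields without a trailing separator.
    for (i = NF - 5; i < NF; i++) printf "%s", $i FS
    printf "%s\n", $NF
}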

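First, we set the field separator to a comma. If a line has 12 fields or fewer, everything is fine and we simply print the whole line. If there are more than 12 fields, we first print the first 4 fields (each followed by the field separator), then we print the pieces of field 5 twice (once for column 5 and once for column 6, since both hold the same text), but instead of printing the , between the pieces we replace it with _. Finally, we print the remaining fields.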
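The same steps can be sketched in Python for readers who do not use awk. This is only an illustration under the assumptions stated in the question (12 expected columns, all surplus commas in columns 5 and 6, split evenly between them). The names fix_line, expected_fields, and joiner are hypothetical. Unlike the awk script, this sketch rebuilds column 6 from its own pieces rather than printing column 5 twice.

```python
def fix_line(line, expected_fields=12, joiner="_"):
    """Merge the pieces of columns 5 and 6, replacing inner commas with joiner."""
    fields = line.rstrip("\n").split(",")
    surplus = len(fields) - expected_fields
    if surplus <= 0:
        return ",".join(fields)          # nothing to repair
    if surplus % 2:
        raise ValueError("surplus commas cannot be split evenly")
    width = surplus // 2 + 1             # number of pieces per column
    col5 = joiner.join(fields[4:4 + width])
    col6 = joiner.join(fields[4 + width:4 + 2 * width])
    return ",".join(fields[:4] + [col5, col6] + fields[4 + 2 * width:])


sample = ("2011,123456,1234567,12345678,Hey There,How are you,"
          "Hey There,How are you,882864309037,ABC   ABCD,LABACD,"
          "1.00000000,80.2500000,One Two")
print(fix_line(sample))
```

For the sample line from the question, columns 5 and 6 both come out as Hey There_How are you and the line is back to 12 fields.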

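As I said, there is probably a more elegant solution. I'm curious to see what others come up with.

Answer 1 (score: 2)

If all the other fields are numeric, you can try to preserve the useful commas based on that condition.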

   sed -r 's/(,)[0-9]/;/g' a | sed -r 's/[0-9](,)/;/g' |  sed -r 's/,//g' |  awk -F\; '{ print $1 "," $2 "," $3 "," $4 "," substr($5, 0, length($5)/2) "," substr($5, length($5)/2 +1, length($5)/2) "," $6 "," $7}'
2011,23456,234567,234567,Hey ThereHow are you,Hey ThereHow are you,8286430903,

Answer 2 (score: 1)

You can try using perl and its Text::CSV_XS module:

#!/usr/bin/env perl

use warnings;
use strict;
use Text::CSV_XS;

my (@columns);

open my $fh, '<', shift or die;

my $csv = Text::CSV_XS->new or die;
while ( my $row = $csv->getline( $fh ) ) { 
    undef @columns;
    if ( @$row <= 12 ) { 
        @columns = @$row;
        next;
    }   

    # Number of extra commas in each of the two duplicated columns.
    my $extra_columns = ( @$row - 12 ) / 2;
    # Index of the first field after both copies; each copy spans $extra_columns + 1 fields.
    my $post_columns_index = 4 + 2 * ( $extra_columns + 1 );
    @columns = ( 
        @$row[0..3], 
        (join( '', @$row[4..(4+$extra_columns)] )) x 2,  
        @$row[$post_columns_index..$#$row] 
    );  
}
continue {
    $csv->print( \*STDOUT, \@columns );
    print "\n";
}

Assume the input file (infile) has three lines: in the first, each of the two text columns contains one extra comma; in the second, each contains two extra commas; the third is correct:

2011,123456,1234567,12345678,Hey There,How are you,Hey There,How are you,882864309037,ABC   ABCD,LABACD,1.00000000,80.2500000,One Two
2011,123456,1234567,12345678,Hey There,How are you,now,Hey There,How are you,now,882864309037,ABC   ABCD,LABACD,1.00000000,80.2500000,One Two
2011,123456,1234567,12345678,Hey There:How are you,Hey There:How are you,882864309037,ABC   ABCD,LABACD,1.00000000,80.2500000,One Two

Run the script like this:

perl script.pl infile

That yields:

2011,123456,1234567,12345678,"Hey ThereHow are you","Hey ThereHow are you",882864309037,"ABC   ABCD",LABACD,1.00000000,80.2500000,"One Two"
2011,123456,1234567,12345678,"Hey ThereHow are younow","Hey ThereHow are younow",882864309037,"ABC   ABCD",LABACD,1.00000000,80.2500000,"One Two"
2011,123456,1234567,12345678,"Hey There:How are you","Hey There:How are you",882864309037,"ABC   ABCD",LABACD,1.00000000,80.2500000,"One Two"

Note that it adds some quotes, but that is correct according to the CSV specification, and the result is easier to work with than the previous state.
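To illustrate why the added quotes are harmless, here is a small sketch using Python's standard csv module (not the Perl module used in the answer). The shortened three-field row is made up for the illustration. With its default settings, Python's writer quotes only fields that contain the delimiter, the quote character, or a line break, so a field containing only spaces stays unquoted here.

```python
import csv
import io

# Hypothetical shortened row: the middle field keeps its comma.
row = ["2011", "Hey There,How are you", "One Two"]

# Write the row; the field containing a comma is quoted automatically.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerow(row)
text = buffer.getvalue()
print(text, end="")  # 2011,"Hey There,How are you",One Two

# Reading the line back strips the quotes and restores the original fields.
parsed = next(csv.reader(io.StringIO(text)))
print(parsed == row)  # True
```

Because a CSV-aware reader treats the quoted field as a single value, the embedded comma could even be kept instead of removed, as long as every tool that later reads the file parses CSV quoting.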