I have a comma-separated file with 12 columns.
Columns 5 and 6 are a problem: they contain the same text, but that text may include extra commas.
2011,123456,1234567,12345678,Hey There,How are you,Hey There,How are you,882864309037,ABC ABCD,LABACD,1.00000000,80.2500000,One Two
So in the example above, "Hey There,How are you" should not contain a comma.
I need to remove the extra commas from columns 5 and 6.
Answer 0 (score: 4)
If you just want to delete the 5th comma, try
sed 's/,//5' input.txt
But you said there may be extra commas, so you need to provide a rule for deciding whether a line has extra commas.
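One possible rule, sketched here in Python, is to count the fields: the file should have 12 columns, so any surplus must come from columns 5 and 6, split evenly between them because they hold the same text. That even split is an assumption drawn from the question, and the name extra_commas_per_column and the expected parameter are made up for illustration.

```python
def extra_commas_per_column(line, expected=12):
    """Return how many extra commas each of columns 5 and 6 contains.

    Assumes all surplus fields come from columns 5 and 6 and that both
    columns hold identical text, so the surplus must be even.
    """
    surplus = len(line.rstrip("\n").split(",")) - expected
    if surplus < 0 or surplus % 2 != 0:
        raise ValueError("line does not fit the assumed layout")
    return surplus // 2


sample = ("2011,123456,1234567,12345678,Hey There,How are you,"
          "Hey There,How are you,882864309037,ABC ABCD,LABACD,"
          "1.00000000,80.2500000,One Two")
print(extra_commas_per_column(sample))  # 14 fields instead of 12, prints 1
```

A line that yields an odd surplus does not fit this rule and would need to be inspected by hand.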
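If you know how many commas a line should have, you can use awk. This turned out to be quite a nice exercise. I'm sure someone else will come up with a more elegant solution, but I'll share mine anyway:
awk -f script.awk input.txt
with script.awk: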
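BEGIN {
    FS = ","
}
# Lines with 12 or fewer fields are already fine.
NF <= 12 {
    print $0
}
# Otherwise the surplus fields belong to columns 5 and 6.
NF > 12 {
    # The first four columns are unaffected.
    for (i = 1; i <= 4; i++) printf "%s%s", $i, FS
    # Print the text column twice, joining its pieces with "_".
    for (j = 0; j < 2; j++) {
        for (i = 0; i <= (NF - 12) / 2; i++) {
            printf "%s", $(i + 5)
            if (i < (NF - 12) / 2) printf "_"
            else printf FS
        }
    }
    # The last six columns are unaffected; end the line without a trailing separator.
    for (i = NF - 5; i <= NF; i++) printf "%s%s", $i, (i < NF ? FS : "\n")
}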
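First, we set the field separator to a comma (,). If a line has 12 or fewer fields, everything is fine and we simply print the whole line. If it has more than 12 fields, we print the first 4 fields (each followed by the field separator), then print field 5 twice (the second copy stands in for field 6), but instead of printing the comma inside it we replace it with an underscore (_). Finally, we print the remaining fields.
As I said, there is probably a more elegant solution. I'm curious what others come up with.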
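For readers who prefer to follow the logic outside awk, the same steps can be sketched in Python. This is only an illustration of the approach described above, not a replacement for the script. The names repair_line, LEADING and TRAILING are hypothetical, and the sketch assumes the surplus fields are split evenly between the two text columns.

```python
LEADING, TRAILING = 4, 6  # columns before and after the two text columns


def repair_line(line, joiner="_"):
    """Rejoin the pieces of column 5 and use the result for columns 5 and 6."""
    fields = line.rstrip("\n").split(",")
    surplus = len(fields) - (LEADING + 2 + TRAILING)
    if surplus <= 0:
        # 12 or fewer fields: nothing to repair.
        return ",".join(fields)
    # Each text column was split into this many pieces (assumes an even surplus).
    pieces = surplus // 2 + 1
    text = joiner.join(fields[LEADING:LEADING + pieces])
    # First four columns, the text twice, then the last six columns.
    return ",".join(fields[:LEADING] + [text, text] + fields[-TRAILING:])


sample = ("2011,123456,1234567,12345678,Hey There,How are you,"
          "Hey There,How are you,882864309037,ABC ABCD,LABACD,"
          "1.00000000,80.2500000,One Two")
print(repair_line(sample))
```

Passing joiner="" instead of the default removes the extra commas without inserting an underscore.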
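Answer 1 (score: 2)
If all the other fields are numeric, you can use that condition to preserve the commas that matter: keep every comma that touches a digit and delete the rest.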
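sed -E 's/,([0-9])/;\1/g; s/([0-9]),/\1;/g; s/,//g' a | awk -F';' '{ h = length($5) / 2; print $1 "," $2 "," $3 "," $4 "," substr($5, 1, h) "," substr($5, h + 1) "," $6 "," $7 }'
The back-references (\1) keep the digit next to each comma, which a plain replacement with ; would otherwise delete. On the sample line (saved as a) this prints the line below. The comma between the non-numeric fields ABC ABCD and LABACD is removed as well, which is why this only works when every other field is numeric:
2011,123456,1234567,12345678,Hey ThereHow are you,Hey ThereHow are you,882864309037,ABC ABCDLABACD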
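The same heuristic can be sketched in Python, where lookarounds mark the commas that touch a digit without consuming the digit itself. The function name repair_numeric_neighbours and the sample line are hypothetical. The sample is a shortened variant of the question's data in which every field other than the text really is numeric, as this approach requires.

```python
import re


def repair_numeric_neighbours(line):
    """Keep commas that touch a digit, drop the rest, then halve the text."""
    # Lookahead/lookbehind match the comma only, so no digit is lost.
    marked = re.sub(r",(?=\d)|(?<=\d),", ";", line.rstrip("\n"))
    fields = marked.replace(",", "").split(";")
    # Field 5 now holds both copies of the text back to back.
    doubled = fields[4]
    half = len(doubled) // 2
    fields[4:5] = [doubled[:half], doubled[half:]]
    return ",".join(fields)


# Hypothetical input: all fields except the text are numeric.
numeric_sample = ("2011,123456,1234567,12345678,Hey There,How are you,"
                  "Hey There,How are you,882864309037,1.00000000,80.2500000")
print(repair_numeric_neighbours(numeric_sample))
```

The heuristic also breaks if the text itself starts or ends with a digit, or has a digit next to one of its inner commas, so it is only safe when the data is known to avoid those cases.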
Answer 2 (score: 1)
You can try perl with its Text::CSV_XS module:
#!/usr/bin/env perl
use warnings;
use strict;
use Text::CSV_XS;
my (@columns);
open my $fh, '<', shift or die;
my $csv = Text::CSV_XS->new or die;
while ( my $row = $csv->getline( $fh ) ) {
    undef @columns;
    # Rows with 12 or fewer fields are printed unchanged.
    if ( @$row <= 12 ) {
        @columns = @$row;
        next;
    }
    # Number of extra commas in each of the two text columns.
    my $extra_columns = ( @$row - 12 ) / 2;
    # Index of the first field after both copies of the text.
    my $post_columns_index = 4 + 2 * ( $extra_columns + 1 );
    @columns = (
        @$row[0..3],
        (join( '', @$row[4..(4+$extra_columns)] )) x 2,
        @$row[$post_columns_index..$#$row]
    );
}
continue {
    $csv->print( \*STDOUT, \@columns );
    print "\n";
}
Suppose the input file (infile) has three lines: the first has one extra comma in each of the two text columns, the second has two, and the third is correct:
2011,123456,1234567,12345678,Hey There,How are you,Hey There,How are you,882864309037,ABC ABCD,LABACD,1.00000000,80.2500000,One Two
2011,123456,1234567,12345678,Hey There,How are you,now,Hey There,How are you,now,882864309037,ABC ABCD,LABACD,1.00000000,80.2500000,One Two
2011,123456,1234567,12345678,Hey There:How are you,Hey There:How are you,882864309037,ABC ABCD,LABACD,1.00000000,80.2500000,One Two
Run the script like this:
perl script.pl infile
It outputs:
2011,123456,1234567,12345678,"Hey ThereHow are you","Hey ThereHow are you",882864309037,"ABC ABCD",LABACD,1.00000000,80.2500000,"One Two"
2011,123456,1234567,12345678,"Hey ThereHow are younow","Hey ThereHow are younow",882864309037,"ABC ABCD",LABACD,1.00000000,80.2500000,"One Two"
2011,123456,1234567,12345678,"Hey There:How are you","Hey There:How are you",882864309037,"ABC ABCD",LABACD,1.00000000,80.2500000,"One Two"
Note that it adds some quotes, but that is valid according to the CSV specification, and the result is easier to process than the original.
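A similar idea can be sketched with Python's standard csv module. This is a variation rather than a port of the Perl script: instead of deleting the extra commas it keeps them inside the text and lets the writer quote the field, which is one way to stay valid CSV without losing information. The names repair_rows, LEADING and TRAILING are hypothetical, and the sketch assumes the surplus fields are split evenly between the two text columns.

```python
import csv
import io

LEADING, TRAILING = 4, 6  # columns before and after the two text columns


def repair_rows(lines):
    """Rebuild columns 5 and 6 and return the repaired CSV as a string."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(lines):
        surplus = len(row) - (LEADING + 2 + TRAILING)
        if surplus > 0:
            # Pieces per text column (assumes an even surplus).
            pieces = surplus // 2 + 1
            # Keep the comma inside the text; the writer quotes the field.
            text = ",".join(row[LEADING:LEADING + pieces])
            row = row[:LEADING] + [text, text] + row[-TRAILING:]
        writer.writerow(row)
    return out.getvalue()


sample = ("2011,123456,1234567,12345678,Hey There,How are you,"
          "Hey There,How are you,882864309037,ABC ABCD,LABACD,"
          "1.00000000,80.2500000,One Two")
print(repair_rows([sample]), end="")
```

With its default settings, Python's writer only quotes fields that contain a comma, a quote character, or a line break, so fields such as ABC ABCD are written without quotes here.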