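I would like help trimming a file by removing every row that has the same value in all columns except the first two.
The file I have (tab-delimited, with millions of rows and dozens of columns):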
Jack Mike Jones Dan Was
1 2 7 3 4
2 3 9 4 8
T T C T T
T M T T T
W A S I S
The file I want (rows whose cells all hold the same value, ignoring the first two columns, are removed):
Jack Mike Jones Dan Was
1 2 7 3 4
2 3 9 4 8
T T C T T
W A S I S
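Could you give me some hints on my problem? Thanks a lot.
I have seen several excellent awk, shell, and Perl scripts in a related question. Many thanks to those who helped.
Answer 0 (score: 3)
The simplest thing I could think of (half-joking :)):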
#!/usr/bin/perl
while (<>)
{
    my (undef, undef, @flds) = split;
    print if 1 < scalar keys %{{ map { $_ => 1 } @flds }};
}
It uses a temporary hash table to find the unique column values on each line. Here it is, spelled out:
while (<>)                                   # for each line
{
    # split the line into columns, discarding the first two
    my (undef, undef, @flds) = split;
    my %columns    = map { $_ => 1 } @flds;  # insert each value as a key into a hashtable
    my @uniq_cols  = keys %columns;          # get just the keys
    my $uniq_count = scalar @uniq_cols;      # count the keys
    print if 1 < $uniq_count;                # if count == 1, all columns are the same
}
To be more explicit, the 'map' call is roughly equivalent to the usual idiom:
# my %columns = map { $_ => 1 } @flds;
my %columns;
foreach my $fld (@flds)
{
    $columns{$fld}++;   # actually the map version does '$columns{$fld} = 1;' every time
}
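The same distinct-value count can also be sketched in awk for the tab-delimited file from the question. This is only a sketch, not part of the original script: it assumes fields are separated by single tabs, and the function name `drop_uniform_rows` is made up for illustration.

```shell
# Hypothetical helper: reads tab-separated rows on stdin and prints only
# the rows whose columns 3..NF hold more than one distinct value.
drop_uniform_rows() {
    awk -F'\t' '{
        split("", seen)              # clear the per-row lookup table
        distinct = 0
        for (i = 3; i <= NF; i++)
            if (!($i in seen)) {     # first time this value appears in the row
                seen[$i] = 1
                distinct++
            }
        if (distinct > 1)            # a count of 1 means every column matched
            print
    }'
}
```

It could be run as `drop_uniform_rows < input.tsv > trimmed.tsv`, where both file names are placeholders.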
HTH
Answer 1 (score: 3)
awk '{
    val=$3
    for (i=4; i<=NF; i++)
        if (val != $i) {
            print
            break
        }
}'
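As written, the script reads standard input and splits fields on any run of whitespace. For the tab-delimited file in the question, a possible variant sets the field separator with `-F'\t'` so that a field containing spaces stays in one piece. The wrapper name `filter_rows` and the file names are assumptions for illustration only.

```shell
# Hypothetical wrapper around the awk script above, using a tab field separator.
# With no arguments it reads stdin; otherwise it reads the named files.
filter_rows() {
    awk -F'\t' '{
        val = $3
        for (i = 4; i <= NF; i++)
            if (val != $i) {   # found a column that differs from column 3
                print          # keep the row ...
                break          # ... and stop scanning it
            }
    }' "$@"
}

# Example (placeholder file names): filter_rows input.tsv > trimmed.tsv
```

Breaking out of the loop at the first mismatch means rows that differ early are scanned only partially, which may help with millions of rows and dozens of columns.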
Answer 2 (score: 1)
Try this: perl -ne 'next if /^\w+\W+\w+\W+(\w+)(\W+\1)+\W*$/; print;'
That is, it matches:
^ beginning of line
\w+ first word
\W+ non-word (like spaces, tabs, etc)
\w+\W+ second word and spaces
(\w+) third word (and remember)
(\W+\1)+ spaces followed by a copy of the third word as many times as necessary
\W* optional trailing spaces
$ end of line