AB006589__ESR2,BC024181__ESR2,0.47796
AB006589__ESR2,X55739__CSN2,0.47232
AB006589__ESR2,NM_004991__MDS1,0.46704
AB006589__ESR2,NM_003476__CSRP3,0.45767
AB006589__ESR2,NM_012101__TRIM29,0.45094
AB006589__ESR2,NM_006897__HOXC9,0.41748
AB006589__ESR2,NM_000278__PAX2,0.4161
NM_003476__CSRP3,AB006589__ESR2,0.45767
NM_012101__TRIM29,AB006589__ESR2,0.45094
NM_006897__HOXC9,AB006589__ESR2,0.41748
NM_000278__PAX2,AB006589__ESR2,0.4161
Now, the problem is that line 4
AB006589__ESR2,NM_003476__CSRP3,0.45767
is a duplicate of line 8
NM_003476__CSRP3,AB006589__ESR2,0.45767
There are many cases like this in my large CSV file.
So my problem is to identify all of the duplicates and remove one of each pair somehow.
use strict;
use warnings;

my %hash;
open( my $tf, '<', 'tf_tf_mic.csv' ) or die "Cannot open tf_tf_mic.csv: $!";
while ( <$tf> ) {
    chomp;
    my @words = split /,/;
    # Keep the line only if this pair hasn't been seen in either order
    unless ( exists $hash{"$words[0]\t$words[1]"} || exists $hash{"$words[1]\t$words[0]"} ) {
        $hash{"$words[0]\t$words[1]"} = $_;
    }
}
# Note: hash key order is arbitrary, so output order differs from the input
foreach ( keys %hash ) {
    print "$hash{$_}\n";
}
This actually works in under 10 seconds on a 4-million-line file.
Answer 0 (score: 1)
You can reorder the fields of each line before putting it into the hash. Split each line into fields, drop the numeric value, sort the identifiers, and join them to form the key:

my @fields = split /,/;
pop @fields;
@fields = sort @fields;
my $str = join "\t", @fields;
$hash{$str} = $_ unless exists $hash{$str};
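
Put together, that approach might look like this (a minimal sketch, reusing the tf_tf_mic.csv filename and the hash-based dedup from the question):

use strict;
use warnings;

my %hash;
open( my $tf, '<', 'tf_tf_mic.csv' ) or die "Cannot open tf_tf_mic.csv: $!";
while ( <$tf> ) {
    chomp;
    my @fields = split /,/;
    pop @fields;                        # drop the numeric value from the key
    my $str = join "\t", sort @fields;  # canonical order for the identifier pair
    $hash{$str} = $_ unless exists $hash{$str};
}
print "$hash{$_}\n" for keys %hash;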
Answer 1 (score: 1)
There's no need for that complication. If you sort the fields within each record so that any given pair of values always appears in the same order, then you can simply print a record if its contents haven't been seen before.
use strict;
use warnings 'all';
my %seen;
while ( <DATA> ) {
    # Sort only the two identifier fields so that a swapped pair produces the
    # same key; sorting the value in as well could make distinct pairs collide
    my @pair = sort( (split /,/)[0, 1] );
    print unless $seen{"@pair"}++;
}
__DATA__
AB006589__ESR2,BC024181__ESR2,0.47796
AB006589__ESR2,X55739__CSN2,0.47232
AB006589__ESR2,NM_004991__MDS1,0.46704
AB006589__ESR2,NM_003476__CSRP3,0.45767
AB006589__ESR2,NM_012101__TRIM29,0.45094
AB006589__ESR2,NM_006897__HOXC9,0.41748
AB006589__ESR2,NM_000278__PAX2,0.4161
NM_003476__CSRP3,AB006589__ESR2,0.45767
NM_012101__TRIM29,AB006589__ESR2,0.45094
NM_006897__HOXC9,AB006589__ESR2,0.41748
NM_000278__PAX2,AB006589__ESR2,0.4161
AB006589__ESR2,BC024181__ESR2,0.47796
AB006589__ESR2,X55739__CSN2,0.47232
AB006589__ESR2,NM_004991__MDS1,0.46704
AB006589__ESR2,NM_003476__CSRP3,0.45767
AB006589__ESR2,NM_012101__TRIM29,0.45094
AB006589__ESR2,NM_006897__HOXC9,0.41748
AB006589__ESR2,NM_000278__PAX2,0.4161
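
For reference, running this against its __DATA__ section should print just the seven distinct records, with every swapped and exact duplicate suppressed:

AB006589__ESR2,BC024181__ESR2,0.47796
AB006589__ESR2,X55739__CSN2,0.47232
AB006589__ESR2,NM_004991__MDS1,0.46704
AB006589__ESR2,NM_003476__CSRP3,0.45767
AB006589__ESR2,NM_012101__TRIM29,0.45094
AB006589__ESR2,NM_006897__HOXC9,0.41748
AB006589__ESR2,NM_000278__PAX2,0.4161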