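Comparing two files. It sounds simple, but comparing two files where one piece of information is allowed to be flexible has been very challenging for me.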
fileA
4 "dup" 37036335 37044984
3 "dup" 100146708 100147504
7 "del" 100 203
2 "dup" 34 89
fileB
4 "dup" 37036335 37036735
3 "dup" 100146708 100147504
4 "dup" 68 109
Expected output:
output_file1 (matching hits)
fileA: 4 "dup" 37036335 37044984
fileB: 4 "dup" 37036335 37036735
fileA: 3 "dup" 100146708 100147504
fileB: 3 "dup" 100146708 100147504
output_file2 (found in fileA, but not in FileB including non-overlap)
7 "del" 100 203
2 "dup" 34 89
output_file3 (found in fileB, but not in FileA including non-overlap)
4 "dup" 68 109
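The catch is... I need fields 1 and 2 in the first file to match the second file exactly, and the coordinates in fields 3 and 4 to either match exactly or overlap.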
This would mean these are the same.
fileA :4 "dup" 37036335 37044984
fileB :4 "dup" 37036335 37036735
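A minimal sketch of that matching rule, assuming whitespace-separated fields and a start that is never greater than the end on any line. The name records_match is hypothetical and only used for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 when two records match under the rule
# described above, 0 otherwise.
# Assumption: fields are whitespace-separated and start <= end on each line.
sub records_match {
    my ($rec_a, $rec_b) = @_;
    my ($chr_a, $type_a, $start_a, $end_a) = split ' ', $rec_a;
    my ($chr_b, $type_b, $start_b, $end_b) = split ' ', $rec_b;

    # Fields 1 and 2 must be identical (string comparison).
    return 0 unless $chr_a eq $chr_b && $type_a eq $type_b;

    # Two closed ranges overlap when each starts no later than the other ends.
    return ($start_a <= $end_b && $start_b <= $end_a) ? 1 : 0;
}

# Same chromosome and type, overlapping ranges
print records_match('4 "dup" 37036335 37044984', '4 "dup" 37036335 37036735'), "\n";
# Same chromosome and type, ranges far apart
print records_match('4 "dup" 37036335 37044984', '4 "dup" 68 109'), "\n";
```

The first call prints 1 and the second prints 0. Ranges that only touch at an endpoint count as overlapping here, which is a choice of this sketch.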
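I also need to find the differences between the two files (no overlap, a line that exists in one file but not in the other, etc.).
Here is the gist of what I have tried. I have written this code 4 different ways and, alas, still no success. I have put both files into arrays (I have tried hashes too... idk).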
## if no hits in original, but hits in calculated
if((! @ori) && (@calc)){}
## if CNV calls in original, but none in calculated
if((@ori) && (! @calc)){}
## if CNV calls in both
if((@ori) && (@calc)){
    ## compare calls with double 'for' loop
    foreach my $o (@ori){
        my @o = split(/\s/,$o);
        my $Ochromosome = $o[0];
        my $Ostart = $o[2];
        my $Oend = $o[3];
        my $Otype = $o[1];
        foreach my $c (@calc){
            my @c = split(/\s/,$c);
            my $Cchromosome = $c[0];
            my $Cstart = $c[2];
            my $Cend = $c[3];
            my $Ctype = $c[1];
            ## check chromosome and type here
            if(($Ochromosome eq $Cchromosome) && ($Otype eq $Ctype)){ ## what if there are two duplications on the same chromosome?
                ## check coordinates
                if(($Ostart <= $Cend) && ($Cstart <= $Oend)){
                    ## overlap
                }else{
                    ## noOverlap
                }
            }else{
                ## what if there is something found in one, but not in the other and they both have calls?
                ## ahhhh
            }
        }
    }
}
Answer 0 (score: 1)
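Here is a simple solution that works.
Iterate over the lines of one file, checking each against all lines of the other (until a match is found). Given everything that needs to be collected, we need at least that much complexity.
If a line from A is not found in B, it is added to @not_in_B. To determine which lines of B are not in A, we prepare a hash in which each line of B is a key with value 0. Once (if) a line of B is matched, the value for its key is set to 1. Lines of B that are never matched, extra lines included, still have value 0 at the end; they go into @not_in_A.
For simplicity, both files are first read into arrays (the inner loop needs an array anyway).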
use warnings;
use strict;
use feature 'say';

my $f1 = 'f1.txt';
my $f2 = 'f2.txt';

# Read both files into arrays, one line per element
open my $fh, '<', $f1 or die "Can't open $f1: $!";
my @a1 = <$fh>; chomp(@a1);
open $fh, '<', $f2 or die "Can't open $f2: $!";
my @a2 = <$fh>; chomp(@a2);
close $fh;

my (@not_in_A, @not_in_B);
# Every line of B starts out as "not found in A"
my %Bs_in_A = map { $_ => 0 } @a2;

foreach my $e1 (@a1)
{
    my $match = 0;
    foreach my $e2 (@a2)
    {
        if ( lines_match($e1, $e2) ) {
            $match = 1;
            say "Match:\n\tf1: $e1\n\tf2: $e2";
            $Bs_in_A{$e2} = 1;
            last;    # stop at the first matching line of B
        }
    }
    push @not_in_B, $e1 if not $match;
}
# Lines of B that were never matched
@not_in_A = grep { $Bs_in_A{$_} == 0 } keys %Bs_in_A;

say '---';
say "Elements of A that are not in B:";
say "\t$_" for @not_in_B;
say "Elements of B that are not in A:";
say "\t$_" for @not_in_A;

sub lines_match
{
    my ($l1, $l2) = @_;
    my @t1 = split ' ', $l1;
    my @t2 = split ' ', $l2;
    # First two fields must be the same
    return if $t1[0] ne $t2[0] or $t1[1] ne $t2[1];
    # Third-to-fourth-field ranges must overlap
    return
        if ($t1[2] < $t2[2] and $t1[3] < $t2[2])
        or ($t1[2] > $t2[3] and $t1[3] > $t2[3]);
    return 1; # match
}
Output
Match:
    f1: 4 "dup" 37036335 37044984
    f2: 4 "dup" 37036335 37036735
Match:
    f1: 3 "dup" 100146708 100147504
    f2: 3 "dup" 100146708 100147504
---
Elements of A that are not in B:
    7 "del" 100 203
    2 "dup" 34 89
Elements of B that are not in A:
    4 "dup" 68 109
Note that in the names I use 1 in place of A and 2 in place of B.
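Keying the bookkeeping hash on the line text means identical lines in B share one key, and keys returns them in no particular order. Below is a hedged variant that tracks B by position instead and returns the three groups the question asks for. The names overlaps and classify are hypothetical, and unlike the code above this sketch does not stop at the first match, so a line of A is paired with every overlapping line of B.

```perl
use strict;
use warnings;

# Hypothetical helper applying the same rule as lines_match above.
# Assumption: whitespace-separated fields and start <= end on each line.
sub overlaps {
    my @t1 = split ' ', $_[0];
    my @t2 = split ' ', $_[1];
    return 0 if $t1[0] ne $t2[0] or $t1[1] ne $t2[1];
    return ($t1[2] <= $t2[3] && $t2[2] <= $t1[3]) ? 1 : 0;
}

# Returns three array refs: matched [A, B] pairs, lines only in A,
# and lines only in B, each in original file order.
sub classify {
    my ($lines_a, $lines_b) = @_;
    my (@pairs, @only_a);
    my @b_seen = (0) x @$lines_b;    # one flag per line of B, by position

    for my $a_line (@$lines_a) {
        my $found = 0;
        for my $i (0 .. $#$lines_b) {
            next unless overlaps($a_line, $lines_b->[$i]);
            push @pairs, [ $a_line, $lines_b->[$i] ];
            $b_seen[$i] = 1;
            $found = 1;
        }
        push @only_a, $a_line unless $found;
    }

    my @only_b = map { $lines_b->[$_] } grep { !$b_seen[$_] } 0 .. $#$lines_b;
    return (\@pairs, \@only_a, \@only_b);
}

# Sample data from the question
my @file_a = ('4 "dup" 37036335 37044984', '3 "dup" 100146708 100147504',
              '7 "del" 100 203', '2 "dup" 34 89');
my @file_b = ('4 "dup" 37036335 37036735', '3 "dup" 100146708 100147504',
              '4 "dup" 68 109');

my ($pairs, $only_a, $only_b) = classify(\@file_a, \@file_b);
print "fileA: $_->[0]\nfileB: $_->[1]\n" for @$pairs;
print "only in A: $_\n" for @$only_a;
print "only in B: $_\n" for @$only_b;
```

Each of the three returned lists can then be printed to its own output file with an ordinary open and print.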