Compare two files, where one piece of information can be flexible

Date: 2017-01-05 21:09:24

Tags: perl

Comparing two files sounds simple, but comparing two files where one piece of information can be flexible has proven very challenging for me.

fileA 
4 "dup" 37036335 37044984   
3 "dup" 100146708 100147504 
7 "del" 100 203
2 "dup" 34 89

fileB
4 "dup" 37036335 37036735
3 "dup" 100146708 100147504
4 "dup" 68 109

Expected output:

output_file1 (matching hits)
fileA: 4 "dup" 37036335 37044984
fileB: 4 "dup" 37036335 37036735

fileA: 3 "dup" 100146708 100147504
fileB: 3 "dup" 100146708 100147504

output_file2 (found in fileA but not in fileB, including non-overlaps)
7 "del" 100 203
2 "dup" 34 89

output_file3 (found in fileB but not in fileA, including non-overlaps)
4 "dup" 68 109

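The criteria are: fields 1 and 2 of a line in the first file must match the second file exactly, and the coordinates in fields 3 and 4 must either match exactly or overlap.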

This would mean these are the same.
fileA :4 "dup" 37036335 37044984 
fileB :4 "dup" 37036335 37036735
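
Put as code, the rule reads roughly as below. This is only a sketch of the criterion: the helper name same_call is made up for illustration, and it assumes that start <= end on every line.

```perl
use strict;
use warnings;

# Hypothetical helper: true when two lines agree on fields 1 and 2
# and their [start, end] ranges (fields 3 and 4) share at least one position.
# Assumes start <= end on each line.
sub same_call {
    my ($line_a, $line_b) = @_;
    my ($chr_a, $type_a, $start_a, $end_a) = split ' ', $line_a;
    my ($chr_b, $type_b, $start_b, $end_b) = split ' ', $line_b;

    # Fields 1 and 2 are compared as strings and must be identical
    return 0 if $chr_a ne $chr_b or $type_a ne $type_b;

    # Two closed ranges overlap when each one starts before the other ends
    return ($start_a <= $end_b and $start_b <= $end_a) ? 1 : 0;
}

print same_call('4 "dup" 37036335 37044984', '4 "dup" 37036335 37036735'), "\n";  # prints 1
print same_call('4 "dup" 37036335 37044984', '4 "dup" 68 109'), "\n";             # prints 0
```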

I also need to find the differences between the two files (ranges that do not overlap, a line that exists in one file but not in the other, etc.).

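Here is the gist of what I have tried. I have written this code four different ways and, alas, still no success. I have read both files into arrays (I have tried hashes too... idk).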

## if no hits in original, but hits in calculated
if ((! @ori) && (@calc)) {}

## if CNV calls in original, but none in calculated
if ((@ori) && (! @calc)) {}

## if CNV calls in both
if ((@ori) && (@calc)) {

    ## compare calls with double 'for' loop
    foreach my $l (@ori) {

        my @l = split(/\s/, $l);
        my $Ochromosome = $l[0];
        my $Ostart = $l[2];
        my $Oend = $l[3];
        my $Otype = $l[1];

        foreach my $l (@calc) {

            my @l = split(/\s/, $l);
            my $Cchromosome = $l[0];
            my $Cstart = $l[2];
            my $Cend = $l[3];
            my $Ctype = $l[1];

            ## check chromosome and type here
            if (($Ochromosome eq $Cchromosome) && ($Otype eq $Ctype)) { ## what if there are two duplications on the same chromosome?
                ## check coordinates
                if (($Ostart <= $Cend) && ($Cstart <= $Oend)) {
                    ## overlap
                } else {
                    ## noOverlap
                }
            } else {
                ## what if there is something found in one, but not in the other and they both have calls?
                ## ahhhh
            }
        }
    }
}

1 Answer:

Answer 0 (score: 1)

Here is a simple solution that is reasonably efficient.

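Iterate over the lines of one file, checking each against all lines of the other (until a match is found). Given all the information that has to be collected, we need at least this much complexity: in the worst case every line of one file is compared with every line of the other.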
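
If the files are large, one way to trim the work is to group the lines of the second file by their first two fields, so that each line of the first file is only compared with lines sharing its key. This is a variation on the nested loop and not something the problem requires; the worst case stays quadratic when all lines share one key. A minimal sketch, in which the helper names index_by_key and find_overlap are hypothetical and the sample lines are taken from the question:

```perl
use strict;
use warnings;

# Hypothetical helper: group lines by "field1 field2", keeping the ranges.
sub index_by_key {
    my @lines = @_;
    my %index;
    for my $line (@lines) {
        my ($chr, $type, $start, $end) = split ' ', $line;
        push @{ $index{"$chr $type"} }, [ $start, $end, $line ];
    }
    return \%index;
}

# Hypothetical helper: return the first indexed line whose range overlaps
# the range of $line, or undef when there is none.
sub find_overlap {
    my ($index, $line) = @_;
    my ($chr, $type, $start, $end) = split ' ', $line;
    for my $cand (@{ $index->{"$chr $type"} || [] }) {
        my ($c_start, $c_end, $c_line) = @$cand;
        return $c_line if $start <= $c_end and $c_start <= $end;
    }
    return;
}

my $index = index_by_key(
    '4 "dup" 37036335 37036735',
    '3 "dup" 100146708 100147504',
    '4 "dup" 68 109',
);
my $hit = find_overlap($index, '4 "dup" 37036335 37044984');
print defined $hit ? "Match: $hit\n" : "No match\n";
```

Only the two lines keyed 4 "dup" are examined for the lookup above; the line keyed 3 "dup" is never touched.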

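If a line from A is not found in B, it is added to @not_in_B. To determine which lines of B are not in A, we prepare a hash in which each element of B is a key with value 0. Once (if) an element of B is matched, the value for its key is set to 1. The elements of B that are never matched are the ones that are not in A, the extra ones. They go into @not_in_A.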

For simplicity, both files are first read into arrays (which the inner loop needs anyway).

use warnings;
use strict;
use feature 'say';

my $f1 = 'f1.txt';
my $f2 = 'f2.txt';

open my $fh, '<', $f1 or die "Can't open $f1: $!";
my @a1 = <$fh>; chomp(@a1);
close $fh;
open $fh, '<', $f2 or die "Can't open $f2: $!";
my @a2 = <$fh>; chomp(@a2);
close $fh;

my (@not_in_A, @not_in_B);
my %Bs_in_A = map { $_ => 0 } @a2;

foreach my $e1 (@a1)
{
    my $match = 0;
    foreach my $e2 (@a2) 
    {
        if ( lines_match($e1, $e2) ) { 
            $match = 1;
            say "Match:\n\tf1: $e1\n\tf2: $e2";
            $Bs_in_A{$e2} = 1;
            last;
        }
    }   
    push @not_in_B, $e1 if not $match;
}
# Walk @a2 rather than keys %Bs_in_A so the lines come out in file order
@not_in_A = grep { $Bs_in_A{$_} == 0 } @a2;

say '---';    
say "Elements of A that are not in B:";
say "\t$_" for @not_in_B;
say "Elements of B that are not in A:";
say "\t$_" for @not_in_A;


sub lines_match
{
    my ($l1, $l2) = @_; 
    my @t1 = split ' ', $l1;
    my @t2 = split ' ', $l2;

    # First two fields must be the same
    return if $t1[0] ne $t2[0] or $t1[1] ne $t2[1];

    # Third-to-fourth-field ranges must overlap
    return
        if ($t1[2] < $t2[2] and $t1[3] < $t2[2])
        or ($t1[2] > $t2[3] and $t1[3] > $t2[3]);

    return 1;  # match
}

Output

Match:
        f1: 4 "dup" 37036335 37044984   
        f2: 4 "dup" 37036335 37036735
Match:
        f1: 3 "dup" 100146708 100147504 
        f2: 3 "dup" 100146708 100147504
---
Elements of A that are not in B:
        7 "del" 100 203
        2 "dup" 34 89
Elements of B that are not in A:
        4 "dup" 68 109

Note that I use 1 in place of A and 2 in place of B (f1 for fileA, f2 for fileB).
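
The question asks for three separate output files, whereas the program above prints everything to STDOUT. One possible way to write the files instead is sketched below. The helper name write_lines and the sample data are assumptions for illustration; the file names are the ones from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: write each element of @lines to $file, one per row.
# Returns the number of lines written.
sub write_lines {
    my ($file, @lines) = @_;
    open my $out, '>', $file or die "Can't open $file for writing: $!";
    print {$out} "$_\n" for @lines;
    close $out or die "Can't close $file: $!";
    return scalar @lines;
}

# Sample data standing in for what the nested loop would collect.
my @matches  = ( [ '3 "dup" 100146708 100147504', '3 "dup" 100146708 100147504' ] );
my @not_in_B = ( '7 "del" 100 203', '2 "dup" 34 89' );
my @not_in_A = ( '4 "dup" 68 109' );

# Each match becomes a fileA line, a fileB line, and a blank separator line.
write_lines('output_file1', map { ("fileA: $_->[0]", "fileB: $_->[1]", '') } @matches);
write_lines('output_file2', @not_in_B);
write_lines('output_file3', @not_in_A);
```

In the program above, @matches could be filled by pushing [$e1, $e2] at the point where the match is currently printed with say.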