Match columns across different rows and sum them

Time: 2014-10-27 22:18:03

Tags: perl

I have a CSV file of about 160,000 lines that looks like this:

chr1,160,161,3,0.333333333333333,+         
chr1,161,162,4,0.5,-      
chr1,309,310,14,0.0714285714285714,+     
chr1,311,312,2,0.5,-     
chr1,499,500,39,0.717948717948718,+     
chr2,500,501,8,0.375,-      
chr2,510,511,18,0.5,+         
chr2,511,512,6,0.333333333333333,-    

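I want to pair up rows where column 1 is the same, column 3 of one row matches column 2 of the other, and column 6 is '+' in one row and '-' in the other. When all of that holds, I want to sum columns 4 and 5 of the two rows.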

My desired output would be

chr1,160,161,7,0.833333333333333,+         
chr1,309,310,14,0.0714285714285714,+     
chr1,311,312,2,0.5,-     
chr1,499,500,39,0.717948717948718,+     
chr2,500,501,8,0.375,-      
chr2,510,511,24,0.833333333333333,-  

The best solution I could come up with is to make a copy of the file and then match columns between the file and its copy in Perl:

#!/usr/bin/perl             
use strict;      
use warnings;          
open my $firstfile, '<', $ARGV[0] or die "$!";         
open my $secondfile, '<', $ARGV[1] or die "$!";            
my ($chr_a, $chr_b,$start,$end,$begin,$finish, $sum_a, $sum_b, $total_a, 
    $total_b,$sign_a,$sign_b);             

while (<$firstfile>) {
    my @col = split /,/;
    $chr_a  = $col[0];
    $start  = $col[1];
    $end    = $col[2];
    $sum_a  = $col[3];
    $total_a = $col[4];
    $sign_a = $col[5];

    seek($secondfile,0,0);
    while (<$secondfile>) {
       my @seccol = split /,/;
       $chr_b     = $seccol[0];
       $begin     = $seccol[1];
       $finish    = $seccol[2];
       $sum_b     = $seccol[3];
       $total_b   = $seccol[4];
       $sign_b    = $seccol[5];

       print join ("\t", $col[0], $col[1], $col[2], $col[3]+=$seccol[3], 
                         $col[4]+=$seccol[4], $col[5]), 
           "\n" if ($chr_a eq $chr_b and $end==$begin and $sign_a ne $sign_b);
    }

}

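This works fine, but ideally I would like to do this within the file itself, without having to copy it. I have many files, so I want to run the script over all of them in a way that is less time-consuming. Thanks.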

1 answer:

Answer 0: (score: 1)

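In the absence of a response to my comment, this program does what you ask with the data you provided. It reads the input once, keeps the previous row in @last, and merges a row into it only when the two rows are adjacent, share column 1, column 3 of the first equals column 2 of the second, and the signs are '+' followed by '-'.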

use strict;
use warnings;

my @last;

while (<DATA>) {
  s/\s+\z//;
  my @line = split /,/;

  if (@last
      and $last[0] eq $line[0]
      and $last[2] eq $line[1]
      and $last[5] eq '+' and $line[5] eq '-') {

    $last[3] += $line[3];
    $last[4] += $line[4];
    print join(',', @last), "\n";
    @last = ()
  }
  else {
    print join(',', @last), "\n" if @last;
    @last = @line;
  }
}

print join(',', @last), "\n" if @last;

__DATA__
chr1,160,161,3,0.333333333333333,+         
chr1,161,162,4,0.5,-      
chr1,309,310,14,0.0714285714285714,+     
chr1,311,312,2,0.5,-     
chr1,499,500,39,0.717948717948718,+     
chr2,500,501,8,0.375,-      
chr2,510,511,18,0.5,+         
chr2,511,512,6,0.333333333333333,-

Output

chr1,160,161,7,0.833333333333333,+
chr1,309,310,14,0.0714285714285714,+
chr1,311,312,2,0.5,-
chr1,499,500,39,0.717948717948718,+
chr2,500,501,8,0.375,-
chr2,510,511,24,0.833333333333333,+
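
Since the question mentions having many files, here is a minimal sketch of how the same single-pass logic might be wrapped so it runs over every file named on the command line instead of reading from DATA. It is not part of the original answer: the helper name merge_pairs and the ".merged" output suffix are assumptions made for illustration. It keeps the same limitation as the program above, in that only adjacent rows with '+' followed by '-' are merged.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# merge_pairs is a hypothetical helper name, not from the original answer.
# It takes a list of CSV lines and returns the merged lines, applying the
# same adjacent-row rule as the program above: same column 1, column 3 of
# the first row equals column 2 of the next, and signs '+' then '-'.
sub merge_pairs {
    my @out;
    my @last;

    for my $row (@_) {
        (my $text = $row) =~ s/\s+\z//;   # strip trailing whitespace and newline
        next unless length $text;         # skip blank lines
        my @line = split /,/, $text;

        if (@last
            and $last[0] eq $line[0]
            and $last[2] eq $line[1]
            and $last[5] eq '+' and $line[5] eq '-') {

            # Pair found: add columns 4 and 5 into the '+' row and emit it.
            $last[3] += $line[3];
            $last[4] += $line[4];
            push @out, join(',', @last);
            @last = ();
        }
        else {
            # No pair: emit the held row unchanged and hold the current one.
            push @out, join(',', @last) if @last;
            @last = @line;
        }
    }

    push @out, join(',', @last) if @last;
    return @out;
}

# Each input file is read exactly once; no second copy is needed.
# Writing to "<name>.merged" is an assumed naming choice for this sketch.
for my $file (@ARGV) {
    open my $in, '<', $file or die "Cannot open $file: $!";
    my @merged = merge_pairs(<$in>);
    close $in;

    open my $out, '>', "$file.merged" or die "Cannot write $file.merged: $!";
    print {$out} "$_\n" for @merged;
    close $out;
}
```

Assuming the script is saved as merge.pl (a made-up name), it could be invoked as `perl merge.pl a.csv b.csv`, which would leave a.csv.merged and b.csv.merged next to the inputs. The sketch reads each file fully into memory before merging, which should be acceptable at roughly 160,000 lines; for much larger inputs, the line-by-line loop of the original answer would be the safer shape.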