Question

I have two tab-delimited tables:

table1

col1    col2    col3    col4
id1     chr1     1       10
id2     chr1     15      20
id3     chr1     30      35


table2

col1    col2    col3
rs1     5       chr1
rs2     11      chr1
rs3     34      chr1
rs4     35      chr1

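I would like to check whether there is a value in col2 of table2 that lies between the values of col3 and col4 of table1. If so, I would like to print the corresponding values of col1 and col2 (from table2) into a new column of table1.

So in this example, the final result file should look like this: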

 table output
 col1    col2   col3   col4   new_col1    
 id1     chr1    1      10     rs1:5
 id2     chr1    15     20     
 id3     chr1    30     35     rs3:34, rs4:35

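I have a few problems here:
- I think I should use 2 while loops.
- Normally, if I want to store values, I use a hash and then check whether another table has a match for that value. But here I have to store 2 values, because I need to see whether the value from table2 falls within the range given by two values in table1.
- How do I store the values in new_col1?

I thought of something like this to store the ranges (I'm working in Perl):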

my @range;
while (<$table1>){
    my @cols = split (/\t/);
    $range[$_] .= "$range" for $cols[$2] .. $cols[$3]; #store the ranges
}
chop @range;

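But how do I compare this with $table2?

Update: I not only have to check whether there is a value in col2 of table2 that lies between the values of col3 and col4 of table1. I also need to check whether col2 of table1 matches col3 of table2. Only if there is a match should the first check I described be done (the value in col2 of table2 lying between the values of col3 and col4 of table1).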

Answer 1

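This does what you ask. It works by reading all of the information from table2 into the array @table2. It then processes table1 line by line, computing the fifth column from the data stored in @table2, and prints the result to STDOUT.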

use strict;
use warnings;
use 5.010;
use autodie;

my @table2;
open my $fh, '<', 'table2.txt';
while (<$fh>) {
  my @columns = split;
  next if $columns[1] =~ /\D/;
  push @table2, \@columns;
}

open $fh, '<', 'table1.txt';
while (<$fh>) {
  my @columns = split;
  if ( grep /\D/, @columns[2,3] ) {
    push @columns, 'new_col1';
  }
  else {
    my @matches = grep { $_->[1] >= $columns[2] and $_->[1] <= $columns[3]  } @table2;
    push @columns, join(', ', map join(':', @$_), @matches);
  }
  print join("\t", @columns), "\n";
}

Output

col1  col2  col3  col4  new_col1
id1 ... 1 10  rs1:5
id2 ... 15  20  
id3 ... 30  35  rs3:34, rs4:35
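
The update in the question also asks that the chromosome in col2 of table1 match col3 of table2 before the range is tested. A minimal sketch of that extra condition follows, assuming table2 rows are laid out as id, position, chromosome, as in the question. The helper name matches_for is hypothetical, and the rows are inlined here rather than read from files.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the answer above): returns the "id:pos"
# labels of the table2 rows that sit on the same chromosome as the interval
# and whose position lies within [start, end], bounds included.
sub matches_for {
    my ($chr, $start, $end, $table2) = @_;
    return join ', ',
        map  { $_->[0] . ':' . $_->[1] }
        grep { $_->[2] eq $chr and $_->[1] >= $start and $_->[1] <= $end }
        @$table2;
}

# table2 rows from the question: id, position, chromosome
my @table2 = (
    [ 'rs1', 5,  'chr1' ],
    [ 'rs2', 11, 'chr1' ],
    [ 'rs3', 34, 'chr1' ],
    [ 'rs4', 35, 'chr1' ],
);

# table1 rows from the question: id, chromosome, start, end
my @table1 = (
    [ 'id1', 'chr1', 1,  10 ],
    [ 'id2', 'chr1', 15, 20 ],
    [ 'id3', 'chr1', 30, 35 ],
);

for my $row (@table1) {
    my ($id, $chr, $start, $end) = @$row;
    print join("\t", @$row, matches_for($chr, $start, $end, \@table2)), "\n";
}
```

Only the first two fields of each table2 row go into the label, so the chromosome column does not end up in new_col1.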

Answer 2

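I think you are approaching this problem backwards. Parsing table2 into a hash first makes it easier, because you can then iterate over table1 and check for any values within the relevant range.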

use strict;
use warnings;
use Data::Dumper;

my %table2;

while (<DATA>) {
    #stop reading if we've finished with table2
    last if m/^table1/;

    next unless m/^rs/;
    my ( $col1, $col2 ) = split(/\s+/);
    $table2{$col1} = $col2;
}

print Dumper \%table2;

while (<DATA>) {

    next unless m/^id/;
    chomp;
    my ( $rowid, $col2, $lower, $upper ) = split(/\s+/);
    my $newcol = "";
    foreach my $rs ( keys %table2 ) {
        if (    $table2{$rs} >= $lower
            and $table2{$rs} <= $upper )
        {
            $newcol .= " $rs:$table2{$rs}";
        }
    }
    print join( "\t", $rowid, $col2, $lower, $upper, $newcol, ), "\n";
}


__DATA__
table2
col1    col2
rs1     5   
rs2     11
rs3     34
rs4     35

table1
col1    col2    col3    col4
id1     ...     1       10
id2     ...     15      20
id3     ...     30      35

Output

$VAR1 = {
          'rs1' => '5',
          'rs2' => '11',
          'rs4' => '35',
          'rs3' => '34'
        };
id1 ... 1 10   rs1:5
id2 ... 15  20  
id3 ... 30  35   rs4:35 rs3:34
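
Because Perl hash keys come back in no particular order, the matches above are printed in arbitrary order (rs4 before rs3 here). One possible variation, sketched under the assumption that table2 also carries the chromosome column from the question's update, is to index the rows by chromosome and keep each group sorted by position. The helper names index_by_chr and lookup, and the field names id, chrom and position, are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: group table2 rows by chromosome and sort each group
# by numeric position so that output order is deterministic.
sub index_by_chr {
    my (@rows) = @_;
    my %by_chr;
    push @{ $by_chr{ $_->{chrom} } }, $_ for @rows;
    for my $chrom ( keys %by_chr ) {
        $by_chr{$chrom} =
            [ sort { $a->{position} <=> $b->{position} } @{ $by_chr{$chrom} } ];
    }
    return \%by_chr;
}

# Hypothetical helper: "id:position" labels for rows on $chrom whose position
# lies within [lower, upper]; an unknown chromosome yields an empty string.
sub lookup {
    my ($index, $chrom, $lower, $upper) = @_;
    my @hits = grep { $_->{position} >= $lower and $_->{position} <= $upper }
               @{ $index->{$chrom} || [] };
    return join ', ', map { $_->{id} . ':' . $_->{position} } @hits;
}

# Sample rows taken from the question, deliberately given out of order.
my $index = index_by_chr(
    { id => 'rs4', chrom => 'chr1', position => 35 },
    { id => 'rs1', chrom => 'chr1', position => 5 },
    { id => 'rs3', chrom => 'chr1', position => 34 },
    { id => 'rs2', chrom => 'chr1', position => 11 },
);

for my $row ( [ 'id1', 'chr1', 1, 10 ], [ 'id2', 'chr1', 15, 20 ], [ 'id3', 'chr1', 30, 35 ] ) {
    my ($id, $chrom, $lower, $upper) = @$row;
    print join("\t", @$row, lookup($index, $chrom, $lower, $upper)), "\n";
}
```

Grouping by chromosome also means each table1 row is compared only against positions on its own chromosome rather than against every table2 entry.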
