How to group consecutive numbers in Perl

Time: 2011-04-25 03:09:19

Tags: perl

I have input data that looks like this:

seq   75      T   G   - 
seq   3185    A   R   +
seq   3382    A   R   +
seq   4923    C   -   + *
seq   4924    C   -   + *
seq   4925    T   -   + *
seq   5252    A   W   +
seq   7400    T   C   -
seq   16710   C   -   - #
seq   18248   T   C   -
seq   18962   C   -   + *
seq   18963   A   -   + *
seq   18964   T   -   + *
seq   18965   A   -   + *
seq   19566   A   M   +

The input above is already sorted by the 2nd column.

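What I want to do is:

  1. Process only the lines whose 4th column is "-".
  2. If those lines have consecutive positions (2nd column), group them.
  3. Represent each group as a single new line, using the lowest position as the new position and concatenating the grouped letters into a new string.
  4. So the expected output is: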

    seq   75      T   G   -   
    seq   3185    A   R   +
    seq   3382    A   R   +
    seq   4923    CCT   -   + **
    seq   5252    A   W   +
    seq   7400    T   C   -
    seq   16710   C   -   - #
    seq   18248   T   C   -
    seq   18962   CATA   -   + **
    seq   19566   A   M   +
    
    ** marks the new lines/strings formed from the * lines in the input.
    The # line is kept as it is because no consecutive position follows it.
    

I'm stuck with the following logic and don't know how to continue:

    use Data::Dumper;    # needed for the Dumper call below

    while ( <> ) {
        chomp;
    
        my @els = split(/\s+/,$_);
    
        # Process indel
        my @temp = ();
        if ( $els[3] eq "-"  ) {
            push @temp, $_;
        }
    
         # How can I group them appropriately.
         print Dumper \@temp ;
    
         # And print accordingly to input ordering
    
    }
    

1 Answer:

Answer 0 (score: 5)

This is a variation on control-break reporting. This code seems to do the job:

use strict;
use warnings;

my($prev) = -100;
my($grp0) = $prev;
my($col2, $col4);

sub print_group
{
    my($grp0, $col2, $col3, $col4) = @_;
    printf "seq   %-5d  %-4s  %s  %s\n", $grp0, $col2, $col3, $col4
        if ($grp0 > 0);
}

while (<>)
{
    chomp;
    my @els = split(/\s+/,$_);
    if ($els[3] ne "-")
    {
        print_group($grp0,   $col2,   "-",     $col4);
        print_group($els[1], $els[2], $els[3], $els[4]);
        $prev = -100;
        $grp0 = -100;
        $col2 = "";
        $col4 = "";
    }
    elsif ($els[1] == $prev + 1)
    {
        $grp0  = $prev if $grp0 < 0;
        $prev  = $els[1];
        $col2 .= $els[2];
        $col4  = $els[4];
    }
    else
    {
        print_group($grp0, $col2, "-", $col4);
        $prev = $els[1];
        $grp0 = $els[1];
        $col2 = $els[2];
        $col4 = $els[4];
    }
}

# Flush a group that runs to the end of the input; print_group takes four arguments.
print_group($grp0, $col2, "-", $col4);

Sample output:

seq   75     T     G  -
seq   3185   A     R  +
seq   3382   A     R  +
seq   4923   CCT   -  +
seq   5252   A     W  +
seq   7400   T     C  -
seq   16710  C     -  -
seq   18248  T     C  -
seq   18962  CATA  -  +
seq   19566  A     M  +

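This output is more uniform than that of the previous version, but the basic logic is very similar to before. The output is always generated by the same function, so everything is as uniform as possible.

Getting the conditions right can be quite tricky; it took several (too many) iterations before this code produced the expected output.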
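One way to make the conditions easier to reason about is to keep the pending run in a single data structure and flush it from one place. The following is only a sketch of that idea, not the code from the answer: `group_indels`, `$flush`, and the variable names `$letter`, `$col4`, and `$col5` are hypothetical labels chosen for illustration, and the output format is copied from the `printf` above. Unlike the original, it keeps the 5th column of the first line in a run rather than the last. On the sample input from the question it should print the same ten lines as the sample output above.

```perl
use strict;
use warnings;

# Hypothetical helper: takes the input lines, returns the grouped output lines.
sub group_indels {
    my @lines = @_;
    my @out;
    my $group;    # pending run: { start, last, letters, col5 }, or undef

    # Emit the pending run, if there is one, and clear it.
    my $flush = sub {
        return unless $group;
        push @out, sprintf("seq   %-5d  %-4s  %s  %s",
            $group->{start}, $group->{letters}, '-', $group->{col5});
        undef $group;
    };

    for my $line (@lines) {
        # split ' ' ignores leading/trailing whitespace; extra fields (* or #) are dropped.
        my (undef, $pos, $letter, $col4, $col5) = split ' ', $line;
        next unless defined $col5;    # skip blank or short lines

        if ($col4 eq '-' && $group && $pos == $group->{last} + 1) {
            # Consecutive "-" line: extend the pending run.
            $group->{last}     = $pos;
            $group->{letters} .= $letter;
            next;
        }

        $flush->();    # any other line ends the pending run
        if ($col4 eq '-') {
            # Start a new run at this position.
            $group = { start => $pos, last => $pos, letters => $letter, col5 => $col5 };
        }
        else {
            push @out, sprintf("seq   %-5d  %-4s  %s  %s", $pos, $letter, $col4, $col5);
        }
    }
    $flush->();    # do not lose a run that ends the input
    return @out;
}

# Sample input copied from the question.
my @sample = split /\n/, <<'END';
seq   75      T   G   -
seq   3185    A   R   +
seq   3382    A   R   +
seq   4923    C   -   + *
seq   4924    C   -   + *
seq   4925    T   -   + *
seq   5252    A   W   +
seq   7400    T   C   -
seq   16710   C   -   - #
seq   18248   T   C   -
seq   18962   C   -   + *
seq   18963   A   -   + *
seq   18964   T   -   + *
seq   18965   A   -   + *
seq   19566   A   M   +
END

print "$_\n" for group_indels(@sample);
```

Because every line either extends the pending run or flushes it first, there are only two places where output is produced, and the final `$flush->()` covers input that ends in the middle of a run.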