I have input data that looks like this:
seq 75 T G -
seq 3185 A R +
seq 3382 A R +
seq 4923 C - + *
seq 4924 C - + *
seq 4925 T - + *
seq 5252 A W +
seq 7400 T C -
seq 16710 C - - #
seq 18248 T C -
seq 18962 C - + *
seq 18963 A - + *
seq 18964 T - + *
seq 18965 A - + *
seq 19566 A M +
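The input above is already sorted by the second column.
What I want to do is merge the lines whose fourth column is "-" and whose positions (second column) are consecutive into a single line, concatenating their third-column bases.
So we expect to get this output: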
seq 75 T G -
seq 3185 A R +
seq 3382 A R +
seq 4923 CCT - + **
seq 5252 A W +
seq 7400 T C -
seq 16710 C - - #
seq 18248 T C -
seq 18962 CATA - + **
seq 19566 A M +
The lines marked ** are the new lines, formed by merging the * lines from the first list (the input).
The # line is kept as is because no consecutive position follows it.
I'm stuck with the following logic and don't know how to continue:
use Data::Dumper;

while ( <> ) {
    chomp;
    my @els = split(/\s+/,$_);
    # Process indel
    my @temp = ();
    if ( $els[3] eq "-" ) {
        push @temp, $_;
    }
    # How can I group them appropriately.
    print Dumper \@temp ;
    # And print accordingly to input ordering
}
Answer 0 (score: 5)
This is a variant of a control-break report. This code seems to do the job:
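use strict;
use warnings;

my($prev) = -100;       # position of the previous deletion line (-100 = none)
my($grp0) = $prev;      # start position of the pending group (-100 = none)
my($col2, $col4);       # accumulated bases and strand of the pending group

# Print one output line; a non-positive position means nothing is pending.
sub print_group
{
    my($grp0, $col2, $col3, $col4) = @_;
    printf "seq %-5d %-4s %s %s\n", $grp0, $col2, $col3, $col4
        if ($grp0 > 0);
}

while (<>)
{
    chomp;
    my @els = split(/\s+/,$_);
    if ($els[3] ne "-")
    {
        # Not a deletion: flush any pending group, then print this line.
        print_group($grp0, $col2, "-", $col4);
        print_group($els[1], $els[2], $els[3], $els[4]);
        $prev = -100;
        $grp0 = -100;
        $col2 = "";
        $col4 = "";
    }
    elsif ($els[1] == $prev + 1)
    {
        # Deletion at the next consecutive position: extend the group.
        $grp0 = $prev if $grp0 < 0;
        $prev = $els[1];
        $col2 .= $els[2];
        $col4 = $els[4];
    }
    else
    {
        # Deletion that starts a new group: flush the old one first.
        print_group($grp0, $col2, "-", $col4);
        $prev = $els[1];
        $grp0 = $els[1];
        $col2 = $els[2];
        $col4 = $els[4];
    }
}
# Flush the final group, if any (print_group takes four arguments).
print_group($grp0, $col2, "-", $col4);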
Sample output:
seq 75    T    G -
seq 3185  A    R +
seq 3382  A    R +
seq 4923  CCT  - +
seq 5252  A    W +
seq 7400  T    C -
seq 16710 C    - -
seq 18248 T    C -
seq 18962 CATA - +
seq 19566 A    M +
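This output is more uniform than in the previous version, but the basic logic is much the same as before. The output is always generated by the same function, so everything is as consistent as possible.
Getting the conditions right can be quite tricky; it took several (too many) iterations before this code produced the expected output.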
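Since the conditions are easy to get wrong, it may help to pull the grouping into a function that can be checked on a few rows at a time. The following is only a rough sketch of the same control-break idea, not the code above: merge_indels is a hypothetical helper name, rows are assumed to be array references of the whitespace-split fields, the strand of a merged line is assumed to be the one from its first row, and the trailing * and # markers are dropped.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original answer): merge runs of
# deletion rows (4th field "-") at consecutive positions into one row.
sub merge_indels {
    my @rows = @_;      # each row: [ name, pos, base, alt, strand, ... ]
    my @out;
    my $group;          # deletion run currently being built, if any
    my $last_pos;       # position of the most recent row seen

    for my $row (@rows) {
        my ($name, $pos, $base, $alt, $strand) = @$row;
        if ($alt eq '-' && $group && $pos == $last_pos + 1) {
            # Same run continues: append this base to the open group.
            $group->[2] .= $base;
        }
        else {
            # Control break: flush the open group before handling this row.
            push @out, $group if $group;
            undef $group;
            if ($alt eq '-') {
                # Assumption: the merged row keeps the first row's strand.
                $group = [ $name, $pos, $base, $alt, $strand ];
            }
            else {
                push @out, [ $name, $pos, $base, $alt, $strand ];
            }
        }
        $last_pos = $pos;
    }
    push @out, $group if $group;    # flush the final group, if any
    return @out;
}

# Small demonstration on a few lines taken from the question's input.
my @rows;
push @rows, [ split ' ', $_ ] for (
    'seq 3382 A R +',
    'seq 4923 C - + *',
    'seq 4924 C - + *',
    'seq 4925 T - + *',
    'seq 5252 A W +',
    'seq 16710 C - - #',
);
print join(' ', @$_), "\n" for merge_indels(@rows);
```

On those six lines this should print four lines, with the three * lines collapsed into "seq 4923 CCT - +" and the # line passed through unchanged. Keeping the flush in one place (the else branch plus the final push) is one way to avoid repeating the print call in every branch.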