Perl: summing values over a sliding window in an array

Time: 2012-11-21 00:04:01

Tags: arrays perl sum

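I want to create a moving window over an array of tab-delimited data, organized by the fourth column. For simplicity, I have replaced the irrelevant fields with X and added the header shown in the first line: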

ID-Counts    X      X     Start    X      X     Locations      XXXX
 X-5000     [X]    [X]     0      [X]    [X]      1           [X...]
 X-26       [X]    [X]     1      [X]    [X]      1           [X...]
 X-34       [X]    [X]     1      [X]    [X]      0           [X...]
 X-3        [X]    [X]     20     [X]    [X]      9           [X...]
 X-200      [X]    [X]     30     [X]    [X]      0           [X...]
 X-1        [X]    [X]     40     [X]    [X]      5           [X...]

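The first column contains a numeric ID and a count, joined by a hyphen. The fourth column contains the start sites that I want to use to group the data. The seventh column contains the number of locations by which I need to normalize the counts.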

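The value I want to sum for each row is obtained by splitting the count off the ID and dividing it by the number of locations + 1 (for example, the first row has a value of 2500, the second row 13, and the third row 34). I then want to sum these count/(locations + 1) values over all rows whose Start value falls within a 20-unit window, starting with 0-19, then 1-20, 2-21, and so on. For example, window 0 (fourth-column values in the range 0-19) would sum rows 1-3, window 1 would sum rows 2-4, window 2 would sum only row 4, and so on.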
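The per-row value described above can be sketched as follows. This is only a minimal illustration: the helper name `row_value` is hypothetical, and it assumes that no field contains embedded whitespace, so that splitting on whitespace yields the same columns as splitting on tabs.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (start, value) for one data row,
# where value = count / (locations + 1).
sub row_value {
    my ($line) = @_;

    # split ' ' skips leading whitespace and splits on runs of whitespace.
    my @fields = split ' ', $line;

    my ( undef, $count ) = split /-/, $fields[0], 2;    # "X-5000" -> 5000
    my $start     = $fields[3];                          # fourth column
    my $locations = $fields[6];                          # seventh column

    return ( $start, $count / ( $locations + 1 ) );
}

my ( $start, $value ) =
  row_value(' X-5000     [X]    [X]     0      [X]    [X]      1           [X...]');
print "$start\t$value\n";    # prints "0\t2500"
```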

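My ideal output would then be two columns: the first with the start of each 20-unit window (0, 1, 2, ...), and the second with the sum for each window (for the data above: 2547, 47.3, etc.).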
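One way to read the desired two-column output is sketched below. It assumes the per-row values have already been summed per Start value into a hash, that Start values are non-negative integers, and that a window should be reported for every start from 0 up to the largest Start. The name `window_sums` is made up for this illustration.

```perl
use strict;
use warnings;

# Sketch: sum per-Start values over every window [w, w + width - 1].
# %$sum_for maps a Start value to its summed count / (locations + 1).
sub window_sums {
    my ( $sum_for, $width ) = @_;
    my ($max_start) = sort { $b <=> $a } keys %$sum_for;
    my @rows;
    return @rows unless defined $max_start;

    for my $w ( 0 .. $max_start ) {
        my $total = 0;
        for my $s ( $w .. $w + $width - 1 ) {
            $total += $sum_for->{$s} if exists $sum_for->{$s};
        }
        push @rows, [ $w, $total ];
    }
    return @rows;
}

# Per-Start values taken from the sample data in the question.
my %sum_for = ( 0 => 2500, 1 => 13 + 34, 20 => 3 / 10, 30 => 200, 40 => 1 / 6 );

# One line per window: window start, tab, window sum.
printf "%d\t%g\n", @$_ for window_sums( \%sum_for, 20 );
```

For the sample data, the first three windows come out as 2547, 47.3 and 0.3, matching the description above. The inner loop visits every position in every window, which is simple but does roughly width times as much work as strictly needed.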

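I have written a Perl script that filters and organizes the data into this format, and I would like to add code that does the summing over 20-unit windows. As a Perl newbie, I would greatly appreciate any help and explanation. I am familiar with splitting and doing arithmetic across columns, but I have no idea how to move a window through an array. Thanks.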

3 Answers:

Answer 0: (score: 0)

I hope I understood your question. What do you think of these implementations?

Solution 1: write to the output file every time a full window (20 units) is reached.

# Assuming that you have an array of sums (@sums) and the name of a file ($filename).
# write_file is assumed to come from File::Slurp; append => 1 keeps earlier lines.
my $window_no  = 20;
my $window_sum = 0;
my @window_nos = ();

for my $i (1 .. scalar @sums) {
    push @window_nos, $i;
    $window_sum += $sums[$i - 1];    # accumulate the current value
    if ( $i % $window_no == 0 ) {
        write_file($filename, { append => 1 }, join(',', @window_nos) . "\t" . $window_sum . "\n");
        $window_sum = 0;
        @window_nos = ();
    }
}

# Flush the last, partial window, if any.
if (@window_nos) {
    write_file($filename, { append => 1 }, join(',', @window_nos) . "\t" . $window_sum . "\n");
}

Solution 2: append the values to a scalar variable and write it to the output file once.

# Assuming that you have an array of sums (@sums) and the name of a file ($filename).
# write_file is assumed to come from File::Slurp.
my $window_no     = 20;
my $window_sum    = 0;
my @window_nos    = ();
my $file_contents = '';

for my $i (1 .. scalar @sums) {
    push @window_nos, $i;
    $window_sum += $sums[$i - 1];    # accumulate the current value
    if ($i % $window_no == 0) {
        $file_contents .= join(',', @window_nos) . "\t" . $window_sum . "\n";
        $window_sum = 0;
        @window_nos = ();
    }
}

# Flush the last, partial window, if any.
if (@window_nos) {
    $file_contents .= join(',', @window_nos) . "\t" . $window_sum . "\n";
}

write_file($filename, $file_contents);

Answer 1: (score: 0)

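Take a look at the following code and see whether it does what you want. It could probably be optimized, but I basically do a brute-force search within the 20-unit window above the current Start. Ken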

Output:

0-19:  2547.000000
1-20:  47.300000
20-39:  200.300000
30-49:  200.166667
40-59:  0.166667

Code:

use strict;
use warnings;

#  Hash indexed by Start
#  Each value contains the sum of all Counts / ( Locations + 1 ) for
#     this Start value
my %sum;

while (<DATA>)
{
    #  ignore comments
    next if /^\s*#/;
    my ( $id_count,undef,undef,$start,undef,undef,$numLocations ) = 
       split ' ';
    my ($id,$count) = split '-',$id_count;
    $sum{$start} += $count / ( $numLocations + 1 );
}  

foreach my $start ( sort { $a <=> $b } keys %sum )    # numeric, not string, order
{
   my $totalSum = 0;
   #  Could probably be optimized.
   foreach my $start2 ( $start .. $start+19 )
   {
      $totalSum += $sum{$start2} if defined($sum{$start2});      
   }
   printf "%d-%d:  %f\n", $start, $start+19, $totalSum;
}

__DATA__
#ID-Counts    X      X     Start    X      X     Locations      XXXX
 X-5000     [X]    [X]     0      [X]    [X]      1           [X...]
 X-26       [X]    [X]     1      [X]    [X]      1           [X...]
 X-34       [X]    [X]     1      [X]    [X]      0           [X...]
 X-3        [X]    [X]     20     [X]    [X]      9           [X...]
 X-200      [X]    [X]     30     [X]    [X]      0           [X...]
 X-1        [X]    [X]     40     [X]    [X]      5           [X...]
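The brute-force inner loop above could be replaced with a two-index sliding window over the sorted Start values, so that each per-Start sum is added once and subtracted once. The following is only a sketch of that idea, not part of the original answer; the name `sliding_sums` is made up here, and the hash is assumed to hold the same per-Start sums as `%sum`.

```perl
use strict;
use warnings;

# Sketch: windowed sums over sorted Start values with two moving indexes.
# Returns a list of [ start, total ] pairs, one per distinct Start.
sub sliding_sums {
    my ( $sum, $width ) = @_;
    my @starts = sort { $a <=> $b } keys %$sum;
    my ( $hi, $running ) = ( 0, 0 );
    my @out;

    for my $lo ( 0 .. $#starts ) {

        # Grow the window to the right while Start values still fit.
        while ( $hi < @starts && $starts[$hi] < $starts[$lo] + $width ) {
            $running += $sum->{ $starts[$hi] };
            $hi++;
        }
        push @out, [ $starts[$lo], $running ];

        # Drop the left edge before moving on to the next Start.
        $running -= $sum->{ $starts[$lo] };
    }
    return @out;
}

# Per-Start sums for the sample data.
my %sum = ( 0 => 2500, 1 => 47, 20 => 0.3, 30 => 200, 40 => 1 / 6 );
printf "%d-%d:  %f\n", $_->[0], $_->[0] + 19, $_->[1]
  for sliding_sums( \%sum, 20 );
```

Because a running total is updated by repeated addition and subtraction, results can differ from the brute-force sums by small floating-point rounding errors, which may matter for very long inputs.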

Answer 2: (score: 0)

How about this?

#!/usr/bin/perl -Tw

use strict;
use warnings;
use Data::Dumper;

my %sum_for;

while ( my $line = <DATA> ) {

    if ( $line !~ m{\A [#] }xms ) {

        $line =~ s{\A \s* ( [^-]+ ) - }{$1 }xms;    # separate the ID

        my @columns = split /\s+/, $line;    # assumes no space in values

        my $count = $columns[1];
        my $start = $columns[4];
        my $locat = $columns[7] + 1;

        $sum_for{$start} += $count / $locat;
    }
}

print Dumper( \%sum_for );

my @start_ranges;
{
    my ($max_start) = sort { $b <=> $a } keys %sum_for;

    # max => range count
    #  10 => 1
    #  20 => 2
    #  30 => 2
    #  40 => 3
    #  50 => 3
    #  ...
    my $range_count = $max_start / 20;

    push @start_ranges, [ 0, 19 ];

    for ( 1 .. $range_count ) {

        push @start_ranges, [ map { $_ + 20 } @{ $start_ranges[-1] } ];
    }
}

my %total_for;

for my $range_ra (@start_ranges) {

    my $range_key = sprintf '%d-%d', @{$range_ra};

    for my $start ( $range_ra->[0] .. $range_ra->[1] ) {

        if ( exists $sum_for{$start} ) {

            $total_for{$range_key} += $sum_for{$start};
        }
    }
}

print Dumper( \%total_for );

__DATA__
#ID-Counts    X      X     Start    X      X     Locations      XXXX
 X-5000     [X]    [X]     0      [X]    [X]      1           [X...]
 X-26       [X]    [X]     1      [X]    [X]      1           [X...]
 X-34       [X]    [X]     1      [X]    [X]      0           [X...]
 X-3        [X]    [X]     20     [X]    [X]      9           [X...]
 X-200      [X]    [X]     30     [X]    [X]      0           [X...]
 X-1        [X]    [X]     40     [X]    [X]      5           [X...]

The output is as follows:

$VAR1 = {
          '1' => 47,
          '40' => '0.166666666666667',
          '0' => 2500,
          '30' => 200,
          '20' => '0.3'
        };
$VAR1 = {
          '40-59' => '0.166666666666667',
          '20-39' => '200.3',
          '0-19' => 2547
        };

Working out the start ranges took a bit of thought. Thanks for the interesting question.
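As a possible simplification of the start-range calculation, the consecutive 20-unit ranges can be generated directly from the largest Start value. This is a sketch under the assumption that Start values are non-negative; `start_ranges` is a hypothetical helper name, not something from the answer above.

```perl
use strict;
use warnings;

# Sketch: build consecutive, non-overlapping [low, high] ranges of a given
# width that cover every Start value from 0 up to $max_start.
sub start_ranges {
    my ( $max_start, $width ) = @_;
    my $last = int( $max_start / $width );    # index of the final range
    return map { [ $_ * $width, $_ * $width + $width - 1 ] } 0 .. $last;
}

print join( ', ', map { "$_->[0]-$_->[1]" } start_ranges( 40, 20 ) ), "\n";
# prints "0-19, 20-39, 40-59"
```

Using `int` makes the truncation explicit instead of relying on the range operator to truncate a fractional upper bound.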