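I want to create a moving window over an array of tab-delimited data, organized by the fourth column. For simplicity, I have replaced the irrelevant fields with X and added the header shown in the first line: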
ID-Counts X X Start X X Locations XXXX
X-5000 [X] [X] 0 [X] [X] 1 [X...]
X-26 [X] [X] 1 [X] [X] 1 [X...]
X-34 [X] [X] 1 [X] [X] 0 [X...]
X-3 [X] [X] 20 [X] [X] 9 [X...]
X-200 [X] [X] 30 [X] [X] 0 [X...]
X-1 [X] [X] 40 [X] [X] 5 [X...]
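The first column contains a numeric ID and a count, joined by a hyphen. The fourth column contains the start sites I want to use to group the data. The seventh column contains the number of locations by which I need to normalize the counts.
The value to be summed for each row is obtained by splitting the count off the ID and dividing it by the number of locations + 1 (e.g., 2500 for the first row, 13 for the second, 34 for the third). I then want to sum these count/(locations + 1) values over every row whose Start value falls within a 20-unit window, beginning with 0-19, then 1-20, 2-21, and so on. For example, window 0 (Start values 0-19) would sum rows 1-3, window 1 would sum rows 2-4, window 2 would sum only row 4, and so on.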
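As a minimal sketch of the per-row value described above: the helper name `row_value` is hypothetical, and the code assumes the count follows the first hyphen in column 1 and that Locations is the seventh whitespace-separated field.

```perl
use strict;
use warnings;

# Hypothetical helper: returns count / (locations + 1) for one data line.
sub row_value {
    my ($line) = @_;
    my @fields = split ' ', $line;                    # splits on tabs or spaces
    my (undef, $count) = split /-/, $fields[0], 2;    # "X-5000" -> 5000
    return $count / ( $fields[6] + 1 );               # 7th field = Locations
}

# The first three sample rows from the question
print row_value("X-5000 [X] [X] 0 [X] [X] 1 [X...]"), "\n";
print row_value("X-26 [X] [X] 1 [X] [X] 1 [X...]"), "\n";
print row_value("X-34 [X] [X] 1 [X] [X] 0 [X...]"), "\n";
```

These three calls should reproduce the 2500, 13, and 34 quoted for rows 1-3.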
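My ideal output would then be two columns: the first with the start of each 20-unit window (0, 1, 2, ...), and the second with the sum for each window (for the data above: 2547, 47.3, etc.).
I have written a Perl script that filters and organizes data into this format, and I would like to add code that does the summing over 20-unit windows. As a Perl novice, I would greatly appreciate any help and explanation. I am familiar with splitting and doing arithmetic across columns, but I have no idea how to move a window through an array. Thanks.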
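Answer 0 (score: 0)
I hope I have understood your question. What do you think of these implementations?
Solution 1: write to the output file each time a full window (20 units) is reached.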
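# Assuming that you have an array of per-row values (@sums) and the name of the output file ($filename).
# append_file is assumed to come from File::Slurp; its write_file would overwrite the file on every call.
use File::Slurp qw(append_file);
my $window_no = 20;
my $window_sum = 0;
my @window_nos = ();
for (my $i = 0; $i <= $#sums; $i++) {
push (@window_nos, $i);
$window_sum += $sums[$i]; # accumulate the current window's total
if ( ($i + 1) % $window_no == 0 ) {
append_file($filename, join(',', @window_nos) . "\t" . $window_sum . "\n");
$window_sum = 0;
@window_nos = ();
}
}
# Flush the last, partial window, if there is one
if (@window_nos) {
append_file($filename, join(',', @window_nos) . "\t" . $window_sum . "\n");
}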
Solution 2: append the values to a scalar variable and write it to the output file once.
# Assuming that you have an array of per-row values (@sums) and the name of the output file ($filename).
# write_file is assumed to come from File::Slurp.
use File::Slurp qw(write_file);
my $window_no = 20;
my $window_sum = 0;
my @window_nos = ();
my $file_contents = '';
for (my $i = 0; $i <= $#sums; $i++) {
push (@window_nos, $i);
$window_sum += $sums[$i]; # accumulate the current window's total
if ( ($i + 1) % $window_no == 0 ) {
$file_contents .= join(',', @window_nos) . "\t" . $window_sum . "\n";
$window_sum = 0;
@window_nos = ();
}
}
# Flush the last, partial window, if there is one
if (@window_nos) {
$file_contents .= join(',', @window_nos) . "\t" . $window_sum . "\n";
}
write_file($filename, $file_contents);
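The chunk-and-sum step used in both solutions can also be sketched without any file I/O. This is only an illustration: `chunk_sums` is a hypothetical helper, and like the loops above it groups values by array position into consecutive, non-overlapping blocks rather than by Start value.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: sums consecutive, non-overlapping chunks of $size values.
sub chunk_sums {
    my ($size, @values) = @_;
    my @totals;
    # splice removes up to $size values from the front on each pass;
    # the loop ends once @values is empty.
    while ( my @chunk = splice @values, 0, $size ) {
        push @totals, sum(@chunk);
    }
    return @totals;
}

# Made-up input: 45 values give two full chunks and one partial chunk
my @totals = chunk_sums( 20, 1 .. 45 );
print join( "\t", @totals ), "\n";
```

Building the totals first and writing them out afterwards keeps the grouping logic separate from the output step, which makes it easier to check on its own.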
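Answer 1 (score: 0)
Take a look at the following code and see whether it does what you want.
It could probably be optimized, but I basically do a brute-force search within the 20-unit window starting at the current Start value.
Ken
Output: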
0-19: 2547.000000
1-20: 47.300000
20-39: 200.300000
30-49: 200.166667
40-59: 0.166667
Code:
use strict;
use warnings;
# Hash indexed by Start
# Each value contains the sum of all Counts / ( Locations + 1 ) for
# this Start value
my %sum;
while (<DATA>)
{
# ignore comments
next if /^\s*#/;
my ( $id_count,undef,undef,$start,undef,undef,$numLocations ) =
split ' ';
my ($id,$count) = split '-',$id_count;
$sum{$start} += $count / ( $numLocations + 1 );
}
foreach my $start ( sort { $a <=> $b } keys %sum ) # numeric sort; a plain sort would put 100 before 20
{
my $totalSum = 0;
# Could probably be optimized.
foreach my $start2 ( $start .. $start+19 )
{
$totalSum += $sum{$start2} if defined($sum{$start2});
}
printf "%d-%d: %f\n", $start, $start+19, $totalSum;
}
__DATA__
#ID-Counts X X Start X X Locations XXXX
X-5000 [X] [X] 0 [X] [X] 1 [X...]
X-26 [X] [X] 1 [X] [X] 1 [X...]
X-34 [X] [X] 1 [X] [X] 0 [X...]
X-3 [X] [X] 20 [X] [X] 9 [X...]
X-200 [X] [X] 30 [X] [X] 0 [X...]
X-1 [X] [X] 40 [X] [X] 5 [X...]
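One possible optimization of the brute-force inner loop is to keep a running total and slide it one unit at a time. The sketch below is an assumption-laden illustration, not the answer's own method: `window_totals` is a hypothetical helper, the hard-coded `%sum` stands in for the hash built by the code above, Start values are assumed to be non-negative integers, and the `//` operator requires Perl 5.10 or later. Unlike the code above, it reports a total for every window start from 0 up to the largest Start, as the question asks.

```perl
use strict;
use warnings;

# Stand-in for the per-Start sums built from the sample data.
my %sum = ( 0 => 2500, 1 => 47, 20 => 0.3, 30 => 200, 40 => 1 / 6 );

# Hypothetical helper: returns an array ref where element $w holds the
# total for the window [$w, $w + $width - 1].
sub window_totals {
    my ( $sum, $width ) = @_;
    my ($max) = sort { $b <=> $a } keys %$sum;    # largest Start value
    my @totals;
    my $running = 0;
    # Prime the running total with the first window, 0 .. $width - 1
    $running += $sum->{$_} // 0 for 0 .. $width - 1;
    for my $w ( 0 .. $max ) {
        push @totals, $running;
        # Slide by one: drop Start == $w, add Start == $w + $width
        $running -= $sum->{$w} // 0;
        $running += $sum->{ $w + $width } // 0;
    }
    return \@totals;
}

my $totals = window_totals( \%sum, 20 );
printf "%d\t%g\n", $_, $totals->[$_] for 0 .. $#$totals;
```

Each step costs two hash lookups instead of twenty. Because the total is updated by repeated addition and subtraction, tiny floating-point differences can accumulate, so format the output (as `%g` does here) rather than comparing totals for exact equality.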
Answer 2 (score: 0)
How about this?
#!/usr/bin/perl -Tw
use strict;
use warnings;
use Data::Dumper;
my %sum_for;
while ( my $line = <DATA> ) {
if ( $line !~ m{\A [#] }xms ) {
$line =~ s{\A \s* ( [^-]+ ) - }{$1 }xms; # separate the ID
my @columns = split /\s+/, $line; # assumes no space in values
my $count = $columns[1];
my $start = $columns[4];
my $locat = $columns[7] + 1;
$sum_for{$start} += $count / $locat;
}
}
print Dumper( \%sum_for );
my @start_ranges;
{
my ($max_start) = sort { $b <=> $a } keys %sum_for;
# max => range count
# 10 => 1
# 20 => 2
# 30 => 2
# 40 => 3
# 50 => 3
# ...
my $range_count = $max_start / 20;
push @start_ranges, [ 0, 19 ];
for ( 1 .. $range_count ) {
push @start_ranges, [ map { $_ + 20 } @{ $start_ranges[-1] } ];
}
}
my %total_for;
for my $range_ra (@start_ranges) {
my $range_key = sprintf '%d-%d', @{$range_ra};
for my $start ( $range_ra->[0] .. $range_ra->[1] ) {
if ( exists $sum_for{$start} ) {
$total_for{$range_key} += $sum_for{$start};
}
}
}
print Dumper( \%total_for );
__DATA__
#ID-Counts X X Start X X Locations XXXX
X-5000 [X] [X] 0 [X] [X] 1 [X...]
X-26 [X] [X] 1 [X] [X] 1 [X...]
X-34 [X] [X] 1 [X] [X] 0 [X...]
X-3 [X] [X] 20 [X] [X] 9 [X...]
X-200 [X] [X] 30 [X] [X] 0 [X...]
X-1 [X] [X] 40 [X] [X] 5 [X...]
The output is as follows:
$VAR1 = {
'1' => 47,
'40' => '0.166666666666667',
'0' => 2500,
'30' => 200,
'20' => '0.3'
};
$VAR1 = {
'40-59' => '0.166666666666667',
'20-39' => '200.3',
'0-19' => 2547
};
The part that calculates the start ranges took a bit of thought. Thanks for the interesting question.
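As an alternative sketch for the range calculation, the fixed 20-unit range that contains a given Start can be derived directly with integer division, so the list of ranges never has to be built up front. The helper name `range_key` is hypothetical, the hard-coded `%sum_for` stands in for the hash built above, and Start values are assumed to be non-negative integers.

```perl
use strict;
use warnings;

# Hypothetical helper: label of the fixed-width range that contains $start.
sub range_key {
    my ( $start, $width ) = @_;
    my $low = int( $start / $width ) * $width;    # round down to a multiple of $width
    return sprintf '%d-%d', $low, $low + $width - 1;
}

# Stand-in for the per-Start sums from the sample data
my %sum_for = ( 0 => 2500, 1 => 47, 20 => 0.3, 30 => 200, 40 => 1 / 6 );

my %total_for;
$total_for{ range_key( $_, 20 ) } += $sum_for{$_} for keys %sum_for;

printf "%s\t%g\n", $_, $total_for{$_}
    for sort { ( split /-/, $a )[0] <=> ( split /-/, $b )[0] } keys %total_for;
```

Unlike the loop over `@start_ranges`, this only creates keys for ranges that actually contain data.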