Average column contents based on ranges from another file

Time: 2014-03-07 03:42:38

Tags: arrays perl variables

I have two files: a .bedGraph and a .bed. The .bedGraph contains coordinates + intensity values (chr, start, end, intensity), while the .bed file has only coordinates (chr, start, end).

The bed file was made by merging coordinates that lie at most 1000 bp apart. This reduced the ~66 million reads in the bedGraph to ~300k intervals.
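
The merging step can be sketched roughly as follows. This is only an illustration: the post does not say which tool did the merging, `merge_close` is a hypothetical helper, the gap is assumed to be measured from the end of one interval to the start of the next, and the last input row (12000-12010) is invented to show a split.

```perl
use strict;
use warnings;

# Hypothetical helper: merge [start, end] pairs (same chromosome, sorted by
# start) whenever the next start lies within $max_gap of the current end.
sub merge_close {
    my ( $max_gap, @intervals ) = @_;
    my @merged;
    for my $iv (@intervals) {
        if ( @merged and $iv->[0] - $merged[-1][1] <= $max_gap ) {
            # Close enough: extend the current merged interval if needed.
            $merged[-1][1] = $iv->[1] if $iv->[1] > $merged[-1][1];
        }
        else {
            push @merged, [@$iv];    # copy so the input is left untouched
        }
    }
    return @merged;
}

# First three rows come from the bedGraph sample; the fourth is made up.
my @merged = merge_close( 1000,
    [ 10037, 10038 ], [ 10393, 10428 ], [ 10540, 10546 ], [ 12000, 12010 ] );
print join( "\t", 'chr1', @$_ ), "\n" for @merged;
```

With these inputs the first three rows collapse into one interval, and the invented fourth row starts a new one because it lies more than 1000 bp past the previous end.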

So, my bedGraph looks like this:

chr1    10037   10038   0.413963 
chr1    10393   10428   0.827926 
chr1    10540   10546   0.413963 
chr1    10610   10615   0.413963 
chr1    11281   11282   0.413963 

and my bed looks like this:

chr1    10037   56175
chr1    57265   58983
chr1    60022   64415
chr1    65485   74471
chr1    76305   177390
chr1    227433  267689
chr1    317665  384576
chr1    386108  417753
chr1    420243  423692
chr1    425613  426755

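What I want to do now is add a column to the bed file that holds the average intensity of the reads within each region (taken from the .bedGraph file), i.e.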

.bedGraph
chr   1   10   1.23413    |
chr   11  18   0.234      | this <<----------
chr   20  24   4.231      |                 |
chr   57  100  2.123413   |                 |
chr   101 123  2.333                        |
                                            |
            I want to add this              | 
                |                           |
                |                           |
                V                           |
.bed                                        |
chr   1   100  (average of ------------------
chr   110 400  (same as above for another interval)

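So... I have written a script. My idea is to take the coordinates from the .bed file, store all the intensity values from the bedGraph file that fall within that interval, and then print the original bed line plus the average intensity... easy so far... Here is my code: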

#! /usr/bin/perl
use strict;
use warnings;
use List::Util qw(sum);

############################
## call with
## perl average_intensities.pl IN1.bed IN2.bedGraph > OUT.bedGraph
############################

my ($file1, $file2) = @ARGV;

if (not defined $file1) {
    die "Need name INPUT 1 file (bed)\n";
}

if (not defined $file2) {
    die "Need name INPUT 2 file (bedGraph)\n";
}

#declare stuff for first file
my @coords1;
my $chr1;
my $start1;
my $end1;

my @coords2;
my $chr2;
my $start2;
my $end2;
my $int;
my @intensity;
my $av_int;

print "about to open files\n"; ## <<-- this doesn't even print :(

open (IN1, '<', $file1) or die "Could not open $file1: $! \n";
open (IN2, '<' ,$file2) or die "Could not open $file2: $! \n";

#parse first file and get the first coordinates
while(<IN1>){
    chomp $_;

    @coords1 = split "\t", $_;
    $chr1 = $coords1[0];
    $start1 = $coords1[1];
    $end1 = $coords1[2];

    #parse second file and get the coordinates + intensities
    while(<IN2>){
        chomp $_;
        @coords2 = split "\t", $_;
        $chr2 = $coords2[0];
        $start2 = $coords2[1];
        $end2 = $coords2[2];
        $int = $coords2[3];
        if ($chr1 eq $chr2){

            # if the coordinates on bedGraph are still < than those on bed save the average intensity
            if($start1 <= $end2){
                push @intensity, $int;
            } else {
                if (scalar @intensity >0){
                    $av_int = sum(@intensity)/(scalar @intensity);
                    print join ("\t", $chr1, $start1, $start2, $av_int),"\n";
                    @intensity = ();
                    last;
                }
            }
        } else {
            next;
        }
    }
}
close(IN1);
close(IN2);

However, when I try to run it, it tells me

Use of uninitialized value $start2 in numeric le (<=) at average_intensities.pl line 49, <IN2> line 1.
Use of uninitialized value $start1 in numeric le (<=) at average_intensities.pl line 49, <IN2> line 1.

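(...and so on for every line in the file). I can't understand why, because I did declare both variables. I'm not sure what in the code is causing this at that point... Any suggestions would be great! Thanks :))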

###########################################

Updated code below. I corrected the code as Kenosis suggested and slightly modified his script:

open IN1, '<', $file1 or die "Could not open $file1: $!\n";
open IN2, '<', $file2 or die "Could not open $file2: $!\n";

my %bedGraphHoA;


while (<IN1>) {
    my @cols = split;
    push @{ $bedGraphHoA{ $cols[0] } }, [ @cols[ 1 .. 3 ] ];
}

close IN1;

while (<IN2>) {
    my ( @bedGaphLines, @bedGaphVals );
    my @cols = split;
    if ( exists $bedGraphHoA{ $cols[0] } ) {

        for my $elements ( @{ $bedGraphHoA{ $cols[0] } } ) {

            if ( $elements->[0] >= $cols[1] and $elements->[1] <= $cols[2] ) {
                push @bedGaphLines, $elements;
                push @bedGaphVals,  $elements->[2];
            }
        }
        if (scalar @bedGaphVals > 0){
            my $mean = ( sum @bedGaphVals ) / @bedGaphVals;
            print join( "\t", $cols[0],$cols[1], $cols[2], $mean ), "\n";
        }

    }
}

close IN2;

I tested it on a subset of the real data and it seems to work.

1 Answer:

Answer 0: (score: 1)

You have:

@coords1 = split $line1, "\t";

when you meant:

@coords1 = split "\t", $line1;

Same thing later on. You have:

@coords2 = split $line2, "\t";

when you meant:

@coords2 = split "\t", $line2;

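$start1 and $start2 get their values from the results of those splits, @coords1 and @coords2, respectively. With the arguments swapped, split treats the line as the pattern and "\t" as the string to split, so the resulting array has no second element and both variables end up undefined.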
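
A small sketch of the difference, using one line from the bed sample (the variable names @good and @bad are made up for illustration):

```perl
use strict;
use warnings;

my $line = "chr1\t10037\t56175";

# Correct order: pattern first, then the string to split.
my @good = split "\t", $line;

# Swapped order: $line becomes the pattern and "\t" the string being split,
# so the result holds a single element and index 1 is undefined.
my @bad = split $line, "\t";

print scalar(@good), " fields, start = $good[1]\n";
print scalar(@bad), " field, start is ",
    ( defined $bad[1] ? "defined" : "undefined" ), "\n";
```

Declaring a variable with my only creates it; it stays undef until something defined is assigned to it, which is why the warning appears even though $start1 and $start2 were declared.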

Perhaps the following will provide some direction for your efforts:

use strict;
use warnings;
use List::Util qw/sum/;

my %bedGraphHoA;

open my $bedGraphFH, '<', 'bedGraph.txt' or die $!;

while (<$bedGraphFH>) {
    my @cols = split;
    push @{ $bedGraphHoA{ $cols[0] } }, [ @cols[ 1 .. 3 ] ];
}

close $bedGraphFH;

open my $bedFH, '<', 'bed.txt' or die $!;

while (<$bedFH>) {
    my ( @bedGaphLines, @bedGaphVals );
    my @cols = split;
    if ( exists $bedGraphHoA{ $cols[0] } ) {
        for my $elements ( @{ $bedGraphHoA{ $cols[0] } } ) {
            if ( $elements->[0] >= $cols[1] and $elements->[1] <= $cols[2] ) {
                push @bedGaphLines, $elements;
                push @bedGaphVals,  $elements->[2];
            }
        }

    }
    next unless @bedGaphVals;    # no bedGraph rows in this interval: avoid dividing by zero
    my $mean = ( sum @bedGaphVals ) / @bedGaphVals;
    print join( "\t", $cols[0], @{ $bedGaphLines[$_] }, $mean ), "\n"
      for 0 .. $#bedGaphLines;
}

close $bedFH;

__END__

bedGraph.txt:
chr   1   10   1.23413
chr   11  18   0.234
chr   20  24   4.231
chr   57  100  2.123413
chr   101 123  2.333
chr   120 123  7.555
chr   150 200  1.275

bed.txt:
chr   1   100
chr   110 400

Output:
chr 1   10  1.23413 1.95563575
chr 11  18  0.234   1.95563575
chr 20  24  4.231   1.95563575
chr 57  100 2.123413    1.95563575
chr 120 123 7.555   4.415
chr 150 200 1.275   4.415
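
If one line per bed interval is wanted instead (the original bed coordinates plus the mean), a variant along these lines might do. It is a sketch, not a drop-in replacement: mean_in_interval is a hypothetical helper, the sample rows are hard-coded rather than read from files, "within" is assumed to mean fully contained in the interval, and the // operator needs Perl 5.10 or later.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: mean intensity of the rows lying entirely inside
# [$start, $end]. Returns undef when no row qualifies.
sub mean_in_interval {
    my ( $rows, $start, $end ) = @_;
    my @vals = map { $_->[2] }
               grep { $_->[0] >= $start and $_->[1] <= $end } @$rows;
    return @vals ? sum(@vals) / @vals : undef;
}

# Sample bedGraph rows from above, keyed by chromosome: [start, end, intensity].
my %bedGraphHoA = (
    chr => [
        [ 1,   10,  1.23413 ],
        [ 11,  18,  0.234 ],
        [ 20,  24,  4.231 ],
        [ 57,  100, 2.123413 ],
        [ 101, 123, 2.333 ],
        [ 120, 123, 7.555 ],
        [ 150, 200, 1.275 ],
    ],
);

# Sample bed intervals from above.
for my $bed ( [ 'chr', 1, 100 ], [ 'chr', 110, 400 ] ) {
    my ( $chr, $start, $end ) = @$bed;
    my $mean = mean_in_interval( $bedGraphHoA{$chr} // [], $start, $end );
    print join( "\t", $chr, $start, $end, $mean // 'NA' ), "\n";
}
```

The means should agree with the last column of the output above. Intervals with no matching rows print NA here instead of triggering a division by zero.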