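I have two files: a .bedGraph and a .bed. The .bedGraph contains coordinates plus an intensity value (chr, start, end, intensity), while the .bed file contains only coordinates (chr, start, end).
The .bed file was made by merging coordinates that lie at most 1000 bp apart. This reduced the ~66 million reads in the bedGraph to ~300k intervals.
So, my bedGraph looks like this: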
chr1 10037 10038 0.413963
chr1 10393 10428 0.827926
chr1 10540 10546 0.413963
chr1 10610 10615 0.413963
chr1 11281 11282 0.413963
And my .bed file looks like this:
chr1 10037 56175
chr1 57265 58983
chr1 60022 64415
chr1 65485 74471
chr1 76305 177390
chr1 227433 267689
chr1 317665 384576
chr1 386108 417753
chr1 420243 423692
chr1 425613 426755
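What I want to do now is add a column to the .bed file that holds the average intensity of the reads falling within each region (taken from the .bedGraph file), i.e.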
.bedGraph
chr 1 10 1.23413 |
chr 11 18 0.234 | this <<----------
chr 20 24 4.231 | |
chr 57 100 2.123413 | |
chr 101 123 2.333 |
|
I want to add this |
| |
| |
V |
.bed |
chr 1 100 (average of ------------------
chr 110 400 (same as above for another interval)
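To make the target value concrete: assuming a plain (unweighted) mean over the bedGraph records that lie entirely within the first .bed interval (1 to 100) of the toy example above, the number to append would be:

```latex
\bar{I}_{[1,100]} = \frac{1.23413 + 0.234 + 4.231 + 2.123413}{4} = \frac{7.822543}{4} = 1.95563575
```

A length-weighted mean would give a different number; the unweighted form is an assumption that matches the `sum(@intensity)/(scalar @intensity)` used in the script.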
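So... I have written a script. My idea is to take the coordinates from the .bed file, collect all intensity values from the bedGraph file that fall within that interval, and then print the original .bed line plus the average intensity. Easy so far. Here is my code: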
#! /usr/bin/perl
use strict;
use warnings;
use List::Util qw(sum);
############################
## call with
## perl average_intensities.pl IN1.bed IN2.bedGraph > OUT.bedGraph
############################
my ($file1, $file2) = @ARGV;
if (not defined $file1) {
die "Need name INPUT 1 file (bed)\n";
}
if (not defined $file2) {
die "Need name INPUT 2 file (bedGraph)\n";
}
#declare stuff for first file
my @coords1;
my $chr1;
my $start1;
my $end1;
my @coords2;
my $chr2;
my $start2;
my $end2;
my $int;
my @intensity;
my $av_int;
print "about to open files\n"; ## <<-- this doesn't even print :(
open (IN1, '<', $file1) or die "Could not open $file1: $! \n";
open (IN2, '<' ,$file2) or die "Could not open $file2: $! \n";
#parse first file and get the first coordinates
while(<IN1>){
chomp $_;
@coords1 = split "\t", $_;
$chr1 = $coords1[0];
$start1 = $coords1[1];
$end1 = $coords1[2];
#parse second file and get the coordinates + intensities
while(<IN2>){
chomp $_;
@coords2 = split "\t", $_;
$chr2 = $coords2[0];
$start2 = $coords2[1];
$end2 = $coords2[2];
$int = $coords2[3];
if ($chr1 eq $chr2){
# if the coordinates on bedGraph are still < than those on bed save the average intensity
if($start1 <= $end2){
push @intensity, $int;
} else {
if (scalar @intensity >0){
$av_int = sum(@intensity)/(scalar @intensity);
print join ("\t", $chr1, $start1, $start2, $av_int),"\n";
@intensity = ();
last;
}
}
} else {
next;
}
}
}
close(IN1);
close(IN2);
However, when I try to run it, it tells me:
Use of uninitialized value $start2 in numeric le (<=) at average_intensities.pl line 49, <IN2> line 1.
Use of uninitialized value $start1 in numeric le (<=) at average_intensities.pl line 49, <IN2> line 1.
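(...and so on for every line in the file.) I can't understand why, because I did declare both variables. I'm not sure what is wrong with the code at that point. Any suggestion would be great! Thanks :))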
###########################################
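Updated code below. I corrected the code following Kenosis's suggestion and slightly modified his script:
# Three-argument open avoids surprises with unusual file names.
# Note: IN1 is loaded into %bedGraphHoA, so here $file1 must be the bedGraph and $file2 the bed.
open IN1, '<', $file1 or die "Could not open $file1: $! \n";
open IN2, '<', $file2 or die "Could not open $file2: $! \n";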
my %bedGraphHoA;
while (<IN1>) {
my @cols = split;
push @{ $bedGraphHoA{ $cols[0] } }, [ @cols[ 1 .. 3 ] ];
}
close IN1;
while (<IN2>) {
my ( @bedGaphLines, @bedGaphVals );
my @cols = split;
if ( exists $bedGraphHoA{ $cols[0] } ) {
for my $elements ( @{ $bedGraphHoA{ $cols[0] } } ) {
if ( $elements->[0] >= $cols[1] and $elements->[1] <= $cols[2] ) {
push @bedGaphLines, $elements;
push @bedGaphVals, $elements->[2];
}
}
if (scalar @bedGaphVals > 0){
my $mean = ( sum @bedGaphVals ) / @bedGaphVals;
print join( "\t", $cols[0],$cols[1], $cols[2], $mean ), "\n";
}
}
}
close IN2;
I tested it on a subset of the real data and it seems to work.
Answer 0 (score: 1)
You have:
@coords1 = split $line1, "\t";
when you meant:
@coords1 = split "\t", $line1;
The same applies further down, where you have:
@coords2 = split $line2, "\t";
when you meant:
@coords2 = split "\t", $line2;
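$start1 and $start2 take their values from @coords1 and @coords2, which hold the results of split. With the arguments reversed, split treats the line as the pattern and "\t" as the string to split, so the resulting list typically has a single element and indices 1 and 2 are undefined, hence the warnings.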
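The effect of the argument order can be seen in isolation. The following is a minimal sketch; the helper names `split_correct` and `split_reversed` are made up for illustration, and the sample line is taken from the bedGraph excerpt in the question (with tabs assumed as separators):

```perl
use strict;
use warnings;

# split takes the pattern first and the string to split second.
sub split_correct {
    my ($line) = @_;
    return split /\t/, $line;
}

# Reversed arguments: the line becomes the pattern and "\t" the string.
# The pattern does not match inside "\t", so the whole "\t" comes back
# as the only element of the list.
sub split_reversed {
    my ($line) = @_;
    return split $line, "\t";
}

my $line = "chr1\t10037\t10038\t0.413963";
my @good = split_correct($line);
my @bad  = split_reversed($line);

printf "correct:  %d fields, start = %s\n", scalar @good, $good[1];
printf "reversed: %d field(s), start is %s\n",
    scalar @bad, defined $bad[1] ? $bad[1] : 'undef';
```

With the reversed call, index 1 of the result is undefined, which is what the "Use of uninitialized value" warnings point to.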
Perhaps the following will give some direction to your efforts:
use strict;
use warnings;
use List::Util qw/sum/;
my %bedGraphHoA;
open my $bedGraphFH, '<', 'bedGraph.txt' or die $!;
while (<$bedGraphFH>) {
my @cols = split;
push @{ $bedGraphHoA{ $cols[0] } }, [ @cols[ 1 .. 3 ] ];
}
close $bedGraphFH;
open my $bedFH, '<', 'bed.txt' or die $!;
while (<$bedFH>) {
my ( @bedGaphLines, @bedGaphVals );
my @cols = split;
if ( exists $bedGraphHoA{ $cols[0] } ) {
for my $elements ( @{ $bedGraphHoA{ $cols[0] } } ) {
if ( $elements->[0] >= $cols[1] and $elements->[1] <= $cols[2] ) {
push @bedGaphLines, $elements;
push @bedGaphVals, $elements->[2];
}
}
}
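# Skip bed intervals that contain no bedGraph records (avoids division by zero).
next unless @bedGaphVals;
my $mean = ( sum @bedGaphVals ) / @bedGaphVals;
print join( "\t", $cols[0], @{ $bedGaphLines[$_] }, $mean ), "\n"
for 0 .. $#bedGaphLines;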
}
close $bedFH;
__END__
bedGraph.txt:
chr 1 10 1.23413
chr 11 18 0.234
chr 20 24 4.231
chr 57 100 2.123413
chr 101 123 2.333
chr 120 123 7.555
chr 150 200 1.275
bed.txt:
chr 1 100
chr 110 400
Output:
chr 1 10 1.23413 1.95563575
chr 11 18 0.234 1.95563575
chr 20 24 4.231 1.95563575
chr 57 100 2.123413 1.95563575
chr 120 123 7.555 4.415
chr 150 200 1.275 4.415
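If the goal is one output row per .bed interval (coordinates plus mean), as described in the question, the same containment test can be wrapped in two small functions. This is only a sketch under stated assumptions: `index_bedgraph` and `mean_in_interval` are hypothetical helper names, the data are the sample lines above held in arrays rather than read from files, the mean is unweighted, and printing `NA` for intervals without any contained record is my own choice.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: group bedGraph records by chromosome.
sub index_bedgraph {
    my (@lines) = @_;
    my %by_chr;
    for my $line (@lines) {
        my ( $chr, $start, $end, $value ) = split ' ', $line;
        next unless defined $value;    # skip blank or malformed lines
        push @{ $by_chr{$chr} }, [ $start, $end, $value ];
    }
    return \%by_chr;
}

# Hypothetical helper: unweighted mean of the records that lie entirely
# inside [start, end]; returns undef when no record qualifies.
sub mean_in_interval {
    my ( $by_chr, $chr, $start, $end ) = @_;
    my @vals = map { $_->[2] }
        grep { $_->[0] >= $start && $_->[1] <= $end }
        @{ $by_chr->{$chr} || [] };
    return @vals ? sum(@vals) / @vals : undef;
}

# Sample data copied from bedGraph.txt and bed.txt above.
my @bedgraph = (
    "chr 1 10 1.23413",  "chr 11 18 0.234",
    "chr 20 24 4.231",   "chr 57 100 2.123413",
    "chr 101 123 2.333", "chr 120 123 7.555",
    "chr 150 200 1.275",
);
my @bed = ( "chr 1 100", "chr 110 400" );

my $index = index_bedgraph(@bedgraph);
for my $line (@bed) {
    my ( $chr, $start, $end ) = split ' ', $line;
    my $mean = mean_in_interval( $index, $chr, $start, $end );
    print join( "\t", $chr, $start, $end, defined $mean ? $mean : 'NA' ), "\n";
}
```

On the sample data this should print two rows whose last column matches the means shown in the Output above. Note that every interval scans all records on its chromosome, so for tens of millions of bedGraph lines a sorted sweep or a dedicated interval tool may be a better fit.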