我有一个表征基因组区域的文件,如下所示:
chrom chromStart chromEnd PGB
chr1 12874 28371 2
chr1 15765 21765 1
chr1 15795 28371 2
chr1 18759 24759 1
chr1 28370 34961 1
chr3 233278 240325 1
chr3 239279 440831 2
chr3 356365 362365 1
基本上PGB描述了基因组区域的类别,其特征在于其染色体编号(chrom),起始(chromStart)和结束(chromEnd)坐标。
我希望折叠重叠区域,使得PGB = 1和2的重叠区域属于新类别,PGB = 3.输出为:
chrom chromStart chromEnd PGB
chr1 12874 15764 2
chr1 15765 24759 3
chr1 24760 28369 2
chr1 28370 28371 3
chr1 28372 34961 1
chr3 233278 239278 1
chr3 239279 240325 3
chr3 240326 356364 2
chr3 356365 440831 3
基本上我希望获得一个报告唯一区域的输出文件。有两个标准。
首先,如果行之间的PGB(第4列)相同,则合并范围。例如。
chrom chromStart chromEnd PGB
chr1 1 10 1
chr1 5 15 1
输出
chrom chromStart chromEnd PGB
chr1 1 15 1
其次,如果PGB在行之间不同,则chr(第1列)相同,范围重叠(col2和3),报告重叠范围为PGB = 3以及各个类别唯一的范围。
例如。
chrom chromStart chromEnd PGB
chr1 30 100 1
chr1 50 150 2
输出
chrom chromStart chromEnd PGB
chr1 30 49 1
chr1 50 100 3
chr1 101 150 2
我希望能够更好地说明问题。
答案 0 :(得分:1)
我创建了一个我认为可以实现这一目标的脚本。
use List::Util qw(max);
use strict;
use warnings;
# Skip Header row
<DATA>;
# With only 60k rows, going to just slurp all the data
my %chroms;
while (<DATA>) {
chomp;
my ($chrom, $start, $end, $pgb) = split ' ';
# Basic Data Validation.
warn "chromStart is NaN, '$start'" if $start =~ /\D/;
warn "chromEnd is NaN, '$end'" if $end =~ /\D/;
warn "range not ordered, '$start' to '$end'" if $start > $end;
warn "PGB only can be 1 or 2, '$pgb'" if ($pgb ne '1' && $pgb ne '2');
push @{$chroms{$chrom}{$pgb}}, [$start, $end];
}
print "chrom chromStart chromEnd PGB\n";
# Process each Chrom
for my $chrom (sort keys %chroms) {
for my $pgb (sort keys %{$chroms{$chrom}}) {
my $ranges = $chroms{$chrom}{$pgb};
# Sort Ranges
@$ranges = sort {$a->[0] <=> $b->[0] || $a->[1] <=> $b->[1]} @$ranges;
# Combine overlapping and continguous ranges.
# - Note because we're dealing with integer ranges, 1-4 & 5-8 are contiguous
for my $i (0..$#$ranges-1) {
if ($ranges->[$i][1] >= $ranges->[$i+1][0] - 1) {
$ranges->[$i+1][0] = $ranges->[$i][0];
$ranges->[$i+1][1] = max($ranges->[$i][1], $ranges->[$i+1][1]);
$ranges->[$i] = undef;
}
}
@$ranges = grep {$_} @$ranges;
}
# Create pgb=3 for overlaps.
# - Save old ranges into aliases, and then start fresh
my $pgb1array = $chroms{$chrom}{1};
my $pgb2array = $chroms{$chrom}{2};
my @ranges;
# Always working on the first range in each array, until one of the arrays is empty
while (@$pgb1array && @$pgb2array) {
# Aliases to first element
my $pgb1 = $pgb1array->[0];
my $pgb2 = $pgb2array->[0];
# PGB1 < PGB2
if ($pgb1->[1] < $pgb2->[0]) {
push @ranges, [@{shift @$pgb1array}, 1]
# PGB2 < PGB1
} elsif ($pgb2->[1] < $pgb1->[0]) {
push @ranges, [@{shift @$pgb2array}, 2]
# There's overlap for all rest
} else {
# PGB1start < PGB2start
if ($pgb1->[0] < $pgb2->[0]) {
push @ranges, [$pgb1->[0], $pgb2->[0]-1, 1];
$pgb1->[0] = $pgb2->[0];
# PGB2start < PGB1start
} elsif ($pgb2->[0] < $pgb1->[0]) {
push @ranges, [$pgb2->[0], $pgb1->[0]-1, 2];
$pgb2->[0] = $pgb1->[0];
}
# (Starts are equal now)
# PGB1end < PGB2end
if ($pgb1->[1] < $pgb2->[1]) {
$pgb2->[0] = $pgb1->[1] + 1;
push @ranges, [@{shift @$pgb1array}, 3];
# PGB2end < PGB1end
} elsif ($pgb2->[1] < $pgb1->[1]) {
$pgb1->[0] = $pgb2->[1] + 1;
push @ranges, [@{shift @$pgb2array}, 3];
# PGB2end = PGB1end
} else {
push @ranges, [@$pgb1, 3];
shift @$pgb1array;
shift @$pgb2array;
}
}
}
# Append whichever is left over
push @ranges, map {$_->[2] = 1; $_} @$pgb1array;
push @ranges, map {$_->[2] = 2; $_} @$pgb2array;
printf "%-8s %-11s %-11s %s\n", $chrom, @$_ for (@ranges);
}
1;
__DATA__
chrom chromStart chromEnd PGB
chr1 12871 12873 2
chr1 12874 28371 2
chr1 15765 21765 1
chr1 15795 28371 2
chr1 18759 24759 1
chr1 28370 34961 1
chr3 233278 240325 1
chr3 239279 440831 2
chr3 356365 362365 1
结果:
chrom chromStart chromEnd PGB
chr1 12871 15764 2
chr1 15765 24759 3
chr1 24760 28369 2
chr1 28370 28371 3
chr1 28372 34961 1
chr3 233278 239278 1
chr3 239279 240325 3
chr3 240326 356364 2
chr3 356365 362365 3
chr3 362366 440831 2
注意:我创建此项主要是为了锻炼自己,但如果你以任何方式使用它,请告诉我。