Merging intervals

Date: 2014-09-11 01:26:12

Tags: perl awk intervals bioinformatics

I work with biological data (copy number variations) represented as intervals in a tab-delimited file:

File 1

Columns: Chromosome, Start, End, Annotation

1   1   10  A
1   3   12  B
1   7   15  C
1   20  30  D
1   35  45  E
1   37  45  F
1   50  60  G
1   50  65  H

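To consolidate overlapping events, I intersected the file with itself, using a 50% reciprocal overlap as my criterion. I used intersectBed from Bedtools (http://bedtools.readthedocs.org/en/latest/content/tools/intersect.html), which produced File 2:

 $ intersectBed -a File1 -b File1 -loj -f 0.50 -r > File2

File 2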

Columns: Chromosome, Start, End, Annotation , Chromosome, Start, End, Annotation

    1       1       10      A       1       1       10      A
    1       1       10      A       1       3       12      B
    1       3       12      B       1       1       10      A
    1       3       12      B       1       3       12      B
    1       3       12      B       1       7       15      C
    1       7       15      C       1       3       12      B
    1       7       15      C       1       7       15      C
    1       20      30      D       1       20      30      D
    1       35      45      E       1       35      45      E
    1       35      45      E       1       37      45      F
    1       37      45      F       1       35      45      E
    1       37      45      F       1       37      45      F
    1       50      60      G       1       50      60      G
    1       50      60      G       1       50      65      H
    1       50      65      H       1       50      60      G
    1       50      65      H       1       50      65      H

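Events A and C each overlap event B, events E and F overlap each other, as do G and H, and event D has no overlapping partner. Knowing this, the merged CNV list should be:

File 3

1   1   15  A,B,C
1   20  30  D
1   35  45  E,F
1   50  65  G,H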

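I tried the merge option of the HDCNV Java software (http://daleylab.org/lab/?page_id=125), but the output is not what I need. I also tried writing some Perl code, but I am a beginner, so this problem is currently beyond me.

I would be very grateful if you could help me with some nice Perl or awk code that takes File 2 as input and outputs File 3.

Thanks in advance.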

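4 Answers:

Answer 0: (score: 0)

I assume the columns have the following meanings:

  • col 1: chromosome number
  • col 2: start position of the genomic region
  • col 3: end position of the genomic region
  • col 4: text identifier

This script finds overlaps between the given regions. It assumes the input is sorted by col 1 and then by col 2. I have put the input text in a string, but you will probably want to read it from a file (and write the output to a file). I'll leave it to you to work out how to do that: it is simple, and there is plenty of documentation on the Perl website.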

#!/usr/bin/perl
use strict;
use warnings;
use feature ":5.10";    # enables say

my $text = '1   1   10  A
1   3   12  B
1   7   15  C
1   20  30  D
1   35  45  E
1   37  45  F
1   50  60  G
1   50  65  H
2   1   10  I
2   3   12  J
2   7   15  K
2   20  30  L
2   35  45  M
2   37  45  N
2   50  60  O
2   50  65  P
';

# The data is whitespace-delimited (tab-separated in the original file).
# Split the text into lines, then split each line into its fields.
my @lines = map { [ split ' ', $_ ] } split /\n/, $text;

my $col_0 = 1;
my $min = 0;
my $max = 0;
my @range;

foreach (@lines) {
    # the chromosome has changed (compared as a string, so that
    # names like X and Y also work) or the start is greater than
    # the current maximum: start a new interval
    if ($col_0 ne $_->[0] || $_->[1] > $max) {
        if (@range) {
            # print the finished range, then start a new one
            say join("\t", $col_0, $min, $max, join(", ", @range));
        }
        @range = ( $_->[3] );
        # set the min and max
        $col_0 = $_->[0];
        $min = $_->[1];
        $max = $_->[2];
    }
    else {
    # the minimum is lower than our current maximum.
    # check whether the max is greater than our current
    # max and increase it if so. Add the letter to the
    # current range.
        if ($_->[2] > $max) {
            $max = $_->[2];
        }
        push @range, $_->[3];
    }
}
# print out the last line
say join("\t", $col_0, $min, $max, join(", ", @range) );

Output:

1   1   15  A, B, C
1   20  30  D
1   35  45  E, F
1   50  65  G, H
2   1   15  I, J, K
2   20  30  L
2   35  45  M, N
2   50  65  O, P

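I have only handled simple overlaps here; this does not apply the 50% overlap rule. Using this script as a starting point, you can work out how to do that. We're not doing your PhD for you! ;)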
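As a starting point for the 50% rule, here is a minimal sketch of the reciprocal check itself. The function name `reciprocal_overlap` is made up for illustration, and the arithmetic assumes BED-style coordinates (length = end - start) and my reading of `-f 0.50 -r`, namely that the shared span must cover at least half of both intervals.

```shell
# Hypothetical helper (not part of the script above): prints 1 if two
# intervals overlap by at least FRAC of BOTH lengths, otherwise 0.
# Assumes BED-style coordinates, so length = end - start.
reciprocal_overlap() {
    awk -v s1="$1" -v e1="$2" -v s2="$3" -v e2="$4" -v frac="${5:-0.5}" '
    BEGIN {
        s1 += 0; e1 += 0; s2 += 0; e2 += 0   # force numeric comparison
        lo = (s1 > s2) ? s1 : s2             # later of the two starts
        hi = (e1 < e2) ? e1 : e2             # earlier of the two ends
        ov = hi - lo                         # shared length (<= 0: none)
        ok = (ov > 0 && ov >= frac * (e1 - s1) && ov >= frac * (e2 - s2))
        print ok
    }'
}

reciprocal_overlap 1 10 3 12    # A vs B: 7 of 9 bases shared, prints 1
reciprocal_overlap 1 10 7 15    # A vs C: 3 of 9 bases shared, prints 0
```

Note that A and C fail the check directly and only end up in the same group because both pass it against B, so any merge built on this test has to follow such chains.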

Answer 1: (score: 0)

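awk '
# A new chromosome, or a start beyond the current end, closes the group.
NR > 1 && ($1 != chrom || $2 > end) {
    print chrom, start, end, pair
    pair = ""
}
# First record of a group: remember its chromosome and span.
pair == "" { chrom = $1; start = $2; end = $3; pair = $4; next }
{
    if ($3 > end) end = $3      # only ever extend the end
    pair = pair "," $4
}
END {
    if (NR) print chrom, start, end, pair
}' file

Output: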

1  1 15 A,B,C
1 20 30 D
1 35 45 E,F
1 50 65 G,H

Answer 2: (score: 0)

Assuming sorted data, the following stub should handle merging the records.

Just modify it to load from and output to a file.

use strict;
use warnings;

use List::Util qw(min max);

my $last;

while (<DATA>) {
    my @fields = split;

    if ( !$last ) {
        $last = \@fields;

    } elsif ( $last->[0] eq $fields[0] && $last->[2] > $fields[1] ) {
        $last->[1] = min( $last->[1], $fields[1] );
        $last->[2] = max( $last->[2], $fields[2] );
        $last->[3] .= ",$fields[3]";

    } else {
        print join( "\t", @$last ), "\n";
        $last = \@fields;
    }
}

print join( "\t", @$last ), "\n";

__DATA__
1   1   10  A
1   3   12  B
1   7   15  C
1   20  30  D
1   35  45  E
1   37  45  F
1   50  60  G
1   50  65  H
2   1   10  I
2   3   12  J
2   7   15  K
2   20  30  L
2   35  45  M
2   37  45  N
2   50  60  O
2   50  65  P

Output:

1   1   15  A,B,C
1   20  30  D
1   35  45  E,F
1   50  65  G,H
2   1   15  I,J,K
2   20  30  L
2   35  45  M,N
2   50  65  O,P
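The script relies on the input being sorted by chromosome and then by start. If that is not guaranteed, something along these lines could be run first; the wrapper name `sort_intervals` is only illustrative.

```shell
# Hypothetical wrapper: sort by chromosome (as text), then by start and
# end (numerically), so that overlapping records become adjacent.
sort_intervals() {
    sort -k1,1 -k2,2n -k3,3n "$@"
}

# Usage (file names are placeholders): sort_intervals File1 > File1.sorted
```

The text sort on column 1 places chromosome 10 before chromosome 2, which is harmless here because the merge only needs records of the same chromosome to be grouped together.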

Answer 3: (score: 0)

My take:

awk -F "\t" -v OFS="\t" '
  function emit() {print chrom, start, end, annot}
  $1 == chrom && ((start<=$2 && $2<=end) || (start<=$3 && $3<=end)) {
    annot = annot "," $4
    if ($2 < start) start = $2
    if ($3 > end) end = $3
    next
  }
  chrom {emit()}
  {chrom=$1; start=$2; end=$3; annot=$4}
  END {emit()}
' file1

Output:

1   1   15  A,B,C
1   20  30  D
1   35  45  E,F
1   50  65  G,H
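
Since the question asks for File 2 as input, here is a rough sketch of that route as well. It treats every row of File 2 as a link between two events and merges everything that is linked, directly or through a chain (A and C end up together via B), so the 50% criterion already applied by intersectBed is preserved. The function name `merge_pairs` is made up for illustration, and the `$6 != -1` test assumes that unmatched `-loj` rows carry -1 in the second start column.

```shell
# Hypothetical function: reads 8-column pair rows (like File 2) and
# prints one merged line per group of linked events.
merge_pairs() {
    awk -v OFS="\t" '
    # Follow parent links up to the representative of a group.
    function find(x) { while (parent[x] != x) x = parent[x]; return x }
    # Record an event the first time it is seen, keeping input order.
    function add(k, chr, s, e, lab) {
        if (k in parent) return
        parent[k] = k; order[++n] = k
        chrom[k] = chr; start[k] = s + 0; end[k] = e + 0; annot[k] = lab
    }
    NF >= 8 && $6 != -1 {      # skip malformed and unmatched rows
        a = $1 SUBSEP $2 SUBSEP $3 SUBSEP $4
        b = $5 SUBSEP $6 SUBSEP $7 SUBSEP $8
        add(a, $1, $2, $3, $4); add(b, $5, $6, $7, $8)
        ra = find(a); rb = find(b)
        if (ra != rb) parent[rb] = ra      # join the two groups
    }
    END {
        # Fold each event into its group: widest span, joined labels.
        for (i = 1; i <= n; i++) {
            k = order[i]; r = find(k)
            if (!(r in names)) {
                roots[++m] = r
                lo[r] = start[k]; hi[r] = end[k]; names[r] = annot[k]
                continue
            }
            if (start[k] < lo[r]) lo[r] = start[k]
            if (end[k] > hi[r]) hi[r] = end[k]
            names[r] = names[r] "," annot[k]
        }
        for (i = 1; i <= m; i++) {
            r = roots[i]
            print chrom[r], lo[r], hi[r], names[r]
        }
    }' "$@"
}

# Usage: merge_pairs File2 > File3
```

Unlike the script above, this one does not need sorted input, because groups are built from the pair links rather than from adjacent lines.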