我处理生物数据(拷贝数变异),显示为间隔(制表符分隔文件):
档案1
Columns: Chromosome, Start, End, Annotation
1 1 10 A
1 3 12 B
1 7 15 C
1 20 30 D
1 35 45 E
1 37 45 F
1 50 60 G
1 50 65 H
我与他们交叉以巩固重叠事件(50%的重叠是我的条件),结果如下:
我使用了Bedtools(http://bedtools.readthedocs.org/en/latest/content/tools/intersect.html)中的intersectBed:
$ intersectBed -a File1 -b File1 -loj -f 0.50 -r > File 2
文件2
Columns: Chromosome, Start, End, Annotation , Chromosome, Start, End, Annotation
1 1 10 A 1 1 10 A
1 1 10 A 1 3 12 B
1 3 12 B 1 1 10 A
1 3 12 B 1 3 12 B
1 3 12 B 1 7 15 C
1 7 15 C 1 3 12 B
1 7 15 C 1 7 15 C
1 20 30 D 1 20 30 D
1 35 45 E 1 35 45 E
1 35 45 E 1 37 45 F
1 37 45 F 1 35 45 E
1 37 45 F 1 37 45 F
1 50 60 G 1 50 60 G
1 50 60 G 1 50 65 H
1 50 65 H 1 50 60 G
1 50 65 H 1 50 65 H
事件A和事件C与事件B重叠,事件E和F彼此重叠,如G和H,最后事件D没有重叠的伙伴。知道了这一点,合并的CNV列表应该是:
档案3
1 1 15 A,B,C
1 20 30 D
1 35 45 E,F
1 50 65 G,H
我试图使用HDCNV java软件(http://daleylab.org/lab/?page_id=125)的合并选项,但输出不是我需要的。我试着写一个perl代码,但我是初学者,所以这个问题目前已超出我的极限。
如果您能帮助我使用文件2作为输入并输出文件3的漂亮的perl或awk代码,我将不胜感激。
提前致谢
答案 0 :(得分:0)
我假设这些列具有以下含义:
此脚本查找指定区域之间的重叠区域。它假定输入文本按col 1然后col 2排序。我已将输入文本放在一个字符串中,但您可能会从文件中读取它(并将数据输出到文件中)。我会告诉你如何做到这一点 - 这很简单,perl网站上有很多文档。
#!/usr/bin/perl
use strict;
use warnings;
use feature ":5.10";
use Data::Dumper;
my $text = '1 1 10 A
1 3 12 B
1 7 15 C
1 20 30 D
1 35 45 E
1 37 45 F
1 50 60 G
1 50 65 H
2 1 10 I
2 3 12 J
2 7 15 K
2 20 30 L
2 35 45 M
2 37 45 N
2 50 60 O
2 50 65 P
';
# we have tab-delimited data.
# split on line breaks, remove line ending, split on tabs
my @lines = map { chomp; [ split(/\t/, $_) ]; } split("\n", $text);
my $col_0 = 1;
my $min = 0;
my $max = 0;
my @range;
foreach (@lines) {
# the chromosome number has changed or
# minimum is greater than current maximum:
# start a new interval
if ($col_0 != $_->[0] || $_->[1] > $max) {
if (@range) {
# print out the range, and restart the stack
say join("\t",
$col_0,
( $min || $_->[1] ),
( $max || $_->[2] ),
join(", ", @range)
);
}
@range = ( $_->[3] );
# set the min and max
$col_0 = $_->[0];
$min = $_->[1];
$max = $_->[2];
}
else {
# the minimum is lower than our current maximum.
# check whether the max is greater than our current
# max and increase it if so. Add the letter to the
# current range.
if ($_->[2] > $max) {
$max = $_->[2];
}
push @range, $_->[3];
}
}
# print out the last line
say join("\t", $col_0, $min, $max, join(", ", @range) );
输出:
1 1 15 A, B, C
1 20 30 D
1 35 45 E, F
1 50 65 G, H
2 1 15 I, J, K
2 20 30 L
2 35 45 M, N
2 50 65 O, P
我刚刚计算了简单的重叠 - 这并没有做50%的重叠。使用此脚本作为开始,您可以弄清楚如何执行此操作。我们没有为你做博士学位! ;)
答案 1 :(得分:0)
awk '
$2 > end && NR>1 {
print "1", start, end, pair;
start=end=pair=0
}
{
if (!start) { start = $2 };
end = $3;
pair = (pair ? pair "," $4 : $4)
}
END {
print "1", start, end, pair
}' file
1 1 15 A,B,C
1 20 30 D
1 35 45 E,F
1 50 65 G,H
答案 2 :(得分:0)
假设有序数据,以下存根应该处理合并记录。
只需修改它就可以加载并输出到文件中。
use strict;
use warnings;
use List::Util qw(min max);
my $last;
while (<DATA>) {
my @fields = split;
if ( !$last ) {
$last = \@fields;
} elsif ( $last->[0] == $fields[0] && $last->[2] > $fields[1] ) {
$last->[1] = min( $last->[1], $fields[1] );
$last->[2] = max( $last->[2], $fields[2] );
$last->[3] .= ",$fields[3]";
} else {
print join( "\t", @$last ), "\n";
$last = \@fields;
}
}
print join( "\t", @$last ), "\n";
__DATA__
1 1 10 A
1 3 12 B
1 7 15 C
1 20 30 D
1 35 45 E
1 37 45 F
1 50 60 G
1 50 65 H
2 1 10 I
2 3 12 J
2 7 15 K
2 20 30 L
2 35 45 M
2 37 45 N
2 50 60 O
2 50 65 P
输出:
1 1 15 A,B,C
1 20 30 D
1 35 45 E,F
1 50 65 G,H
2 1 15 I,J,K
2 20 30 L
2 35 45 M,N
2 50 65 O,P
答案 3 :(得分:0)
我的看法:
awk -F "\t" -v OFS="\t" '
function emit() {print chrom, start, end, annot}
$1 == chrom && ((start<=$2 && $2<=end) || (start<=$3 && $3<=end)) {
annot = annot "," $4
if ($2 < start) start = $2
if ($3 > end) end = $3
next
}
chrom {emit()}
{chrom=$1; start=$2; end=$3; annot=$4}
END {emit()}
' file1
1 1 15 A,B,C
1 20 30 D
1 35 45 E,F
1 50 65 G,H