I have a file like this:
Scaffold2 GeneWise mRNA 3038 6649
Scaffold2 GeneWise CDS 3038 3480
Scaffold2 GeneWise CDS 4175 4291
Scaffold3 GeneWise mRNA 2824 15173
Scaffold3 GeneWise CDS 2824 3302
Scaffold3 GeneWise CDS 4143 4344
I want the output to be:
Scaffold2 GeneWise mRNA 3038 6649
Scaffold2 GeneWise CDS 3038 **3480**
Scaffold2 GeneWise 1st_intron **3480 4175**
Scaffold2 GeneWise CDS **4175** 4291
Scaffold3 GeneWise mRNA 2824 15173
Scaffold3 GeneWise CDS 2824 **3302**
Scaffold3 GeneWise 1st_intron **3302 4143**
Scaffold3 GeneWise CDS **4143** 4344
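It should work like this: if column 3 is 'mRNA', take the 5th column of the next line and the 4th column of the line after that, then insert a new line between those two lines with these two values in columns 4 and 5 (shown in bold) and '1st_intron' in column 3.
I have never dealt with a problem like this, so any hints would be great.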
Answer 0 (score: 2)
You can use this simple awk:
awk '$3=="mRNA"{p=1; print; next}
p{s=$1 FS $2 FS "1st_intron" FS $5; print; p=0; next}
s{print s, $4; s=""} 1' file | column -t
**Output:**
Scaffold2 GeneWise mRNA 3038 6649
Scaffold2 GeneWise CDS 3038 3480
Scaffold2 GeneWise 1st_intron 3480 4175
Scaffold2 GeneWise CDS 4175 4291
Scaffold3 GeneWise mRNA 2824 15173
Scaffold3 GeneWise CDS 2824 3302
Scaffold3 GeneWise 1st_intron 3302 4143
Scaffold3 GeneWise CDS 4143 4344
column -t is used only to format the output.
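The same flag-based idea can be spelled out with comments, which may be easier to adapt. This is only a sketch: the function name add_first_intron and the awk variable names are made up for illustration, and it assumes the first two CDS lines directly follow their mRNA line, as in the sample.

```shell
# Hypothetical helper: reads the annotation from the files given as
# arguments, or from stdin when no file is given.
add_first_intron() {
  awk '
    # mRNA line: print it, drop any stale state, expect the first CDS next.
    $3 == "mRNA" { first_cds = 1; have_intron = 0; print; next }

    # First CDS after an mRNA: its end (column 5) is the intron start.
    first_cds {
      intron = $1 OFS $2 OFS "1st_intron" OFS $5
      have_intron = 1
      first_cds = 0
      print
      next
    }

    # Second CDS: its start (column 4) closes the intron, so emit the
    # intron line before the CDS line itself is printed.
    have_intron { print intron, $4; have_intron = 0 }

    # Every remaining line is printed unchanged.
    { print }
  ' "$@"
}
```

Usage would be something like add_first_intron file | column -t.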
Answer 1 (score: 1)
$ cat tst.awk
p1 == "mRNA" { x=$5 }
p2 == "mRNA" { print $1, $2, "1st_intron", x, $4 }
{ print; p2=p1; p1=$3 }
$ awk -f tst.awk file | column -t
Scaffold2 GeneWise mRNA 3038 6649
Scaffold2 GeneWise CDS 3038 3480
Scaffold2 GeneWise 1st_intron 3480 4175
Scaffold2 GeneWise CDS 4175 4291
Scaffold3 GeneWise mRNA 2824 15173
Scaffold3 GeneWise CDS 2824 3302
Scaffold3 GeneWise 1st_intron 3302 4143
Scaffold3 GeneWise CDS 4143 4344
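Answer 2 (score: 0)
A Perl solution.
$intron is 0 when nothing needs to be done. When an mRNA line is processed, it is set to 1, so that on the next line $left can remember the end coordinate (column 5) and $intron is set to 2. On the line after that, the value 2 triggers printing the intron line and resets $intron to 0.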
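#!/usr/bin/perl
use warnings;
use strict;

# State: 0 = nothing pending, 1 = previous line was an mRNA, 2 = first CDS seen.
my $intron = 0;
my $left;    # end coordinate (column 5) of the first CDS

while (<>) {
    my @items = split;
    if (1 == $intron) {
        $left = $items[4];
        $intron = 2;
    } elsif (2 == $intron) {
        # Print the intron line before the second CDS line is printed.
        print join "\t", @items[0, 1], '1st_intron', $left, $items[3];
        print "\n";
        $intron = 0;
    }
    $intron = 1 if 'mRNA' eq $items[2];
    print;
}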
Answer 3 (score: 0)
awk has a nice look-ahead function, "getline":
awk '$3=="mRNA"{print; getline; c5=$5; print; getline; print $1, $2, "1st_intron", c5, $4} 1' file
Test:
Scaffold2 GeneWise mRNA 3038 6649
Scaffold2 GeneWise CDS 3038 3480
Scaffold2 GeneWise 1st_intron 3480 4175
Scaffold2 GeneWise CDS 4175 4291
Scaffold3 GeneWise mRNA 2824 15173
Scaffold3 GeneWise CDS 2824 3302
Scaffold3 GeneWise 1st_intron 3302 4143
Scaffold3 GeneWise CDS 4143 4344
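One caveat with getline is that it silently leaves the current record unchanged when the input ends, so a truncated file can produce duplicated or bogus lines. A slightly more defensive sketch checks getline's return value. The function name add_first_intron_getline and the variable name are made up for illustration, and the sketch still assumes that the two lines after each mRNA line are its first two CDS lines.

```shell
# Hypothetical helper: reads from the files given as arguments, or from stdin.
add_first_intron_getline() {
  awk '
    $3 == "mRNA" {
      print
      # Read the first CDS; stop if the input ends here.
      if ((getline) <= 0) exit
      end_of_first_cds = $5
      print
      # Read the second CDS; stop if the input ends here.
      if ((getline) <= 0) exit
      # Emit the intron line; the second CDS is printed by the rule below.
      print $1, $2, "1st_intron", end_of_first_cds, $4
    }
    # Print the current record: either the second CDS or an untouched line.
    { print }
  ' "$@"
}
```

Usage would be something like add_first_intron_getline file | column -t.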