Insert a line after a pattern match

Asked: 2015-09-27 15:18:42

Tags: bash awk

I have a file like this:

Scaffold2   GeneWise        mRNA    3038    6649 
Scaffold2   GeneWise        CDS     3038    3480
Scaffold2   GeneWise        CDS     4175    4291
Scaffold3   GeneWise        mRNA    2824    15173
Scaffold3   GeneWise        CDS     2824    3302
Scaffold3   GeneWise        CDS     4143    4344

I want this output:

Scaffold2   GeneWise        mRNA    3038    6649 
Scaffold2   GeneWise        CDS     3038    **3480**
Scaffold2   GeneWise        1st_intron     **3480    4175**
Scaffold2   GeneWise        CDS     **4175**    4291
Scaffold3   GeneWise        mRNA    2824    15173
Scaffold3   GeneWise        CDS     2824    **3302**
Scaffold3   GeneWise        1st_intron     **3302    4143**
Scaffold3   GeneWise        CDS     **4143**    4344

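The logic should be: if column 3 is 'mRNA', take column 5 of the next line and column 4 of the line after that, then insert a new line between those two lines that holds these two values as its columns 4 and 5 (the bold numbers above), with its third column set to '1st_intron'.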

I have never dealt with a problem like this, so any hints would be great.

4 answers:

Answer 0 (score: 2)

You can use this simple awk:

awk '$3=="mRNA"{p=1; print; next}
     p{s=$1 FS $2 FS "1st_intron" FS $5; print; p=0; next}
     s{print s, $4; s=""} 1' file | column -t

Output:

Scaffold2  GeneWise  mRNA        3038  6649
Scaffold2  GeneWise  CDS         3038  3480
Scaffold2  GeneWise  1st_intron  3480  4175
Scaffold2  GeneWise  CDS         4175  4291
Scaffold3  GeneWise  mRNA        2824  15173
Scaffold3  GeneWise  CDS         2824  3302
Scaffold3  GeneWise  1st_intron  3302  4143
Scaffold3  GeneWise  CDS         4143  4344

column -t is used only to format the output.

Answer 1 (score: 1)

$ cat tst.awk
p1 == "mRNA" { x=$5 }
p2 == "mRNA" { print $1, $2, "1st_intron", x, $4 }
{ print; p2=p1; p1=$3 }

$ awk -f tst.awk file | column -t
Scaffold2  GeneWise  mRNA        3038  6649
Scaffold2  GeneWise  CDS         3038  3480
Scaffold2  GeneWise  1st_intron  3480  4175
Scaffold2  GeneWise  CDS         4175  4291
Scaffold3  GeneWise  mRNA        2824  15173
Scaffold3  GeneWise  CDS         2824  3302
Scaffold3  GeneWise  1st_intron  3302  4143
Scaffold3  GeneWise  CDS         4143  4344

Answer 2 (score: 0)

A Perl solution.

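$intron is 0 when nothing needs to be done. It is set to 1 when an mRNA line is processed, so that on the next line $left can record that line's end coordinate (column 5) and $intron is set to 2. On the line after that, the intron line is printed and $intron is reset to 0.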

#!/usr/bin/perl
use warnings;
use strict;

my $intron = 0;
my ($left, $right);
while (<>) {
    my @items = split;

    if (1 == $intron) {
        $left = $items[4];
        $intron = 2;

    } elsif (2 == $intron) {
        print join "\t", @items[0, 1], '1st_intron', $left, $items[3];
        print "\n";
        $intron = 0;
    }

    $intron = 1 if 'mRNA' eq $items[2];
    print;
}
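For comparison, the three-state counter used by the Perl script above can be sketched in awk as well. This is only an illustration under stated assumptions: the file names sample.txt and annotated.txt and the variable names state and left are made up here, the input is whitespace-separated, and the inserted line is written tab-separated while all other lines are echoed unchanged.

```shell
#!/bin/sh
# Hypothetical sample file reproducing the input from the question.
cat > sample.txt <<'EOF'
Scaffold2   GeneWise        mRNA    3038    6649
Scaffold2   GeneWise        CDS     3038    3480
Scaffold2   GeneWise        CDS     4175    4291
Scaffold3   GeneWise        mRNA    2824    15173
Scaffold3   GeneWise        CDS     2824    3302
Scaffold3   GeneWise        CDS     4143    4344
EOF

# state: 0 = idle, 1 = previous line was mRNA, 2 = first CDS already seen.
# The rules run top to bottom for each line, so a line never triggers
# both the state 1 and the state 2 action.
awk -v OFS='\t' '
    state == 2   { print $1, $2, "1st_intron", left, $4; state = 0 }  # second CDS: emit the intron before it
    state == 1   { left = $5; state = 2 }                             # first CDS: remember its end coordinate
    $3 == "mRNA" { state = 1 }                                        # arm the counter
                 { print }                                            # always echo the input line
' sample.txt > annotated.txt

cat annotated.txt
```

Piping the result through column -t, as the awk answers do, would align the columns again.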

Answer 3 (score: 0)

awk has a handy look-ahead function, "getline":

awk '$3=="mRNA"{print;getline;c5=$5;print;getline;print $1," ",$2,"       1st_intron",c5,$4;print}' file

Test:

Scaffold2   GeneWise        mRNA    3038    6649
Scaffold2   GeneWise        CDS     3038    3480
Scaffold2   GeneWise        1st_intron 3480 4175
Scaffold2   GeneWise        CDS     4175    4291
Scaffold3   GeneWise        mRNA    2824    15173
Scaffold3   GeneWise        CDS     2824    3302
Scaffold3   GeneWise        1st_intron 3302 4143
Scaffold3   GeneWise        CDS     4143    4344
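One caveat with the getline one-liner: it prints only the mRNA line and the two lines after it, so any further lines of a gene would be dropped, and it does not check whether getline actually read a line. A more defensive sketch might look like the following. The file names sample.txt and getline_out.txt and the variable left are assumptions for illustration, and the extra CDS line appended to the sample is made up (it is not in the question) purely to show that later lines pass through.

```shell
#!/bin/sh
# Hypothetical sample file reproducing the input from the question.
cat > sample.txt <<'EOF'
Scaffold2   GeneWise        mRNA    3038    6649
Scaffold2   GeneWise        CDS     3038    3480
Scaffold2   GeneWise        CDS     4175    4291
Scaffold3   GeneWise        mRNA    2824    15173
Scaffold3   GeneWise        CDS     2824    3302
Scaffold3   GeneWise        CDS     4143    4344
EOF
# One extra, made-up CDS line (not in the question) to show pass-through.
printf 'Scaffold3   GeneWise        CDS     5000    5100\n' >> sample.txt

awk -v OFS='\t' '
    $3 == "mRNA" {
        print                                  # the mRNA line itself
        if ((getline) <= 0) exit               # input ended right after the mRNA line
        left = $5; print                       # first CDS: keep its end coordinate
        if ((getline) <= 0) exit               # input ended after the first CDS
        print $1, $2, "1st_intron", left, $4   # intron goes before the second CDS
    }
    { print }                                  # second CDS, or any other line
' sample.txt > getline_out.txt

cat getline_out.txt
```

The trailing { print } rule is what keeps the third and later CDS lines; the mRNA block falls through to it after emitting the intron, so the second CDS is printed exactly once.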