使用awk修剪多个fasta文件中的前N个碱基,并使用最大宽度格式进行打印

时间:2015-05-15 16:51:50

标签: awk gawk fasta

背景

多fasta格式包含几个序列记录,每个记录以单行描述开头,后面是几行序列(RNA,DNA,蛋白质)。在">"之后,描述行在开头有一个greaterthan符号。是序列的标识符,行的其余部分包含记录的描述(两者都是可选的)。

fasta文件中的

通常是格式化为最大宽度

的行序列

输入示例,最大宽度=" 70":

>gi|304322925|ref|YP_003856771.1| NADH [Lynx rufus]
MMTYIVFILSTIFVVSFVGFSSKPSPIYGGFGLIVAGGIGCGIVLNFGGSFLGLMVFLIYLGGMLVVFGY
TTAMATEPYPEAWTSNKAVLGMLITGILAELLTACYILKEDEIEVVFKFNGAGDWVIYDTGDSGFFSEEA
MGIAALYSYGTWLVVVTGWSLLIGVLVIMEVTRGN
>gi|295065592|ref|YP_003587393.1| NADH [Nomascus siki]
MTYTLFLLSVILVMGFVGFSSKPSPIYGGLVLVVSGVVGCAVILNCGGGYLGLMVFLIYLGGMMVVFGYT
TAMAIEEYPEAWGSGVEVLVGVLVGLAMEVGLVLWAKECDGLVMVLNFNNMGSWVIYEGEGSGLIREDSI
GAGALYDYGRWLVVVTGWTLLVGVYIVIEIARGN
>gi|295065550|ref|YP_003587316.1| NADH (mitochondrion) [Symphalangus syndactylus]
MTYTLFLLSVILVMGFVGFSSKPSPIYGGLVLVVSGVVGCAIILDCGGGYLGLMVFLIYLGGMMVVFGYT
TAMAIEEYPEAWGSGVEVLVGVLVGLAMEVGLVLWAKEYDGLVVVLNFNNMGSWVIYEGEGSGLIREDSI
GAGALYDYGRWLVVVTGWTLLVGVYIVIEIARGN

我的努力

我写了一个gawk脚本,它确实在序列中保留修剪(切割前N个碱基),如果序列的长度小于修剪,则不打印它。

gawk -v left_trim=15 -v width=70 '
  BEGIN{RS=">"}
  NR==1{next};        #for blank record
  {
    split ($0, a, "\n"); 
    sequence=""; 
    for(i=2; i<=length(a); i++){
      sequence=sequence a[i];
    }; 
    if(length(sequence)>left_trim) {
      printf(">%s\n%s\n",a[1], substr(sequence, left_trim+1));
    }
  }' test.fasta

结果

>gi|304322925|ref|YP_003856771.1| NADH [Lynx rufus]
SFVGFSSKPSPIYGGFGLIVAGGIGCGIVLNFGGSFLGLMVFLIYLGGMLVVFGYTTAMATEPYPEAWTSNKAVLGMLITGILAELLTACYILKEDEIEVVFKFNGAGDWVIYDTGDSGFFSEEAMGIAALYSYGTWLVVVTGWSLLIGVLVIMEVTRGN
>gi|295065592|ref|YP_003587393.1| NADH [Nomascus siki]
FVGFSSKPSPIYGGLVLVVSGVVGCAVILNCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSGVEVLVGVLVGLAMEVGLVLWAKECDGLVMVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVVTGWTLLVGVYIVIEIARGN
>gi|295065550|ref|YP_003587316.1| NADH (mitochondrion) [Symphalangus syndactylus]
FVGFSSKPSPIYGGLVLVVSGVVGCAIILDCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSGVEVLVGVLVGLAMEVGLVLWAKEYDGLVVVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVVTGWTLLVGVYIVIEIARGN

预期

>gi|304322925|ref|YP_003856771.1| NADH [Lynx rufus]
SFVGFSSKPSPIYGGFGLIVAGGIGCGIVLNFGGSFLGLMVFLIYLGGMLVVFGYTTAMATEPYPEAWTS
NKAVLGMLITGILAELLTACYILKEDEIEVVFKFNGAGDWVIYDTGDSGFFSEEAMGIAALYSYGTWLVV
VTGWSLLIGVLVIMEVTRGN
>gi|295065592|ref|YP_003587393.1| NADH [Nomascus siki]
FVGFSSKPSPIYGGLVLVVSGVVGCAVILNCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG
VEVLVGVLVGLAMEVGLVLWAKECDGLVMVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV
TGWTLLVGVYIVIEIARGN
>gi|295065550|ref|YP_003587316.1| NADH (mitochondrion) [Symphalangus syndactylus]
FVGFSSKPSPIYGGLVLVVSGVVGCAIILDCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG
VEVLVGVLVGLAMEVGLVLWAKEYDGLVVVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV
TGWTLLVGVYIVIEIARGN

问题

  • 我怎样才能实现宽度格式?我尝试使用">%s\n%.70s\n",打印前70个字母,">%s\n%70s\n"打印所有

  • 我如何改善此加入?退出函数?

    sequence=""; 
    for(i=2; i<=length(a); i++){
      sequence=sequence a[i];
    };
    

1 个答案:

答案 0 :(得分:1)

要回答您的具体问题,您可以使用*格式修饰符指定输出字段的宽度:

$ awk 'BEGIN{printf "%s\n", "foo"}'
foo
$ awk 'BEGIN{printf "%*s\n", 10, "foo"}'
       foo

并且没有,没有join函数将数组重新组合成一个字符串(与split()相反)但是如果你让awk将记录拆分为字段而不是手动拆分记录到一个元素数组中,然后你可以为任何字段分配一个新值,awk会将字段重新编译为$ 0,因为当我删除第一个字段时,我在下面的第一个解决方案中做了一个故意的副作用/与$1 = ""对齐。

以下是我如何处理任务的方法:

$ cat tst.awk
BEGIN { RS=">"; FS="\n"; OFS="" }
NR>1 {
    print RS $1
    $1 = ""
    for ( start=left_trim+1; start<=length(); start+=width ) {
        print substr($0,start,width)
    }
}

$ awk -v left_trim=15 -v width=70 -f tst.awk file
>gi|304322925|ref|YP_003856771.1| NADH [Lynx rufus]
SFVGFSSKPSPIYGGFGLIVAGGIGCGIVLNFGGSFLGLMVFLIYLGGMLVVFGYTTAMATEPYPEAWTS
NKAVLGMLITGILAELLTACYILKEDEIEVVFKFNGAGDWVIYDTGDSGFFSEEAMGIAALYSYGTWLVV
VTGWSLLIGVLVIMEVTRGN
>gi|295065592|ref|YP_003587393.1| NADH [Nomascus siki]
FVGFSSKPSPIYGGLVLVVSGVVGCAVILNCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG
VEVLVGVLVGLAMEVGLVLWAKECDGLVMVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV
TGWTLLVGVYIVIEIARGN
>gi|295065550|ref|YP_003587316.1| NADH (mitochondrion) [Symphalangus syndactylus]
FVGFSSKPSPIYGGLVLVVSGVVGCAIILDCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG
VEVLVGVLVGLAMEVGLVLWAKEYDGLVVVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV
TGWTLLVGVYIVIEIARGN

或者如果您愿意:

$ cat tst.awk
/^>/ { prtRec(); rec=""; print; next }
{ rec = rec $0 }
END { prtRec() }
function prtRec(        start) {
    for ( start=left_trim+1; start<=length(rec); start+=width ) {
        print substr(rec,start,width)
    }
}

$ awk -v left_trim=15 -v width=70 -f tst.awk file
>gi|304322925|ref|YP_003856771.1| NADH [Lynx rufus]
SFVGFSSKPSPIYGGFGLIVAGGIGCGIVLNFGGSFLGLMVFLIYLGGMLVVFGYTTAMATEPYPEAWTS
NKAVLGMLITGILAELLTACYILKEDEIEVVFKFNGAGDWVIYDTGDSGFFSEEAMGIAALYSYGTWLVV
VTGWSLLIGVLVIMEVTRGN
>gi|295065592|ref|YP_003587393.1| NADH [Nomascus siki]
FVGFSSKPSPIYGGLVLVVSGVVGCAVILNCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG
VEVLVGVLVGLAMEVGLVLWAKECDGLVMVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV
TGWTLLVGVYIVIEIARGN
>gi|295065550|ref|YP_003587316.1| NADH (mitochondrion) [Symphalangus syndactylus]
FVGFSSKPSPIYGGLVLVVSGVVGCAIILDCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG
VEVLVGVLVGLAMEVGLVLWAKEYDGLVVVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV
TGWTLLVGVYIVIEIARGN