背景
多fasta格式包含几个序列记录,每个记录以单行描述开头,后面是几行序列(RNA,DNA,蛋白质)。在">"之后,描述行在开头有一个greaterthan符号。是序列的标识符,行的其余部分包含记录的描述(两者都是可选的)。
fasta文件中的通常是格式化为最大宽度
的行序列输入示例,最大宽度=" 70":
>gi|304322925|ref|YP_003856771.1| NADH [Lynx rufus] MMTYIVFILSTIFVVSFVGFSSKPSPIYGGFGLIVAGGIGCGIVLNFGGSFLGLMVFLIYLGGMLVVFGY TTAMATEPYPEAWTSNKAVLGMLITGILAELLTACYILKEDEIEVVFKFNGAGDWVIYDTGDSGFFSEEA MGIAALYSYGTWLVVVTGWSLLIGVLVIMEVTRGN >gi|295065592|ref|YP_003587393.1| NADH [Nomascus siki] MTYTLFLLSVILVMGFVGFSSKPSPIYGGLVLVVSGVVGCAVILNCGGGYLGLMVFLIYLGGMMVVFGYT TAMAIEEYPEAWGSGVEVLVGVLVGLAMEVGLVLWAKECDGLVMVLNFNNMGSWVIYEGEGSGLIREDSI GAGALYDYGRWLVVVTGWTLLVGVYIVIEIARGN >gi|295065550|ref|YP_003587316.1| NADH (mitochondrion) [Symphalangus syndactylus] MTYTLFLLSVILVMGFVGFSSKPSPIYGGLVLVVSGVVGCAIILDCGGGYLGLMVFLIYLGGMMVVFGYT TAMAIEEYPEAWGSGVEVLVGVLVGLAMEVGLVLWAKEYDGLVVVLNFNNMGSWVIYEGEGSGLIREDSI GAGALYDYGRWLVVVTGWTLLVGVYIVIEIARGN
我的努力
我写了一个gawk脚本,它确实在序列中保留修剪(切割前N个碱基),如果序列的长度小于修剪,则不打印它。
gawk -v left_trim=15 -v width=70 '
BEGIN{RS=">"}
NR==1{next}; #for blank record
{
split ($0, a, "\n");
sequence="";
for(i=2; i<=length(a); i++){
sequence=sequence a[i];
};
if(length(sequence)>left_trim) {
printf(">%s\n%s\n",a[1], substr(sequence, left_trim+1));
}
}' test.fasta
结果
>gi|304322925|ref|YP_003856771.1| NADH [Lynx rufus] SFVGFSSKPSPIYGGFGLIVAGGIGCGIVLNFGGSFLGLMVFLIYLGGMLVVFGYTTAMATEPYPEAWTSNKAVLGMLITGILAELLTACYILKEDEIEVVFKFNGAGDWVIYDTGDSGFFSEEAMGIAALYSYGTWLVVVTGWSLLIGVLVIMEVTRGN >gi|295065592|ref|YP_003587393.1| NADH [Nomascus siki] FVGFSSKPSPIYGGLVLVVSGVVGCAVILNCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSGVEVLVGVLVGLAMEVGLVLWAKECDGLVMVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVVTGWTLLVGVYIVIEIARGN >gi|295065550|ref|YP_003587316.1| NADH (mitochondrion) [Symphalangus syndactylus] FVGFSSKPSPIYGGLVLVVSGVVGCAIILDCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSGVEVLVGVLVGLAMEVGLVLWAKEYDGLVVVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVVTGWTLLVGVYIVIEIARGN
预期
>gi|304322925|ref|YP_003856771.1| NADH [Lynx rufus] SFVGFSSKPSPIYGGFGLIVAGGIGCGIVLNFGGSFLGLMVFLIYLGGMLVVFGYTTAMATEPYPEAWTS NKAVLGMLITGILAELLTACYILKEDEIEVVFKFNGAGDWVIYDTGDSGFFSEEAMGIAALYSYGTWLVV VTGWSLLIGVLVIMEVTRGN >gi|295065592|ref|YP_003587393.1| NADH [Nomascus siki] FVGFSSKPSPIYGGLVLVVSGVVGCAVILNCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG VEVLVGVLVGLAMEVGLVLWAKECDGLVMVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV TGWTLLVGVYIVIEIARGN >gi|295065550|ref|YP_003587316.1| NADH (mitochondrion) [Symphalangus syndactylus] FVGFSSKPSPIYGGLVLVVSGVVGCAIILDCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG VEVLVGVLVGLAMEVGLVLWAKEYDGLVVVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV TGWTLLVGVYIVIEIARGN
问题
我怎样才能实现宽度格式?我尝试使用">%s\n%.70s\n"
,打印前70个字母,">%s\n%70s\n"
打印所有
我如何改善此加入?退出函数?
sequence="";
for(i=2; i<=length(a); i++){
sequence=sequence a[i];
};
答案 0 :(得分:1)
要回答您的具体问题,您可以使用*
格式修饰符指定输出字段的宽度:
$ awk 'BEGIN{printf "%s\n", "foo"}'
foo
$ awk 'BEGIN{printf "%*s\n", 10, "foo"}'
foo
并且没有,没有join
函数将数组重新组合成一个字符串(与split()
相反)但是如果你让awk将记录拆分为字段而不是手动拆分记录到一个元素数组中,然后你可以为任何字段分配一个新值,awk会将字段重新编译为$ 0,因为当我删除第一个字段时,我在下面的第一个解决方案中做了一个故意的副作用/与$1 = ""
对齐。
以下是我如何处理任务的方法:
$ cat tst.awk
BEGIN { RS=">"; FS="\n"; OFS="" }
NR>1 {
print RS $1
$1 = ""
for ( start=left_trim+1; start<=length(); start+=width ) {
print substr($0,start,width)
}
}
$ awk -v left_trim=15 -v width=70 -f tst.awk file
>gi|304322925|ref|YP_003856771.1| NADH [Lynx rufus]
SFVGFSSKPSPIYGGFGLIVAGGIGCGIVLNFGGSFLGLMVFLIYLGGMLVVFGYTTAMATEPYPEAWTS
NKAVLGMLITGILAELLTACYILKEDEIEVVFKFNGAGDWVIYDTGDSGFFSEEAMGIAALYSYGTWLVV
VTGWSLLIGVLVIMEVTRGN
>gi|295065592|ref|YP_003587393.1| NADH [Nomascus siki]
FVGFSSKPSPIYGGLVLVVSGVVGCAVILNCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG
VEVLVGVLVGLAMEVGLVLWAKECDGLVMVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV
TGWTLLVGVYIVIEIARGN
>gi|295065550|ref|YP_003587316.1| NADH (mitochondrion) [Symphalangus syndactylus]
FVGFSSKPSPIYGGLVLVVSGVVGCAIILDCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG
VEVLVGVLVGLAMEVGLVLWAKEYDGLVVVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV
TGWTLLVGVYIVIEIARGN
或者如果您愿意:
$ cat tst.awk
/^>/ { prtRec(); rec=""; print; next }
{ rec = rec $0 }
END { prtRec() }
function prtRec( start) {
for ( start=left_trim+1; start<=length(rec); start+=width ) {
print substr(rec,start,width)
}
}
$ awk -v left_trim=15 -v width=70 -f tst.awk file
>gi|304322925|ref|YP_003856771.1| NADH [Lynx rufus]
SFVGFSSKPSPIYGGFGLIVAGGIGCGIVLNFGGSFLGLMVFLIYLGGMLVVFGYTTAMATEPYPEAWTS
NKAVLGMLITGILAELLTACYILKEDEIEVVFKFNGAGDWVIYDTGDSGFFSEEAMGIAALYSYGTWLVV
VTGWSLLIGVLVIMEVTRGN
>gi|295065592|ref|YP_003587393.1| NADH [Nomascus siki]
FVGFSSKPSPIYGGLVLVVSGVVGCAVILNCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG
VEVLVGVLVGLAMEVGLVLWAKECDGLVMVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV
TGWTLLVGVYIVIEIARGN
>gi|295065550|ref|YP_003587316.1| NADH (mitochondrion) [Symphalangus syndactylus]
FVGFSSKPSPIYGGLVLVVSGVVGCAIILDCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG
VEVLVGVLVGLAMEVGLVLWAKEYDGLVVVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV
TGWTLLVGVYIVIEIARGN