更改fasta文件头

时间:2017-03-08 23:47:31

标签: python awk bioinformatics fasta

我有一些fasta文件,我想更改标题

>XP_001267680.1 conserved hypothetical protein [Aspergillus clavatus NRRL 1]
MTEILARLTAPSAYRYASCEILEDYGRQLRELIAYIKQPRTTADIATAAEFLLDNLDPSLHSASY...
>XP_001267682.1 60S ribosomal protein L18 [Aspergillus clavatus NRRL 1]
MGIDLDRHHVRSTHRKAPKSENVYLQVLVKLYRFLSRRTESNFNKVVLRRLFMSRINRPPVS...
etc...

我想更改fasta文件,看起来像这样:

>Acla00001
MTEILARLTAPSAYRYASCEILEDYGRQLRELIAYIKQPRTTADIATAAEFLLDNLDPSLHSASY...
>Acla00002
MGIDLDRHHVRSTHRKAPKSENVYLQVLVKLYRFLSRRTESNFNKVVLRRLFMSRINRPPVS...
...
>Acla03871
MTEILARLTAPSAYRYASCEILEDYGRQLRELIAYIKQPRTTADIATAAEFLLDNLDPSLHSASYLF...
>Acla03872
MGIDLDRHHVRSTHRKAPKSENVYLQVLVKLYRFLSRRTESNFNKVVLRRLFMSRINRPPVSL...

如果该行以>开头,我发现这段代码可以删除所有内容并添加一个新的>有机体名称+数字。

org = 'Acla'    
os.popen("""cat %s.fa | awk '/^>/{print ">%s" ++i; next}{print}'""" % (org, org)).read()

我想通过添加零来使所有这些行的长度相等,因此数字是5位数,或者字符串的总长度是10。

1 个答案:

答案 0 :(得分:1)

将print statement更改为

 /^>/{printf ">Acla%05d\n",++i ...