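I want to change part of my fasta headers using a list from a .tsv file.
I'm not a bioinformatician, but a microbiologist with beginner skills in bash and python. Thanks.
Example:
Header:
Prevalence_Sequence_ID:1 | ARO:3003072 | RES:mphL | Protein Homolog Model
with
.tsv
ARO:3003072 mphL mphL is a chromosomally-encoded macrolide phosphotransferases that inactivate 14- and 15-membered macrolides such as erythromycin, clarithromycin, azithromycin.
to
new header
Prevalence_Sequence_ID:1 | mphL mphL is a chromosomally-encoded macrolide phosphotransferases that inactivate 14- and 15-membered macrolides such as erythromycin, clarithromycin, azithromycin. | RES:mphL | Protein Homolog Model
If an ARO from a fasta header is not given in the .tsv, it should simply be ignored.
Example fasta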
>Prevalence_Sequence_ID:1|ARO:3003072|RES:mphL|Protein Homolog Model
MTTLKVKQLANKKGLNILEDS
>gb|ARO:3004145|RES:AxyZ|Achromobacter_insuavis_AXX-A_
MARKTKEESQRTRDRILDAAEHVFLSKG
>Prevalence_Sequence_ID:31298|ARO:3000777|RES:adeF|Protein Homolog Model
MDFSRFFIDRPIFAAVLSILIFI
Example .tsv
ARO:3003072 mphL mphL is a chromosomally-encoded macrolide phosphotransferases that inactivate 14- and 15-membered macrolides such as erythromycin, clarithromycin, azithromycin.
ARO:3004145 AxyZ AxyZ is a transcriptional regulator of the AxyXY-OprZ efflux pump system.
ARO:3000777 adeF AdeF is the membrane fusion protein of the multidrug efflux complex AdeFGH.
Answer 0 (score: 0)
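If the output does not need to keep the original order, we can sort the fasta by its second field using | as the separator, replace the first space in the .tsv with |, sort that by its first field, and then join the two with a suitable output format. Note that join only prints lines that pair up, so this emits the rewritten headers only, and -o1.1,2.2,1.3 keeps just the first header field, the .tsv text and the third header field: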
cat <<EOF >fasta
>Prevalence_Sequence_ID:1|ARO:3003072|RES:mphL|Protein Homolog Model
MTTLKVKQLANKKGLNILEDS
>gb|ARO:3004145|RES:AxyZ|Achromobacter_insuavis_AXX-A_
MARKTKEESQRTRDRILDAAEHVFLSKG
>Prevalence_Sequence_ID:31298|ARO:3000777|RES:adeF|Protein Homolog Model
MDFSRFFIDRPIFAAVLSILIFI
EOF
cat <<EOF >tsv
ARO:3003072 mphL mphL is a chromosomally-encoded macrolide phosphotransferases that inactivate 14- and 15-membered macrolides such as erythromycin, clarithromycin, azithromycin.
ARO:3004145 AxyZ AxyZ is a transcriptional regulator of the AxyXY-OprZ efflux pump system.
ARO:3000777 adeF AdeF is the membrane fusion protein of the multidrug efflux complex AdeFGH.
EOF
join -t'|' -12 -21 -o1.1,2.2,1.3 <(
<fasta sort -t'|' -k2) <(
<tsv sed 's/ /|/' | sort -t'|' -k1)
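If you need the output in the same order as the fasta, we can number the lines with nl -w1, do the join, sort the output numerically by those line numbers, and then remove the numbers:
join -t'|' -12 -21 -o1.1,2.2,1.3 <(
<fasta nl -w1 | sort -t'|' -k2) <(
<tsv sed 's/ /|/' | sort -t'|' -k1) |
sort -t $'\t' -n -k1,1 | cut -f2-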
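The number, sort, join, re-sort and strip pattern used above can also be sketched in Python, which may be easier to follow for a beginner. This is only a minimal sketch under assumptions: the names aro_of and join_keep_order are hypothetical, the headers and .tsv rows are passed as in-memory lists, and the .tsv key is taken to be the text before the first whitespace character. Unlike the join command, it keeps every header field. In Python a dict lookup makes the sorting unnecessary; it is kept here only to mirror the shell pipeline step by step.

```python
def aro_of(header):
    """Return the second |-separated field of a header, or "" if there is none."""
    fields = header.split("|")
    return fields[1] if len(fields) > 1 else ""


def join_keep_order(headers, tsv_lines):
    """Mirror `nl | sort | join | sort -n | cut` on in-memory lists."""
    # Key the tsv rows by their first column (text before the first whitespace).
    table = {}
    for line in tsv_lines:
        parts = line.strip().split(None, 1)
        if len(parts) == 2:
            table[parts[0]] = parts[1]

    # "nl": remember the original position of every header.
    numbered = list(enumerate(headers))

    # "sort": order by the join field, as join would require.
    numbered.sort(key=lambda item: aro_of(item[1]))

    # "join": keep only headers whose ARO is in the table, swapping in the tsv text.
    joined = []
    for position, header in numbered:
        key = aro_of(header)
        if key in table:
            fields = header.split("|")
            fields[1] = table[key]
            joined.append((position, "|".join(fields)))

    # "sort -n | cut": restore the original order and drop the numbers.
    joined.sort(key=lambda item: item[0])
    return [header for _, header in joined]
```

Like the shell version, this returns only the headers that were matched; sequence lines and unmatched headers are left out.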
Answer 1 (score: 0)
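If you use awk, you can follow these steps:
1. Read the .tsv file and store all of its values in an array indexed by the first column.
2. For each line of the fasta file, check whether it is a header (starts with >); if so, extract the key (the ARO identifier) from it and replace it with the value stored in the array.
These steps can also be done in Python, but they are easy to do in awk.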
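The steps above can be sketched in Python as follows. This is a minimal sketch and not the awk command from this answer: the function names load_tsv and relabel are hypothetical, the .tsv is assumed to be tab-separated, and the key is assumed to be the second |-separated field of each header. Headers whose ARO is not in the .tsv are passed through unchanged.

```python
import io


def load_tsv(handle):
    """Map the first tsv column to the remaining columns, joined by a space."""
    table = {}
    for line in handle:
        columns = line.rstrip("\n").split("\t")
        if len(columns) > 1:
            table[columns[0]] = " ".join(columns[1:])
    return table


def relabel(handle, table):
    """Yield fasta lines, replacing the ARO field of each header when it is known."""
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            fields = line.split("|")
            # Assumption: the key is the second |-separated field of the header.
            if len(fields) > 1 and fields[1] in table:
                fields[1] = table[fields[1]]
            line = "|".join(fields)
        yield line


# Usage with small in-memory stand-ins for the real files.
tsv = io.StringIO("ARO:3003072\tmphL\tmphL is a macrolide phosphotransferase.\n")
fasta = io.StringIO(
    ">Prevalence_Sequence_ID:1|ARO:3003072|RES:mphL|Protein Homolog Model\n"
    "MTTLKVKQLANKKGLNILEDS\n"
    ">gb|ARO:3004145|RES:AxyZ|Achromobacter_insuavis_AXX-A_\n"
    "MARKTKEESQRTRDRILDAAEHVFLSKG\n"
)
for out_line in relabel(fasta, load_tsv(tsv)):
    print(out_line)
```

With real data, the two io.StringIO objects would be replaced by open("example.tsv") and open("example.fasta").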
Answer 2 (score: 0)
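import pandas as pd
from Bio import SeqIO

# read the tsv; the columns are the ARO id, the gene name and the description
tsvdata = pd.read_csv('example.tsv', sep='\t', header=None, names=['aro', '_', 'description'])
for record in SeqIO.parse("example.fasta", "fasta"):
    # record.description holds the full header line without the leading '>'
    fasta_record = record.description.split('|')
    key = fasta_record[1]
    match = tsvdata[tsvdata['aro'] == key]['description'].values
    if len(match) > 0:  # leave the header unchanged if the ARO is not in the tsv
        fasta_record[1] = match[0]
    print('>' + '|'.join(fasta_record))
    print(record.seq)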
Answer 3 (score: 0)
I saved your example fasta and tsv data to example.fasta and example.tsv. Here are the contents of the input files -
$ cat example.fasta
>Prevalence_Sequence_ID:1|ARO:3003072|RES:mphL|Protein Homolog Model
MTTLKVKQLANKKGLNILEDS
>gb|ARO:3004145|RES:AxyZ|Achromobacter_insuavis_AXX-A_
MARKTKEESQRTRDRILDAAEHVFLSKG
>Prevalence_Sequence_ID:31298|ARO:3000777|RES:adeF|Protein Homolog Model
MDFSRFFIDRPIFAAVLSILIFI
$ cat example.tsv
ARO:3003072 mphL mphL is a chromosomally-encoded macrolide phosphotransferases that inactivate 14- and 15-membered macrolides such as erythromycin, clarithromycin, azithromycin.
ARO:3004145 AxyZ AxyZ is a transcriptional regulator of the AxyXY-OprZ efflux pump system.
ARO:3000777 adeF AdeF is the membrane fusion protein of the multidrug efflux complex AdeFGH.
# import Biopython; Biopython needs to be installed in your environment/machine
from Bio.SeqIO.FastaIO import SimpleFastaParser as sfp

# read the tsv data into a dict: ARO id -> remaining columns joined by a space
with open("example.tsv") as tsvdata:
    tsv_data = {line.strip().split("\t")[0]: " ".join(line.strip().split("\t")[1:])
                for line in tsvdata}

# read input fasta file contents and write to a separate file in real time
with open("example_out.fasta", "w") as outfasta:
    with open("example.fasta") as infasta:
        for header, seq in sfp(infasta):
            aro = header.strip().split("|")[1]  # get ARO from header
            # look up ARO in dict and replace if found, otherwise leave it unchanged
            header = header.replace(aro, tsv_data.get(aro, aro))
            outfasta.write(">{0}\n{1}\n".format(header, seq))
Here are the contents of the output file -
$ cat example_out.fasta
>Prevalence_Sequence_ID:1|mphL mphL is a chromosomally-encoded macrolide phosphotransferases that inactivate 14- and 15-membered macrolides such as erythromycin, clarithromycin, azithromycin.|RES:mphL|Protein Homolog Model
MTTLKVKQLANKKGLNILEDS
>gb|AxyZ AxyZ is a transcriptional regulator of the AxyXY-OprZ efflux pump system.|RES:AxyZ|Achromobacter_insuavis_AXX-A_
MARKTKEESQRTRDRILDAAEHVFLSKG
>Prevalence_Sequence_ID:31298|adeF AdeF is the membrane fusion protein of the multidrug efflux complex AdeFGH.|RES:adeF|Protein Homolog Model
MDFSRFFIDRPIFAAVLSILIFI
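Because headers whose ARO is missing from the .tsv are left unchanged without any message, it can help to list those AROs before or after relabelling. The following is a minimal sketch, not part of the answer above: the function name missing_aros is hypothetical, and it assumes the ARO is the second |-separated field of each header.

```python
def missing_aros(fasta_lines, known_aros):
    """Return the AROs found in fasta headers but absent from known_aros.

    Each ARO is reported once, in the order it first appears in the fasta.
    """
    missing = []
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        fields = line.strip().split("|")
        # Assumption: the ARO is the second |-separated field of the header.
        if len(fields) > 1 and fields[1] not in known_aros and fields[1] not in missing:
            missing.append(fields[1])
    return missing
```

For example, missing_aros(open("example.fasta"), tsv_data) could be called with the tsv_data dict built above; an empty list would mean every header had a match.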