Change seq names in a fasta file using a data frame

Time: 2018-05-04 08:45:20

Tags: python pandas fasta

I have run into a problem, which I explain below.

I have a fasta file:

>seqA
AAAAATTTGG
>seqB
ATTGGGCCG
>seqC
ATTGGCC
>seqD
ATTGGACAG

and a data frame:

seq name      New name seq
seqB            BOBO
seqC            JOHN

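I want to change the sequence IDs in the fasta file: whenever a sequence name in the fasta file also appears in the "seq name" column of my data frame, it should be replaced by the corresponding "New name seq" value. This would give:

New fasta file: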

>seqA
AAAAATTTGG
>BOBO
ATTGGGCCG
>JOHN
ATTGGCC
>seqD
ATTGGACAG

Thanks a lot

Edit: I used this script:

blast=pd.read_table("matches_Busco_0035_0042.m8",header=None)
blast.columns = ["qseqid", "Busco_ID", "pident", "length", "mismatch", "gapopen","qstart", "qend", "sstart", "send", "evalue", "bitscore"]

repl = blast[blast.pident > 95]

print(repl)

#substitution dataframe

newfile = []
count = 0

for rec in SeqIO.parse("concatenate_0035_0042_aa2.fa", "fasta"):
    #get corresponding value for record ID from dataframe
    x = repl.loc[repl.seq == rec.id, "Busco_ID"]
    #change record, if not empty
    if x.any():
        rec.name = rec.description = rec.id = x.iloc[0]
        count += 1
    #append record to list
    newfile.append(rec)

#write list into new fasta file
SeqIO.write(newfile, "changedtest.faa", "fasta")
#tell us, how hard you had to work for us
print("I changed {} entries!".format(count))

I get the following error:

Traceback (most recent call last):
  File "Get_busco_blast.py", line 74, in <module>
    x = repl.loc[repl.seq == rec.id, "Busco_ID"]
  File "/usr/local/lib/python3.6/site-packages/pandas/core/generic.py", line 3614, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'seq'
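
The traceback points at `repl.seq`: the column list assigned to `blast.columns` contains no column called `seq`, so pandas raises `AttributeError`. A minimal sketch of the lookup with a column that does exist, assuming `qseqid` is the column holding the fasta record IDs (that depends on how the BLAST search was run). The toy table and the helper `lookup_new_id` are hypothetical and only serve as illustration:

```python
import pandas as pd

# Toy stand-in for the filtered BLAST table; the values are made up.
repl = pd.DataFrame({
    "qseqid": ["seqB", "seqC"],
    "Busco_ID": ["BOBO", "JOHN"],
    "pident": [98.5, 97.0],
})

# Attribute access only works for existing column names, so repl.seq fails.
try:
    repl.seq
    has_seq = True
except AttributeError:
    has_seq = False


def lookup_new_id(table, record_id):
    """Return the replacement ID for record_id, or None if there is no match."""
    hits = table.loc[table["qseqid"] == record_id, "Busco_ID"]
    return None if hits.empty else hits.iloc[0]


print(has_seq)
print(lookup_new_id(repl, "seqB"))
print(lookup_new_id(repl, "seqA"))
```

Testing `hits.empty` instead of `hits.any()` keeps the check independent of what the matched strings contain.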

2 answers:

Answer 0: (score: 1)

If you have Biopython installed, you can use SeqIO to read/write fasta files:

import pandas as pd
from Bio import SeqIO

#substitution dataframe: old IDs in "seq", new name prefixes in "newseq"
repl = pd.DataFrame([["seqB_3652_i36", "Bob"], ["seqC_123_6XXX1", "Patrick"]], columns = ["seq", "newseq"])

newfile = []
count = 0

for rec in SeqIO.parse("test.faa", "fasta"):
    #get corresponding value for record ID from dataframe
    #repl["seq"] and "newseq" are the pandas columns with the old and new sequence names, respectively
    x = repl.loc[repl["seq"] == rec.id, "newseq"]
    #change record, if a match was found
    if not x.empty:
        #append old identifier number (everything from the first "_") to the new id name
        rec.name = rec.description = rec.id = x.iloc[0] + rec.id[rec.id.index("_"):]
        count += 1
    #append record to list
    newfile.append(rec)

#write list into new fasta file
SeqIO.write(newfile, "changedtest.faa", "fasta")
#tell us, how hard you had to work for us
print("I changed {} entries!".format(count))

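Note that this script does not check for multiple entries in the replacement table. It simply takes the first matching element, or changes nothing if the record ID is not in the data frame.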
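
The caveat about multiple entries can be checked before the loop runs. A minimal sketch, assuming the same `seq` and `newseq` column names as in the script above. `build_rename_map` is a hypothetical helper, not part of pandas or Biopython:

```python
import pandas as pd


def build_rename_map(table, old_col="seq", new_col="newseq"):
    """Return {old_id: new_id}; raise if one old ID maps to several new names."""
    # Identical rows repeated in the table are harmless, so drop them first.
    pairs = table[[old_col, new_col]].drop_duplicates()
    # Any old ID still occurring more than once has conflicting new names.
    dupes = pairs.loc[pairs[old_col].duplicated(keep=False), old_col].unique()
    if len(dupes) > 0:
        raise ValueError(
            "ambiguous replacements for: " + ", ".join(sorted(map(str, dupes)))
        )
    return dict(zip(pairs[old_col], pairs[new_col]))


repl = pd.DataFrame({"seq": ["seqB", "seqC"], "newseq": ["BOBO", "JOHN"]})
rename_map = build_rename_map(repl)
print(rename_map)
```

Raising an error is only one possible policy. Keeping the first or the best-scoring match would be equally reasonable, depending on the data.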

Answer 1: (score: 1)

This is easier to do with something like BioPython.

First create a dictionary

names = dict(zip(df['seq name'], df['New name seq']))  #maps old name -> new name

Now iterate

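from Bio import SeqIO
outs = []
for record in SeqIO.parse("orig.fasta", "fasta"):
    #keep the old ID when it has no entry in names
    record.id = names.get(record.id, record.id)
    #clear the description so the old ID is not written after the new one
    #(this also drops any extra text from the original header line)
    record.description = ""
    outs.append(record)
SeqIO.write(outs, "new.fasta", "fasta")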
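
For comparison, the same dictionary lookup can be sketched without Biopython, using only the standard library. This assumes a simple fasta layout in which the ID is the header text up to the first space. `rename_fasta_headers` is a hypothetical helper written for this illustration, and the sample data is the file from the question:

```python
def rename_fasta_headers(lines, names):
    """Yield fasta lines, replacing the ID of every header found in names."""
    for line in lines:
        if line.startswith(">"):
            header = line[1:].rstrip("\n")
            # The ID is the header text up to the first space, if any.
            seq_id, _, rest = header.partition(" ")
            new_id = names.get(seq_id, seq_id)
            line = ">" + new_id + (" " + rest if rest else "") + "\n"
        yield line


fasta = ">seqA\nAAAAATTTGG\n>seqB\nATTGGGCCG\n>seqC\nATTGGCC\n>seqD\nATTGGACAG\n"
names = {"seqB": "BOBO", "seqC": "JOHN"}

result = "".join(rename_fasta_headers(fasta.splitlines(keepends=True), names))
print(result)
```

Because the function works line by line on any iterable, it can also be fed an open file handle, so large files do not have to be loaded into memory at once.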