I have a FASTA file like this, test_fasta.fasta:
>XXKHH_1
AAAAATTTCTGGGCCCC
>YYYXXKHH_1
TTAAAAATTTCTGGGCCCCGGGAAAAAA
>TTDTT_11
TTTGGGAATTAAACCCT
>ID_2SS
TTTGGGAATTAAACCCT
>YKHH_1
TTAAAAATTTCTGGGCCCCGGGAAAAAA
>YKHSH_1S
TTAAAAATTTCTGGGCCCCGGGAAAAAA
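I want to count how many times each sequence is duplicated, append that total to the ID of each unique sequence, and write the records sorted from the highest count to the lowest, so that the result looks like this: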
>YYYXXKHH_1_counts3
TTAAAAATTTCTGGGCCCCGGGAAAAAA
>TTDTT_11_counts2
TTTGGGAATTAAACCCT
>XXKHH_1_counts1
AAAAATTTCTGGGCCCC
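I have this code, which finds the duplicated sequences and joins their IDs together. Instead of joining the IDs, I want to append the duplicate count to the ID, as shown above.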
from Bio import SeqIO
from collections import defaultdict

dedup_records = defaultdict(list)
for record in SeqIO.parse("test_fasta.fasta", "fasta"):
    # Use the sequence as the key and then have a list of id's as the value
    dedup_records[str(record.seq)].append(record.id)

with open("Output.fasta", 'w') as output:
    for seq, ids in dedup_records.items():
        # Join the ids and write them out as the fasta
        output.write(">{}\n".format('|'.join(ids)))
        output.write(seq + "\n")
Answer 0 (score: 1)
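Since the ids list in the output loop already holds the ID of every duplicate record, you can simply write the first ID (which is apparently what your expected output wants) followed by the length of the ids list. Sorting the dictionary items by that length in descending order gives the largest-to-smallest ordering: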
for seq, ids in sorted(dedup_records.items(), key=lambda t: len(t[1]), reverse=True):
    output.write(">{}_counts{}\n".format(ids[0], len(ids)))
    output.write(seq + "\n")
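For anyone who wants to try the whole thing end to end, here is a minimal sketch of the same approach. To keep it runnable without Biopython it swaps SeqIO.parse for a small hand-written parser; parse_fasta and dedup_with_counts are hypothetical helper names, not part of any library, and the parser assumes simple, well-formed FASTA input. The sample data is the file from the question.

```python
from collections import defaultdict
from io import StringIO


def parse_fasta(handle):
    """Yield (record_id, sequence) pairs from a FASTA-formatted text handle.

    Hypothetical stand-in for SeqIO.parse; assumes well-formed input.
    """
    record_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if record_id is not None:
                yield record_id, "".join(chunks)
            # Keep only the first whitespace-delimited token as the ID
            header = line[1:].split()
            record_id, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if record_id is not None:
        yield record_id, "".join(chunks)


def dedup_with_counts(handle):
    """Return FASTA text with one record per unique sequence, most frequent first."""
    ids_by_seq = defaultdict(list)
    for record_id, seq in parse_fasta(handle):
        ids_by_seq[seq].append(record_id)
    # sorted() is stable, so sequences with equal counts keep first-seen order
    ranked = sorted(ids_by_seq.items(), key=lambda item: len(item[1]), reverse=True)
    return "".join(
        ">{}_counts{}\n{}\n".format(ids[0], len(ids), seq) for seq, ids in ranked
    )


sample = """>XXKHH_1
AAAAATTTCTGGGCCCC
>YYYXXKHH_1
TTAAAAATTTCTGGGCCCCGGGAAAAAA
>TTDTT_11
TTTGGGAATTAAACCCT
>ID_2SS
TTTGGGAATTAAACCCT
>YKHH_1
TTAAAAATTTCTGGGCCCCGGGAAAAAA
>YKHSH_1S
TTAAAAATTTCTGGGCCCCGGGAAAAAA
"""

result = dedup_with_counts(StringIO(sample))
print(result, end="")
```

With the real file you would pass open("test_fasta.fasta") instead of the StringIO object and write the returned string to Output.fasta. Note that only the first ID of each group is kept, so the other IDs are dropped from the output.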