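How to find identical sequences in Python

Time: 2014-10-16 02:37:18

Tags: python sequences

I am new to Python and would like to know how to find identical sequences in a FASTA file. For example, here I have 4 sequence reads. How can I find the identical sequences and return their IDs? Thanks a lot!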

from Bio import SeqIO

# Read every record from the FASTA file and print its ID and sequence
records = list(SeqIO.parse("data/dna.txt", "fasta"))
for record in records:
    print(record.id, record.seq)


seq1 GAATGCATACTGCATCGATA
seq2 CATAAAACGTCTCCATCGCT
seq3 TGCCCAAGTTGTGAAGTGTC
seq4 TGCCCAAGTTGTGAAGTGTC

2 answers:

Answer 0 (score: 1)

You can use a defaultdict to compile a list of IDs for each sequence, like this:

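from Bio import SeqIO
from collections import defaultdict

records = list(SeqIO.parse("data/dna.txt", "fasta"))
# Map each sequence to the IDs of the records that carry it.
# str() gives a plain string key, so equal sequences always share one entry.
compilation = defaultdict(list)
for record in records:
    compilation[str(record.seq)].append(record.id)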
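
The `compilation` dictionary holds every sequence, including those that occur only once. To return just the IDs of identical sequences, it can be filtered for entries with more than one ID. A minimal sketch, assuming the records are given as plain `(id, sequence)` tuples instead of Biopython `SeqRecord` objects so that it runs without a FASTA file; the function name `find_duplicate_ids` is hypothetical:

```python
from collections import defaultdict


def find_duplicate_ids(records):
    """Return lists of IDs whose sequences are identical.

    `records` is an iterable of (id, sequence) pairs; this stands in for
    (record.id, str(record.seq)) taken from Biopython SeqRecord objects.
    """
    ids_by_sequence = defaultdict(list)
    for record_id, sequence in records:
        ids_by_sequence[sequence].append(record_id)
    # Keep only the sequences shared by more than one record.
    return [ids for ids in ids_by_sequence.values() if len(ids) > 1]


# The four reads from the question.
reads = [
    ("seq1", "GAATGCATACTGCATCGATA"),
    ("seq2", "CATAAAACGTCTCCATCGCT"),
    ("seq3", "TGCCCAAGTTGTGAAGTGTC"),
    ("seq4", "TGCCCAAGTTGTGAAGTGTC"),
]
duplicates = find_duplicate_ids(reads)
print(duplicates)  # [['seq3', 'seq4']]
```

With a real file, the pairs could be built as `((r.id, str(r.seq)) for r in SeqIO.parse("data/dna.txt", "fasta"))`.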

Answer 1 (score: 0)

The simplest way is to use a dict:

from Bio import SeqIO

records = list(SeqIO.parse("data/dna.txt", "fasta"))
# Group the records by sequence, using the plain string as the key
d = dict()
for record in records:
    seq = str(record.seq)
    if seq in d:
        d[seq].append(record)
    else:
        d[seq] = [record]
# Print each sequence, the number of records sharing it, and their IDs
for seq, record_set in d.items():
    print(seq + ': (' + str(len(record_set)) + ')')
    for record in record_set:
        print('    ' + record.id)

This prints something like:

GAATGCATACTGCATCGATA: (1)
    seq1
CATAAAACGTCTCCATCGCT: (1)
    seq2
TGCCCAAGTTGTGAAGTGTC: (2)
    seq3
    seq4
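
The if/else in the grouping loop can also be written with `dict.setdefault`, which stores an empty list the first time a key is seen. A small sketch that rebuilds the report shown above from in-memory `(id, sequence)` pairs; `format_report` is a hypothetical helper, and the pairs stand in for Biopython records:

```python
def format_report(records):
    """Group IDs by sequence and return the report as one string."""
    groups = {}
    for record_id, sequence in records:
        # setdefault returns the existing list or stores a new empty one.
        groups.setdefault(sequence, []).append(record_id)
    lines = []
    # Dicts keep insertion order (Python 3.7+), so sequences appear
    # in the order they were first read.
    for sequence, ids in groups.items():
        lines.append("{}: ({})".format(sequence, len(ids)))
        for record_id in ids:
            lines.append("    " + record_id)
    return "\n".join(lines)


reads = [
    ("seq1", "GAATGCATACTGCATCGATA"),
    ("seq2", "CATAAAACGTCTCCATCGCT"),
    ("seq3", "TGCCCAAGTTGTGAAGTGTC"),
    ("seq4", "TGCCCAAGTTGTGAAGTGTC"),
]
report = format_report(reads)
print(report)
```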