Extracting genes from a FASTA file based on sequence with Biopython

Asked: 2017-09-12 11:27:08

Tags: python biopython

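I have a FASTA file that I am parsing. The file contains several sequences of the same gene from different bacterial strains. What I want the script to do is check whether each sequence is identical to or different from a reference sequence. With that information, I want to generate a new file that contains only one record for each distinct sequence.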

from Bio import SeqIO

def checking_sequences():
    gene_records = list(SeqIO.parse('/files/gene_A.fasta', 'fasta'))
    # The last record in the file is used as the reference
    ref_id = gene_records[-1].id
    ref_seq = gene_records[-1].seq
    #print(gene_records[-1].description)
    output_handle = open('//files/' + 'corrected_gene_1', 'a')
    print(len(gene_records))
    count = 0
    dif_count = 0
    reference_list = []

    for gene_record in gene_records:
        #count+=1
        if len(gene_record.seq) == len(ref_seq):
            print('Found all lengths are equal')
        else:
            print('Found %s sequence with different lengths' % (gene_record.description))

        ###checking sequence equality
        if gene_record.seq == ref_seq:
            count += 1
            gene_record.id = gene_record.id + '_0'
            reference_list.append(gene_record.seq)
            ref_count = reference_list.count(gene_record.seq)
            print('There are %i sequences equal to the reference sequence' % (count))
        else:
            dif_count += 1
            reference_list.append(gene_record.seq)
            seq_count = reference_list.count(gene_record.seq)
            gene_record.id = gene_record.id + '_' + str(dif_count)
            print('Found %i different from ref_seq' % (seq_count))
            print('xxxxxxxxxxx')

        # Every record is written here, duplicates included
        SeqIO.write(gene_record, output_handle, 'fasta')

    output_handle.close()

checking_sequences()

Some clarification:

    original file              desired output
    >gene_1 strainIDx          >gen1_strainIDx
    seqA                       seqA
    >gene_1 strainIDy          >gene_1 strainIDy
    seqB                       seqB
    >gene_1 strainIDz
    seqA

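I don't mind what the IDs are; I just want one record for each distinct sequence. I tried using "break", but I did not get the output I wanted. Any help would be appreciated.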

1 Answer:

Answer 0 (score: 0)

  

Comment: I have never heard of hash, so I don't know what it does or doesn't do.

Reference:

  

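Built-in function hash(object)
  Return the hash value of the object (if it has one). Hash values are integers. They are used to quickly compare dictionary keys during a dictionary lookup.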
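
To make the quoted definition concrete, here is a minimal sketch. The sequence strings are made-up examples, not taken from the question's file. Equal strings always produce equal hashes within one interpreter run, which is what lets a set or dictionary test membership quickly.

```python
# Made-up example sequences; not from the question's FASTA file.
seq_a = "ATGGCGTAA"
seq_a_copy = "ATGG" + "CGTAA"   # same content, built separately
seq_b = "ATGGCGTGA"

# Equal strings have equal hashes (within one interpreter run).
same_hash = hash(seq_a) == hash(seq_a_copy)

# A set uses those hashes internally to look items up quickly.
seen = {seq_a}
already_seen = seq_a_copy in seen
new_sequence = seq_b not in seen

print(same_hash, already_seen, new_sequence)
```

Note that in Python 3 string hashes are randomized per process by default, so the integer values should not be stored and compared between runs. Two different strings can also, in principle, share a hash value.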

     

SO QA Built in python hash() function

  

Question: I want to generate a new file, but with only one of each sequence.

Use a lookup table, for example:

from Bio import SeqIO

lookup = set()

with open('/files/' + 'corrected_gene_1', "w") as handle:
    for record in SeqIO.parse('/files/gene_A.fasta', "fasta"):
        seq_hash = hash(str(record.seq))

        if seq_hash not in lookup:
            # Not in lookup yet: remember the hash and save the record
            lookup.add(seq_hash)
            SeqIO.write(record, handle, "fasta")
        else:
            # Already saved, skip
            pass
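
A possible variation on the lookup idea, offered only as a sketch: storing the sequence string itself in the set, rather than its hash value, avoids the unlikely case where two different sequences share a hash, since a set already hashes its members internally. So that it runs without any input files, the example uses a small hand-written FASTA reader over an in-memory string instead of SeqIO. The names `read_fasta` and `unique_records` and the sample data are hypothetical and chosen for illustration.

```python
def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def unique_records(records):
    """Keep the first record seen for each distinct sequence."""
    seen = set()
    kept = []
    for header, seq in records:
        if seq not in seen:
            seen.add(seq)
            kept.append((header, seq))
    return kept


# Made-up data mirroring the layout in the question.
sample = """>gene_1 strainIDx
ATGGCGTAA
>gene_1 strainIDy
ATGGCGTGA
>gene_1 strainIDz
ATGGCGTAA
"""

kept = unique_records(read_fasta(sample))
for header, seq in kept:
    print(">%s\n%s" % (header, seq))
```

With Biopython, the same idea would amount to keying the set on `str(record.seq)` inside the `SeqIO.parse` loop and writing only records whose sequence has not been seen before.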