Question

I am trying to build a tool that translates DNA sequences and then compares them with each other to remove duplicates!

I use this script to read my fastq file:

from Bio import SeqIO

def sequence_cleaner(fastq_file, min_length=0, por_n=100):
   # Create our hash table to add the sequences
   sequences={}

   # Using the Biopython fastq parse we can read our fastq input
   for seq_record in SeqIO.parse(fastq_file, "fastq"):
       # Take the current sequence
       sequence = str(seq_record.seq).upper()
       # Check if the current sequence is according to the user parameters
       if (len(sequence) >= min_length and
           (float(sequence.count("N"))/float(len(sequence)))*100 <= por_n):
       # If the sequence passed in the test "is it clean?" and it isn't in the
       # hash table, the sequence and its id are going to be in the hash
           if sequence not in sequences:
               sequences[sequence] = seq_record.id
      # If it is already in the hash table, we're just gonna concatenate the ID
      # of the current sequence to another one that is already in the hash table
           else:
               sequences[sequence] += "_" + seq_record.id

           print sequence

       trans=translate( sequence )




   # Write the clean sequences

   # Create a file in the same directory where you ran this script
   output_file = open("clear_" + fastq_file, "w+")
   # Just read the hash table and write on the file as a fasta format
   for sequence in sequences:
           output_file.write("@" + sequences[sequence] +"\n" + sequence + "\n" + trans +"\n")

   output_file.close()

   print("\n YOUR SEQUENCES ARE CLEAN!!!\nPlease check clear_" + fastq_file + " on the same repository than " + rep + "\n")

And I use this to translate it into an amino acid sequence:

def translate( sequ ):
   """Return the translated protein from 'sequence' assuming +1 reading frame"""

   gencode = {
   'ATA':'Ile', 'ATC':'Ile', 'ATT':'Ile', 'ATG':'Met',
   'ACA':'Thr', 'ACC':'Thr', 'ACG':'Thr', 'ACT':'Thr',
   'AAC':'Asn', 'AAT':'Asn', 'AAA':'Lys', 'AAG':'Lys',
   'AGC':'Ser', 'AGT':'Ser', 'AGA':'Arg', 'AGG':'Arg',
   'CTA':'Leu', 'CTC':'Leu', 'CTG':'Leu', 'CTT':'Leu',
   'CCA':'Pro', 'CCC':'Pro', 'CCG':'Pro', 'CCT':'Pro',
   'CAC':'His', 'CAT':'His', 'CAA':'Gln', 'CAG':'Gln',
   'CGA':'Arg', 'CGC':'Arg', 'CGG':'Arg', 'CGT':'Arg',
   'GTA':'Val', 'GTC':'Val', 'GTG':'Val', 'GTT':'Val',
   'GCA':'Ala', 'GCC':'Ala', 'GCG':'Ala', 'GCT':'Ala',
   'GAC':'Asp', 'GAT':'Asp', 'GAA':'Glu', 'GAG':'Glu',
   'GGA':'Gly', 'GGC':'Gly', 'GGG':'Gly', 'GGT':'Gly',
   'TCA':'Ser', 'TCC':'Ser', 'TCG':'Ser', 'TCT':'Ser',
   'TTC':'Phe', 'TTT':'Phe', 'TTA':'Leu', 'TTG':'Leu',
   'TAC':'Tyr', 'TAT':'Tyr', 'TAA':'STOP', 'TAG':'STOP',
   'TGC':'Cys', 'TGT':'Cys', 'TGA':'STOP', 'TGG':'Trp'}

   return ''.join(gencode.get(sequ[3*i:3*i+3],'X') for i in range(len(sequ)//3))

The result is not what I expected:

@SRR797221.3
TCAGCCGCGCAGTAGTTAGCACAAGTAGTACGATACAAGAACACTATTTGTAAGTCTAAGGCATTGGCCGCTCGTCTGAGACTGCCAAGGCACACAGGGAGTAGNGNN
SerAlaAlaGlnValValProLeuSerSerValProAlaThrProThrProSerAsnAsnAlaAlaArgLeuArgLeuProArgHisThrGlyValGlu
@SRR797221.4
TCAGCCGCGCAGGTAGTTCCGTTATCATCAGTACCAGCAACTCCAACTCCATCCAACAATGCCGCTCGTCTGAGACTGCCAAGGCACACAGGAGTAGAG
SerAlaAlaGlnValValProLeuSerSerValProAlaThrProThrProSerAsnAsnAlaAlaArgLeuArgLeuProArgHisThrGlyValGlu
@SRR797221.2
TCAGCCGCGCAGGTTCTTGGTAACGGAACGCGCGTTAGACTTAAGACCAGTGAATGGAGCCACCATTGGCCGCTCGTCTGAGACTGCCCAAAGGGCACACAGGGGNGTAGNGN
SerAlaAlaGlnValValProLeuSerSerValProAlaThrProThrProSerAsnAsnAlaAlaArgLeuArgLeuProArgHisThrGlyValGlu
@SRR797221.1
TCAGCCGCGCAGGTAGATTAAGGATCAACGGTTCCTTGGCTCGCAAGTCAATTGGCCGCTCGTCTGAGACTGCCAAGGCACACAGGGAGTAGNG
SerAlaAlaGlnValValProLeuSerSerValProAlaThrProThrProSerAsnAsnAlaAlaArgLeuArgLeuProArgHisThrGlyValGlu

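First, as you can see, the sequence IDs are not ordered from 1 to 4 as they are in the original file. Second, the translation belonging to the 4th ID is repeated for the other three sequences!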

Answer 1

To answer your two questions:

"The sequence IDs are not ordered from 1 to 4 as in the original file"

You are using an unordered dictionary.

Regular Python dictionaries iterate over key/value pairs in arbitrary order.

https://docs.python.org/3.1/whatsnew/3.1.html

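You can sort the dictionary by value (see "Sort a Python dictionary by value" for suggestions), or use an ordered dictionary, collections.OrderedDict, which is described at the link above.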
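As a rough illustration of both suggestions, here is a minimal sketch using only the standard library. The records list is made up and stands in for what SeqIO.parse would yield, and the short sequences are invented for the example. Note that on Python 3.7 and later a plain dict also keeps insertion order, so OrderedDict mainly matters on older interpreters such as the Python 2 used in the question.

```python
from collections import OrderedDict

# Hypothetical (id, sequence) pairs standing in for the parsed fastq records
records = [
    ("SRR797221.1", "TCAGCC"),
    ("SRR797221.2", "TCAGGT"),
    ("SRR797221.3", "TCAGCC"),  # duplicate of the first sequence
    ("SRR797221.4", "TCAGAA"),
]

# Option 1: an OrderedDict remembers the order in which keys were first inserted
sequences = OrderedDict()
for seq_id, sequence in records:
    if sequence not in sequences:
        sequences[sequence] = seq_id
    else:
        sequences[sequence] += "_" + seq_id

ordered_ids = list(sequences.values())

# Option 2: sort the (sequence, id) pairs by the id string before writing.
# This is a plain string comparison, so ".10" would sort before ".2".
by_id = sorted(sequences.items(), key=lambda item: item[1])
sorted_ids = [seq_id for _, seq_id in by_id]

print(ordered_ids)
```

With the OrderedDict, the output loop in sequence_cleaner would visit the sequences in the order they were first read, without needing a separate sort.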

"It repeats the translation of the 4th ID for the other three sequences"

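You compute a translation for each sequence with trans=translate( sequence ), but you never store trans in a dictionary or list tied to that sequence's ID. The variable is overwritten on every pass through the loop, so when the file is written every entry gets the last value of trans. Try using a separate dictionary that stores the translated sequence together with the sequence ID.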
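One way this could look is sketched below. It is not a drop-in replacement for sequence_cleaner: the records list and the translations dictionary are names invented for the example, the codon table is cut down to a few entries, and keying the translations by sequence (rather than by ID) is just one possible choice.

```python
def translate(sequ):
    """Translate in the +1 frame; unknown codons become 'X'."""
    # Abbreviated codon table for illustration; the question has the full one
    gencode = {"TCA": "Ser", "GCC": "Ala", "GCG": "Ala",
               "CAG": "Gln", "GTA": "Val"}
    return "".join(gencode.get(sequ[3 * i:3 * i + 3], "X")
                   for i in range(len(sequ) // 3))

# Hypothetical (id, sequence) pairs standing in for the parsed fastq records
records = [("read1", "TCAGCC"), ("read2", "GCGCAGGTA"), ("read3", "TCAGCC")]

sequences = {}      # sequence -> concatenated ids, as in the question
translations = {}   # sequence -> its own translation

for seq_id, sequence in records:
    if sequence not in sequences:
        sequences[sequence] = seq_id
        # Store the translation next to the sequence it belongs to
        translations[sequence] = translate(sequence)
    else:
        sequences[sequence] += "_" + seq_id

# Build the output, looking up each sequence's own translation
lines = []
for sequence, seq_id in sequences.items():
    lines.append("@" + seq_id + "\n" + sequence + "\n"
                 + translations[sequence] + "\n")
output = "".join(lines)
print(output)
```

Because the translation is looked up per sequence at write time, each entry gets its own protein string instead of whatever trans held after the last loop iteration.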
