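I am trying to build a tool that translates DNA sequences and then compares them with each other to remove duplicates.
I use this script to read my fastq file: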
from Bio import SeqIO

def sequence_cleaner(fastq_file, min_length=0, por_n=100):
    # Create our hash table to add the sequences
    sequences = {}
    # Using the Biopython fastq parser we can read our fastq input
    for seq_record in SeqIO.parse(fastq_file, "fastq"):
        # Take the current sequence
        sequence = str(seq_record.seq).upper()
        # Check if the current sequence meets the user parameters
        if (len(sequence) >= min_length and
                (float(sequence.count("N")) / float(len(sequence))) * 100 <= por_n):
            # If the sequence passed the "is it clean?" test and it isn't in the
            # hash table, the sequence and its id are added to the hash table
            if sequence not in sequences:
                sequences[sequence] = seq_record.id
            # If it is already in the hash table, we just concatenate the ID
            # of the current sequence to the one that is already in the hash table
            else:
                sequences[sequence] += "_" + seq_record.id
        print(sequence)
        trans = translate(sequence)
    # Write the clean sequences
    # Create a file in the same directory where you ran this script
    output_file = open("clear_" + fastq_file, "w+")
    # Just read the hash table and write each entry to the file
    for sequence in sequences:
        output_file.write("@" + sequences[sequence] + "\n" + sequence + "\n" + trans + "\n")
    output_file.close()
    print("\n YOUR SEQUENCES ARE CLEAN!!!\nPlease check clear_" + fastq_file + " on the same repository than " + rep + "\n")
I use this to translate it into an amino acid sequence:
def translate(sequ):
    """Return the translated protein from 'sequ' assuming +1 reading frame"""
    gencode = {
        'ATA':'Ile', 'ATC':'Ile', 'ATT':'Ile', 'ATG':'Met',
        'ACA':'Thr', 'ACC':'Thr', 'ACG':'Thr', 'ACT':'Thr',
        'AAC':'Asn', 'AAT':'Asn', 'AAA':'Lys', 'AAG':'Lys',
        'AGC':'Ser', 'AGT':'Ser', 'AGA':'Arg', 'AGG':'Arg',
        'CTA':'Leu', 'CTC':'Leu', 'CTG':'Leu', 'CTT':'Leu',
        'CCA':'Pro', 'CCC':'Pro', 'CCG':'Pro', 'CCT':'Pro',
        'CAC':'His', 'CAT':'His', 'CAA':'Gln', 'CAG':'Gln',
        'CGA':'Arg', 'CGC':'Arg', 'CGG':'Arg', 'CGT':'Arg',
        'GTA':'Val', 'GTC':'Val', 'GTG':'Val', 'GTT':'Val',
        'GCA':'Ala', 'GCC':'Ala', 'GCG':'Ala', 'GCT':'Ala',
        'GAC':'Asp', 'GAT':'Asp', 'GAA':'Glu', 'GAG':'Glu',
        'GGA':'Gly', 'GGC':'Gly', 'GGG':'Gly', 'GGT':'Gly',
        'TCA':'Ser', 'TCC':'Ser', 'TCG':'Ser', 'TCT':'Ser',
        'TTC':'Phe', 'TTT':'Phe', 'TTA':'Leu', 'TTG':'Leu',
        'TAC':'Tyr', 'TAT':'Tyr', 'TAA':'STOP', 'TAG':'STOP',
        'TGC':'Cys', 'TGT':'Cys', 'TGA':'STOP', 'TGG':'Trp'}
    return ''.join(gencode.get(sequ[3*i:3*i+3], 'X') for i in range(len(sequ)//3))
The result is not what I expected:
@SRR797221.3
TCAGCCGCGCAGTAGTTAGCACAAGTAGTACGATACAAGAACACTATTTGTAAGTCTAAGGCATTGGCCGCTCGTCTGAGACTGCCAAGGCACACAGGGAGTAGNGNN
SerAlaAlaGlnValValProLeuSerSerValProAlaThrProThrProSerAsnAsnAlaAlaArgLeuArgLeuProArgHisThrGlyValGlu
@SRR797221.4
TCAGCCGCGCAGGTAGTTCCGTTATCATCAGTACCAGCAACTCCAACTCCATCCAACAATGCCGCTCGTCTGAGACTGCCAAGGCACACAGGAGTAGAG
SerAlaAlaGlnValValProLeuSerSerValProAlaThrProThrProSerAsnAsnAlaAlaArgLeuArgLeuProArgHisThrGlyValGlu
@SRR797221.2
TCAGCCGCGCAGGTTCTTGGTAACGGAACGCGCGTTAGACTTAAGACCAGTGAATGGAGCCACCATTGGCCGCTCGTCTGAGACTGCCCAAAGGGCACACAGGGGNGTAGNGN
SerAlaAlaGlnValValProLeuSerSerValProAlaThrProThrProSerAsnAsnAlaAlaArgLeuArgLeuProArgHisThrGlyValGlu
@SRR797221.1
TCAGCCGCGCAGGTAGATTAAGGATCAACGGTTCCTTGGCTCGCAAGTCAATTGGCCGCTCGTCTGAGACTGCCAAGGCACACAGGGAGTAGNG
SerAlaAlaGlnValValProLeuSerSerValProAlaThrProThrProSerAsnAsnAlaAlaArgLeuArgLeuProArgHisThrGlyValGlu
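First, you can see that the sequence IDs are not ordered from 1 to 4 as in the original file. Second, the translation of the 4th ID is repeated for the other three sequences!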
Answer 0 (score: 0)
To answer your two questions:
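"Regular Python dictionaries iterate over key/value pairs in arbitrary order."
https://docs.python.org/3.1/whatsnew/3.1.html
That is why your IDs do not come out in the order 1 to 4. (Since Python 3.7 a plain dict preserves insertion order, but older versions, including Python 2, do not.) You can sort the dictionary by value (see "Sort a Python dictionary by value" for suggestions), or use an ordered dictionary such as collections.OrderedDict, which is described at the link above.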
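A minimal sketch of both options, assuming the reads are already available as (sequence, ID) pairs. The `records` list and the `dedupe_in_order` helper are hypothetical and only serve to illustrate the idea; they are not part of the original script.

```python
from collections import OrderedDict

# Hypothetical (sequence, read ID) pairs, in the order they appear in the file.
records = [
    ("ACGT", "SRR797221.1"),
    ("TTGA", "SRR797221.2"),
    ("GGCC", "SRR797221.3"),
    ("ACGT", "SRR797221.4"),  # duplicate of the first sequence
]

def dedupe_in_order(records):
    """Collapse duplicate sequences while remembering first-seen order."""
    sequences = OrderedDict()
    for sequence, read_id in records:
        if sequence not in sequences:
            sequences[sequence] = read_id
        else:
            sequences[sequence] += "_" + read_id
    return sequences

sequences = dedupe_in_order(records)

# Option 1: iterate in insertion order (guaranteed by OrderedDict).
ordered_ids = list(sequences.values())

# Option 2: sort the (sequence, ID) items by their value, i.e. the ID string.
# Note this is a plain string sort, so "...10" would sort before "...2".
sorted_by_id = sorted(sequences.items(), key=lambda item: item[1])
```

In the original function, replacing `sequences={}` with `sequences = OrderedDict()` should be enough to keep the file order when writing the output.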
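You compute the translation for every sequence with trans=translate( sequence ), but you never store trans in a dictionary or list that is tied to your ID. The variable is overwritten on every iteration, so when the output loop runs, every entry is written with the last value of trans. Try using a separate dictionary that stores the translated sequence together with the sequence ID.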
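One way this could look, as a sketch rather than a drop-in fix. To stay self-contained it does not read a fastq file and uses a compact one-letter `translate` as a stand-in for the three-letter version in the question. The names `clean_and_translate`, `CODON_TABLE` and the sample `records` are assumptions made for illustration.

```python
from collections import OrderedDict

# Stand-in translation table (standard genetic code, one-letter symbols,
# '*' for stop). The question's three-letter translate() could be used instead.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(
    [a + b + c for a in BASES for b in BASES for c in BASES], AMINO))

def translate(seq):
    """Translate seq in the +1 frame; unknown codons (e.g. with N) become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def clean_and_translate(records):
    """Deduplicate (read_id, sequence) pairs and keep one translation each."""
    ids = OrderedDict()   # sequence -> joined read IDs, in first-seen order
    translations = {}     # sequence -> its own translation
    for read_id, sequence in records:
        sequence = sequence.upper()
        if sequence not in ids:
            ids[sequence] = read_id
            # Store the translation under the same key as the IDs,
            # instead of overwriting a single variable on each iteration.
            translations[sequence] = translate(sequence)
        else:
            ids[sequence] += "_" + read_id
    return [(ids[s], s, translations[s]) for s in ids]

# Hypothetical input: the third read duplicates the first.
records = [("read1", "ATGGCC"), ("read2", "TCAGCG"), ("read3", "atggcc")]
result = clean_and_translate(records)
```

The translations are keyed by sequence here because the existing dictionary is keyed by sequence too. Keying by ID would work as well, as long as the output loop looks up the translation for the entry it is writing rather than reusing one shared variable.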