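I apologize if this has been asked before, but I don't even know what to search for.
I've just started learning Python and am currently writing my first program. The idea is to identify all open reading frames in a protein sequence; for non-biologists, that means identifying every occurrence of "M...*" in a string.
Here is what I have so far. It almost works, but on each pass it prints the same frame again instead of skipping ahead to the next "M...".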
# calculates amino acid sequence from nucleotide sequence
protein = nucleotide_seq.transcribe().translate()
print("5'3' Frame 1: \n" + protein)
# Calculates all open reading frames in protein sequence
for n in range(len(protein)):
    met = protein.find("M", n)
    stop = protein.find("*", met)
    orf = protein[met:stop]
    print("Open reading frame starting at residue " + str(met+1) + " : " + orf)
    nextmet = protein.find("M", stop)
    n += nextmet
Example protein:
DIMGYF*GLTGSR*VLSSGWIRAQSCTECG*SSEAGVEVRGVRQTDRHSQPARSAV*SELQILFSFHLLSNCPELAPVAPGLVFRECPESLVSSRPREESPAAQALLTAAESSGTHAPAGGSRRAAAAAKNFPGWEDRRQVAESRSQLLQAFPAS*ASPRR*RPEGGGEPRKRRRTCAQLRSHRLLNLGEREPRLPGAPSP*QRRRGQVVGVRAAKTRRRPATAGSALIRSAGRAAALGSEFACGLRGTAAHEERSVSDRDFSKPGSARESTSKSAGGILINPALPGASW*GGRSGDDSQRVRALLEKLSLSKAPGGAGVPRLPQPCCGPETCARSPN*PHVK*RTVL*LQRWKRPSMTMPSTPRSSRPRADLMATVTPRS*
Answer 0 (score: 0)
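The statement n += nextmet can't do what you want, because when control returns to the top of the for loop, n is reset to the next number in the range. So use a while loop instead of a for loop. For example,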
maxloop = len(protein)
n = 0
while n < maxloop:
    met = protein.find("M", n)
    if met == -1:
        break
    # etc
    # Resume after this frame's stop; n = nextmet + 1 would skip the next "M"
    n = stop + 1
I put the if statement there because find returns -1 if it can't find its target.
Here's a more complete demo, now that you've given us some data to work with.
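To make the two points above concrete, here is a minimal sketch (the list names seen_for and seen_while are only for illustration). It shows that a change to the loop variable inside a for loop over range() is discarded on the next pass, that the same change in a while loop sticks, and that str.find reports a missing target with -1.

```python
# In a for loop, n is rebound from range() at the top of every pass,
# so the adjustment made inside the body is lost.
seen_for = []
for n in range(5):
    seen_for.append(n)
    n += 2  # overwritten when the loop fetches the next value

# In a while loop we manage n ourselves, so the jump is honored.
seen_while = []
n = 0
while n < 5:
    seen_while.append(n)
    n += 2

print(seen_for)    # [0, 1, 2, 3, 4]
print(seen_while)  # [0, 2, 4]

# str.find returns -1 when the target is absent.
print("GYF*".find("M"))  # -1
```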
protein = '''DIMGYF*GLTGSR*VLSSGWIRAQSCTECG*SSEAGVEVRGVRQTDRHSQPARSAV*
SELQILFSFHLLSNCPELAPVAPGLVFRECPESLVSSRPREESPAAQALLTAAESSGTHAPAGGSRRAAAAA
KNFPGWEDRRQVAESRSQLLQAFPAS*ASPRR*RPEGGGEPRKRRRTCAQLRSHRLLNLGEREPRLPGAPSP
*QRRRGQVVGVRAAKTRRRPATAGSALIRSAGRAAALGSEFACGLRGTAAHEERSVSDRDFSKPGSARESTS
KSAGGILINPALPGASW*GGRSGDDSQRVRALLEKLSLSKAPGGAGVPRLPQPCCGPETCARSPN*PHVK*
RTVL*LQRWKRPSMTMPSTPRSSRPRADLMATVTPRS*'''
#Get rid of newlines
protein = protein.replace('\n', '')
print("5'3' Frame 1:\n{0}\n".format(protein))
maxloop = len(protein)
n = 0
while n < maxloop:
    met = protein.find("M", n)
    if met == -1:
        break
    stop = protein.find("*", met)
    if stop == -1:
        print('Error: no * found for frame starting at residue', met + 1)
        break
    orf = protein[met:stop]
    print("Open reading frame starting at residue", met + 1, ":", orf)
    n = stop + 1
Output
5'3' Frame 1:
DIMGYF*GLTGSR*VLSSGWIRAQSCTECG*SSEAGVEVRGVRQTDRHSQPARSAV*SELQILFSFHLLSNCPELAPVAPGLVFRECPESLVSSRPREESPAAQALLTAAESSGTHAPAGGSRRAAAAAKNFPGWEDRRQVAESRSQLLQAFPAS*ASPRR*RPEGGGEPRKRRRTCAQLRSHRLLNLGEREPRLPGAPSP*QRRRGQVVGVRAAKTRRRPATAGSALIRSAGRAAALGSEFACGLRGTAAHEERSVSDRDFSKPGSARESTSKSAGGILINPALPGASW*GGRSGDDSQRVRALLEKLSLSKAPGGAGVPRLPQPCCGPETCARSPN*PHVK*RTVL*LQRWKRPSMTMPSTPRSSRPRADLMATVTPRS*
Open reading frame starting at residue 3 : MGYF
Open reading frame starting at residue 358 : MTMPSTPRSSRPRADLMATVTPRS
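If you would rather collect the frames than print them, the same loop can be wrapped in a function. This is only a sketch: find_orfs is a hypothetical helper name, and it assumes that a frame with no closing "*" should simply be dropped rather than reported.

```python
def find_orfs(protein):
    """Return (start_residue, orf) pairs for non-overlapping M...* frames.

    start_residue is 1-based; the stop symbol is not included in orf.
    """
    orfs = []
    n = 0
    while n < len(protein):
        met = protein.find("M", n)
        if met == -1:   # no further start residue
            break
        stop = protein.find("*", met)
        if stop == -1:  # frame runs off the end without a stop
            break
        orfs.append((met + 1, protein[met:stop]))
        n = stop + 1    # resume the search after this frame's stop
    return orfs


# A shortened test sequence, not the full protein from the question
for start, orf in find_orfs("DIMGYF*GLTGSR*RPSMTMPS*"):
    print("Open reading frame starting at residue", start, ":", orf)
```

Returning a list keeps the search separate from the printing, so the results can be filtered (for example by minimum length) before they are shown.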
Answer 1 (score: 0)
import re
protein = "DIMGYF*GLTGSR*VLSSGWIRAQSCTECG*SSEAGVEVRGVRQTDRHSQPARSAV*SELQILFSFHLLSNCPELAPVAPGLVFRECPESLVSSRPREESPAAQALLTAAESSGTHAPAGGSRRAAAAAKNFPGWEDRRQVAESRSQLLQAFPAS*ASPRR*RPEGGGEPRKRRRTCAQLRSHRLLNLGEREPRLPGAPSP*QRRRGQVVGVRAAKTRRRPATAGSALIRSAGRAAALGSEFACGLRGTAAHEERSVSDRDFSKPGSARESTSKSAGGILINPALPGASW*GGRSGDDSQRVRALLEKLSLSKAPGGAGVPRLPQPCCGPETCARSPN*PHVK*RTVL*LQRWKRPSMTMPSTPRSSRPRADLMATVTPRS*"
for match in re.finditer(r'M([^\*]+)\*', protein):
    print(match.start() + 1, match.group())
>3 MGYF*
>358 MTMPSTPRSSRPRADLMATVTPRS*
If M...M..* is not a valid result, you can add M to the forbidden characters: M([^\*M]+)\*.
>3 MGYF*
>374 MATVTPRS*
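If, on the other hand, you want a frame reported for every M, including an M that sits inside a longer frame, one option is to wrap the pattern in a lookahead so that matches can overlap. This is a sketch rather than part of the answer above: all_orfs is a hypothetical helper name, and unlike the patterns above it leaves the trailing "*" out of the captured text.

```python
import re


def all_orfs(protein):
    """Return (start_residue, orf) for every M that is followed by a stop.

    The lookahead (?=...) consumes no characters, so finditer tries
    every position and overlapping frames are all reported.
    """
    pattern = re.compile(r"(?=(M[^*]*)\*)")
    return [(m.start() + 1, m.group(1)) for m in pattern.finditer(protein)]


# A shortened test sequence, not the full protein from the question
for start, orf in all_orfs("RPSMTMPS*LMA*"):
    print(start, orf)
```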
Answer 2 (score: 0)
You are getting duplicates because the for loop increments n by 1 on each pass instead of moving n to the end of the previous frame:
# Calculates all open reading frames in protein sequence
n = 0
length = len(protein)
while n < length:
    met = protein.find("M", n)
    if met == -1:  # No further start residue
        break
    stop = protein.find("*", met)
    if stop == -1:  # Stop is beyond boundary of protein
        break
    orf = protein[met:stop]
    print("Open reading frame starting at residue " + str(met+1) + " : " + orf)
    n = stop + 1