I want to do three things:
1) Print out the ID for each sequence
2) Find a particular motif in a sequence, print it out if it exists
3) Print out the index location for the motif in the sequence
Example Sequence.fasta file:
>sp|Q12955|ANK3_HUMAN Ankyrin-3 OS=Homo sapiens GN=ANK3 PE=1 SV=3
MAHAASQLKKNRDLEINAEEEPEKKRKHRKRSRDRKKKSDANASYLRAARAGHLEKALDY
IKNGVDINICNQNGLNALHLASKEGHVEVVSELLQREANVDAATKKGNTALHIASLAGQA
>sp|Q16659|MK06_HUMAN Mitogen-activated protein kinase 6 OS=Homo sapiens GN=MAPK6 PE=1 SV=1
MAEKFESLMNIHGFDLGSRYMDLKPLGCGGNGLVFSAVDNDCDKRVAIKKIVLTDPQSVK
HALREIKIIRRLDHDNIVKVFEILGPSGSQLTDDVGSLTELNSVYIVQEYMETDLANVLE
QGPLLEEHARLFMYQLLRGLKYIHSANVLHRDLKPANLFINTEDLVLKIGDFGLARIMDP
>sp|Q7Z7A1|CNTRL_HUMAN Centriolin OS=Homo sapiens GN=CNTRL PE=1 SV=2
MKKGSQQKIFKHLQQPSSSHSPIPSSMSNMRSRSLSPLIGSETLPFHSGGQWCEQVEIAD
ENNMLLDYQDHKGADSHAGVRYITEALIKKLTKQDNLALIKSLNLSLSKDGGKKFKYIEN
LEKCVKLEVLNLSYNLIGKIEKLDKLLKLRELNLSYNKISKIEGIENMCNLQKLNLAGNE
In this file I want to find, for example, the following motifs (the same motif can occur more than once in a sequence):
MAH..S
KK..D
FES.MN
K..QQ
So the output should be:
ID = Q12955
Motif = MAH..S
Location = [0] to [5]
Motif = KK..D
Location = [8] to [12]
ID = Q16659
Motif = FES.MN
Location = [4] to [9]
ID = Q7Z7A1
Motif = K..QQ
Location = [2] to [6]
Location = [10] to [14]
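My code so far:
To find the ID:
with open('pr_seq.fasta') as f:
    for idLine in f:
        # Header lines start with '>'; the ID is the second '|'-separated field
        if idLine.startswith('>'):
            lineSplit = idLine.split('|')
            ID = lineSplit[1]
            print(ID)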
To find the motif in the sequence:
pr = ''
with open('pr_seq.fasta') as f:
    for motLine in f:
        if motLine.startswith('>'):
            # A header starts a new record, so reset the sequence buffer
            pr = ''
        else:
            # Join the wrapped sequence lines into one string
            pr += motLine.strip()
print("PROTEIN SEQUENCE")
print()
print(pr)
print()
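To find the index location of the motif:
import re

motif = ['N.E.K..N', 'N.Y....E', 'S...D.PL', 'S..SS', 'S.S..S', 'F.FP']
indices = len(pr)
index = 0
for a in motif:
    if re.findall(a, pr):
        print(a)
        # str.index() looks for the literal text of a, not a regex match
        mi = pr.index(a)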
Answer 0 (score: 0)
Since you explained that there are no newlines inside a sequence, just use grep:
grep MAH..S Sequence.fasta | grep -bo MAH..S
0:MAHAAS
grep KK..D Sequence.fasta | grep -bo KK..D
8:KKNRD
35:KKKSD
grep FES.MN Sequence.fasta | grep -bo FES.MN
4:FESLMN
grep K..QQ Sequence.fasta | grep -bo K..QQ
2:KGSQQ
10:KHLQQ
If searching for the pattern twice is acceptable, you can get the additional information (the ID) like this:
grep -B1 K..QQ Sequence.fasta | awk -F"|" 'NR==1{print $2}'
Q7Z7A1
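Getting the range is trivial: add the length of the match to its position. The inclusive end index is the start offset plus the match length minus one.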
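In Python terms, that arithmetic could look like the sketch below. The helper name motif_ranges is made up for this illustration, and the sequence is the first line of the Q7Z7A1 record from the question. A match object's end() is exclusive, so the inclusive end index is end() - 1.

```python
import re

def motif_ranges(pattern, sequence):
    """Return (start, end) pairs, with an inclusive end, for each match."""
    # m.end() is exclusive, so m.end() - 1 == m.start() + len(m.group()) - 1
    return [(m.start(), m.end() - 1) for m in re.finditer(pattern, sequence)]

# First sequence line of the Q7Z7A1 record from the question
seq = "MKKGSQQKIFKHLQQPSSSHSPIPSSMSNMRSRSLSPLIGSETLPFHSGGQWCEQVEIAD"
for start, end in motif_ranges("K..QQ", seq):
    print("Location = [%d] to [%d]" % (start, end))
```

For this sequence the two ranges are 2 to 6 and 10 to 14, which agrees with the grep offsets above. Note that re.finditer reports non-overlapping matches only.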
Actually, use the re module rather than grep; I had not noticed that your question is tagged Python. Otherwise, run grep through subprocess.call(). In Python it would be:
import re

with open('Sequence.fasta') as f:
    for line in f:
        # re.match only matches at the start of a line; finditer finds every occurrence
        for m in re.finditer('MAH..S', line):
            print(m.start(), m.group())
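Getting the output into the right format is trivial and I leave that to you. Note that . does not match a newline, but you said there are no newlines.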
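For completeness, here is one way the pieces might fit together; treat it as a sketch under a few assumptions rather than a finished tool. The function names read_fasta, find_motifs and report are invented for this example, headers are assumed to look like >sp|ID|NAME as in the question, and wrapped sequence lines are joined before searching so that a motif spanning a line break is still found. The sample record is abbreviated from the question's file.

```python
import re

def read_fasta(lines):
    """Yield (record_id, sequence) pairs, joining wrapped sequence lines."""
    record_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if record_id is not None:
                yield record_id, "".join(chunks)
            # Assumed header layout: >sp|Q12955|ANK3_HUMAN ... (ID is field 2)
            fields = line.split("|")
            record_id = fields[1] if len(fields) > 1 else line[1:]
            chunks = []
        elif line:
            chunks.append(line)
    if record_id is not None:
        yield record_id, "".join(chunks)

def find_motifs(sequence, motifs):
    """Map each motif that occurs to a list of (start, inclusive_end) pairs."""
    hits = {}
    for motif in motifs:
        spans = [(m.start(), m.end() - 1) for m in re.finditer(motif, sequence)]
        if spans:
            hits[motif] = spans
    return hits

def report(lines, motifs):
    """Print the ID, then each motif found and its locations."""
    for record_id, sequence in read_fasta(lines):
        print("ID = " + record_id)
        for motif, spans in find_motifs(sequence, motifs).items():
            print("Motif = " + motif)
            for start, end in spans:
                print("Location = [%d] to [%d]" % (start, end))

# Abbreviated sample record; real input would come from the FASTA file
sample = [
    ">sp|Q16659|MK06_HUMAN Mitogen-activated protein kinase 6",
    "MAEKFESLMNIHGFDLGSRY",
    "MDLKPLGCGG",
]
report(sample, ["MAH..S", "FES.MN"])
```

To run it on the real file, pass an open file handle instead of the sample list, for example report(open('Sequence.fasta'), ['MAH..S', 'KK..D', 'FES.MN', 'K..QQ']). Every motif is tested against every sequence, so a record lists all motifs that occur in it, not just one.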