Finding strings and their positions in a file

Time: 2015-04-13 21:07:58

Tags: python-2.7

I want to do three things:

1) Print out the ID for each sequence
2) Find a particular motif in a sequence, print it out if it exists
3) Print out the index location for the motif in the sequence

Example Sequence.fasta file:

>sp|Q12955|ANK3_HUMAN Ankyrin-3 OS=Homo sapiens GN=ANK3 PE=1 SV=3
MAHAASQLKKNRDLEINAEEEPEKKRKHRKRSRDRKKKSDANASYLRAARAGHLEKALDY
IKNGVDINICNQNGLNALHLASKEGHVEVVSELLQREANVDAATKKGNTALHIASLAGQA

>sp|Q16659|MK06_HUMAN Mitogen-activated protein kinase 6 OS=Homo sapiens GN=MAPK6 PE=1 SV=1
MAEKFESLMNIHGFDLGSRYMDLKPLGCGGNGLVFSAVDNDCDKRVAIKKIVLTDPQSVK
HALREIKIIRRLDHDNIVKVFEILGPSGSQLTDDVGSLTELNSVYIVQEYMETDLANVLE
QGPLLEEHARLFMYQLLRGLKYIHSANVLHRDLKPANLFINTEDLVLKIGDFGLARIMDP

>sp|Q7Z7A1|CNTRL_HUMAN Centriolin OS=Homo sapiens GN=CNTRL PE=1 SV=2
MKKGSQQKIFKHLQQPSSSHSPIPSSMSNMRSRSLSPLIGSETLPFHSGGQWCEQVEIAD
ENNMLLDYQDHKGADSHAGVRYITEALIKKLTKQDNLALIKSLNLSLSKDGGKKFKYIEN
LEKCVKLEVLNLSYNLIGKIEKLDKLLKLRELNLSYNKISKIEGIENMCNLQKLNLAGNE

In this file I want to find motifs such as the following (the same motif can occur more than once in a sequence):

MAH..S
KK..D
FES.MN
K..QQ

So the output should be:

ID = Q12955
Motif = MAH..S
Location = [0] to [4]
Motif = KK..D
Location = [8] to [12]

ID = Q16659
Motif = FES.MN
Location = [4] to [9]

ID = Q7Z7A1
Motif = K..QQ
Location = [1] to [6]
Location = [10] to [14]

What I have so far:

Code:

To find the ID:

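with open('pr_seq.fasta', 'r') as f:
    for idLine in f:
        # Header lines start with '>', e.g. >sp|Q12955|ANK3_HUMAN ...
        if idLine.startswith('>'):
            lineSplit = idLine.split('|')
            ID = lineSplit[1]   # second '|'-separated field is the ID
            print ID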

To find the motif in the sequence:

f = open('pr_seq.fasta', 'r')
pr = ''

for motLine in f:
    if motLine[0] == '>':
        # Header line: start collecting a new sequence
        pr = ''
    else:
        # Sequence line: append it without the trailing newline
        pr += motLine.strip()

    print ("PROTEIN SEQUENCE")
    print
    print (pr)
    print

To find the index location of the motif:

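import re

motif = ['N.E.K..N', 'N.Y....E', 'S...D.PL', 'S..SS', 'S.S..S', 'F.FP']

for a in motif:
    # pr.index(a) would search for the literal pattern text and raise ValueError,
    # so take the position from the regex match object instead
    for m in re.finditer(a, pr):
        print a
        mi = m.start()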

1 answer:

Answer 0 (score: 0)

Since you explained that there are no line breaks, just use grep:

grep MAH..S Sequence.fasta | grep -bo MAH..S
0:MAHAAS

grep KK..D Sequence.fasta | grep -bo KK..D
8:KKNRD
35:KKKSD

grep FES.MN Sequence.fasta | grep -bo FES.MN
4:FESLMN

grep K..QQ Sequence.fasta | grep -bo K..QQ
2:KGSQQ
10:KHLQQ

If searching for the pattern a second time is acceptable, you can get additional information such as the ID like this:

grep -B1 K..QQ Sequence.fasta | awk -F"|" 'NR==1{print $2}'
Q7Z7A1
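
Since the question is tagged Python, the same ID lookup can be sketched there as well. This is only a minimal sketch: `read_fasta` is a hypothetical helper name, the sample headers are shortened versions of the ones in the question, and it assumes every header has the form `>sp|ID|...`. Unlike `grep -B1`, it does not depend on the motif sitting on the first sequence line, because it joins all lines of a record first.

```python
def read_fasta(lines):
    """Yield (ID, sequence) pairs from an iterable of FASTA lines."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith('>'):
            # A new header ends the previous record, if there was one
            if seq_id is not None:
                yield seq_id, ''.join(chunks)
            # Assumed header layout: >sp|Q12955|ANK3_HUMAN ...
            seq_id, chunks = line.split('|')[1], []
        elif line:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, ''.join(chunks)


# Shortened sample data; with a real file, pass the open file object instead
sample = [
    ">sp|Q16659|MK06_HUMAN Mitogen-activated protein kinase 6\n",
    "MAEKFESLMNIHGFDLGSRY\n",
    "MDLKPLGCGG\n",
    "\n",
    ">sp|Q7Z7A1|CNTRL_HUMAN Centriolin\n",
    "MKKGSQQKIFKHLQQ\n",
]
records = list(read_fasta(sample))
print(records)
```

With a file on disk this would be called as `read_fasta(open('Sequence.fasta'))`.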

Getting the range by adding the length of the pattern to the position is trivial.
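
In Python the range comes for free from the match object, so no length arithmetic is needed. A small sketch, where `motif_ranges` is a hypothetical helper name and the sequence is the first sequence line from the question. Note that `re.finditer` reports non-overlapping matches only.

```python
import re


def motif_ranges(pattern, sequence):
    """Return (start, end) index pairs, both inclusive, for each match."""
    # m.end() is one past the last matched character, hence the - 1
    return [(m.start(), m.end() - 1) for m in re.finditer(pattern, sequence)]


seq = "MAHAASQLKKNRDLEINAEEEPEKKRKHRKRSRDRKKKSDANASYLRAARAGHLEKALDY"
print(motif_ranges("KK..D", seq))   # [(8, 12), (35, 39)]
print(motif_ranges("MAH..S", seq))  # [(0, 5)]
```

The start positions agree with the byte offsets that `grep -bo` printed above.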

Actually, use the re module instead of grep; I did not notice that your question was tagged Python. Otherwise, run grep through subprocess.call(). In Python it would be:

import re

with open('Sequence.fasta') as f:
    lines = f.readlines()

for line in lines:
    # re.match only matches at the start of a line;
    # re.finditer finds every occurrence anywhere in the line
    for m in re.finditer('MAH..S', line):
        print(m.start(), m.group())

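Getting the right output format is trivial, and I leave that to you. Note that searching line by line does not find a motif that spans a line break, but you said there are no line breaks.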
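
For what it is worth, one way the formatting could look is sketched below. `report` is a hypothetical helper, the records are hard-coded prefixes of two sequences from the question, and the indices are 0-based and inclusive, so they may differ slightly from the hand-written numbers in the question.

```python
import re


def report(records, motifs):
    """Build report lines for (ID, sequence) pairs and a list of motif patterns."""
    out = []
    for seq_id, sequence in records:
        out.append("ID = {0}".format(seq_id))
        for motif in motifs:
            hits = list(re.finditer(motif, sequence))
            if not hits:
                continue  # skip motifs that do not occur in this sequence
            out.append("Motif = {0}".format(motif))
            for m in hits:
                # m.end() is exclusive, so the last matched index is m.end() - 1
                out.append("Location = [{0}] to [{1}]".format(m.start(), m.end() - 1))
        out.append("")  # blank line between records
    return out


records = [
    ("Q16659", "MAEKFESLMNIHGFDLGSRY"),
    ("Q7Z7A1", "MKKGSQQKIFKHLQQPSSSH"),
]
lines = report(records, ["MAH..S", "KK..D", "FES.MN", "K..QQ"])
print("\n".join(lines))
```

A sequence with no hits at all still gets its ID line here; whether that is wanted depends on how the output is going to be used.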