Regex help

Asked: 2011-04-23 14:09:01

Tags: python regex python-3.x

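I am trying to create a regex in Python 3 that matches 7 characters (e.g. >AB0012), then an unknown number of intervening characters, and then another 6 characters (e.g. aaabbb or bbbaaa). My input string might look like this: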

>AB0012xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa>CD00192aaabbblllllllllllllllllllllyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyybbbaaayyyyyyyyyyyyyyyyyyyy>ZP0199000000000000000000012mmmm3m4mmmmmmmmxxxxxxxxxxxxxxxxxaaabbbaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

Here is the regex I came up with:

matches = re.findall(r'(>.{7})(aaabbb|bbbaaa)', mystring)  
print(matches)

The output I want looks like this:

[('>CD00192', 'aaabbb'), ('>CD00192', 'bbbaaa'), ('>ZP01990', 'aaabbb')]

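I have read the Python documentation, but I could not find how to match an unknown distance between the two parts of the regex. Is there some kind of wildcard that would let me complete it? Thanks in advance for your help!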

Edit:
If I use *? in my code, like this:

mystring = str(input("Paste promoters here: "))
matches = re.findall(r'(>.{7})*?(aaabbb|bbbaaa)', mystring)
print(matches)

My output is:
[('>CD00192', 'aaabbb'), ('', 'bbbaaa'), ('', 'aaabbb')]

The second and third items in the list are missing >CD00192 and >ZP01990, respectively. How can I get the regex to include them in the list?

4 Answers:

Answer 0 (score: 5)

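Here is a non-regex approach. Split on ">" (your data starts at the second element); then, since you don't care what the 7 characters are, check the six characters from the 8th through the 13th.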

>>> string=""" AB0012xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa>CD00192aaabbblllllllllllllllllllllyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyybbbaaayyyyyyyyyyyyyyyyyyyy>ZP0199000000000000000000012mmmm3m4mmmmmmmmxxxxxxxxxxxxxxxxxaaabbbaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa""" 
>>> for i in string.split(">")[1:]:
...   if i[7:13] in ["aaabbb","bbbaaa"]:
...     print(">" + i[:13])
...
>CD00192aaabbb
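
The loop above only reports a motif that sits immediately after the 7-character header. A minimal Python 3 sketch of how the same split-based idea might be extended to report every motif in each record is shown below. The function name find_motifs, the header_len parameter and the shortened sample string are illustrative assumptions, not part of the original answer.

```python
import re

# The two six-character motifs from the question.
MOTIF = re.compile(r"aaabbb|bbbaaa")

def find_motifs(text, header_len=7):
    """Return (header, motif) pairs for every motif found in each '>' record.

    find_motifs and header_len are hypothetical names used for illustration.
    """
    pairs = []
    # Element 0 is whatever precedes the first '>', so skip it.
    for record in text.split(">")[1:]:
        header = ">" + record[:header_len]
        body = record[header_len:]
        # finditer scans the body left to right without overlapping matches.
        for m in MOTIF.finditer(body):
            pairs.append((header, m.group()))
    return pairs

# Shortened sample in the same shape as the question's input.
sample = ">AB0012xxxxaaaa>CD00192aaabbbllyybbbaaayy>ZP0199000012mmxxxxaaabbbaaaa"
print(find_motifs(sample))
# [('>CD00192', 'aaabbb'), ('>CD00192', 'bbbaaa'), ('>ZP01990', 'aaabbb')]
```

Because the matches do not overlap, the "bbbaaa" hidden inside "aaabbbaaa" in the last record is not reported, which agrees with the output the question asks for.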

Answer 1 (score: 1)

I have code that also reports the positions.

Here is a simple version of it:

import re
from collections import OrderedDict

ch = '>AB0012xxxxaaaaaaaaaaaa'\
     '>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb'\
     '>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa'\
     '>QD1547zzzzzzzzjjjiii'\
     '>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa'


print(ch, '\n')

# Group 1: a whole record (without its leading '>') that contains a motif.
# Group 2: the 7-character header of that record.
regx = re.compile(r'((?<=>)(.{7})[^>]*(?:aaabbb|bbbaaa)[^>]*?)(?=>|\Z)')
rag = re.compile('aaabbb|bbbaaa')

dic = OrderedDict()


# Finding the result
for mat in regx.finditer(ch):
    chunk, head = mat.groups()
    headstart = mat.start()
    # Each entry: (absolute position, position within the record, motif)
    dic[(headstart, head)] = [(headstart + six.start(), six.start(), six.group())
                              for six in rag.finditer(chunk)]


# Displaying the result
for (headstart, head), li in dic.items():
    print('{:>10} {}'.format(headstart, head))
    for x in li:
        print('{0[0]:>10} {0[1]:>6} {0[2]}'.format(x))

Result

>AB0012xxxxaaaaaaaaaaaa>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa>QD1547zzzzzzzzjjjiii>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa 

        24 CD00192
        31      7 aaabbb
        41     17 bbbaaa
        52     28 bbbaaa
        62     38 aaabbb
        69 ZP01990
        95     26 aaabbb
       136 SE45789
       148     12 aaabbb
       172     36 bbbaaa

The same code written in a functional style, using a generator:

import re
from collections import OrderedDict

ch = '>AB0012xxxxaaaaaaaaaaaa'\
     '>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb'\
     '>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa'\
     '>QD1547zzzzzzzzjjjiii'\
     '>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa'


print(ch, '\n')

regx = re.compile(r'((?<=>)(.{7})[^>]*(?:aaabbb|bbbaaa)[^>]*?)(?=>|\Z)')
rag = re.compile('aaabbb|bbbaaa')

# Lazily yields ((chunk, head), headstart) for each record containing a motif
gen = ((mat.groups(), mat.start()) for mat in regx.finditer(ch))


dic = OrderedDict(((headstart, head),
                   [(headstart + six.start(), six.start(), six.group())
                    for six in rag.finditer(chunk)])
                  for (chunk, head), headstart in gen)


print('\n'.join('{:>10} {}'.format(headstart, head) + '\n' +
                '\n'.join(map('{0[0]:>10} {0[1]:>6} {0[2]}'.format, li))
                for (headstart, head), li in dic.items()))

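Edit

I measured the execution times.

For each version, I timed the creation of the dictionary and the display separately.

The code using a generator (the second one) displays the result 7.4 times faster (0.020 seconds) than the other (0.148 seconds).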

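But to my surprise, the code using the generator takes about 72% more time (0.000718 seconds) to compute the dictionary than the other code (0.000418 seconds).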
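
For readers who want to take comparable measurements, one possible way is the standard timeit module. The sketch below is only an assumption about how such timings could be taken, not the procedure used for the figures above. build_dict is a hypothetical wrapper around the dictionary-building loop, and the input is shortened.

```python
import re
import timeit
from collections import OrderedDict

# Shortened input; the full string from the answer works the same way.
ch = ('>AB0012xxxxaaaaaaaaaaaa'
      '>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb')

regx = re.compile(r'((?<=>)(.{7})[^>]*(?:aaabbb|bbbaaa)[^>]*?)(?=>|\Z)')
rag = re.compile('aaabbb|bbbaaa')

def build_dict(text):
    """Hypothetical wrapper around the dictionary-building loop."""
    dic = OrderedDict()
    for mat in regx.finditer(text):
        chunk, head = mat.groups()
        headstart = mat.start()
        dic[(headstart, head)] = [(headstart + six.start(), six.start(), six.group())
                                  for six in rag.finditer(chunk)]
    return dic

# Run the function many times and report the average time per call.
runs = 1000
total = timeit.timeit(lambda: build_dict(ch), number=runs)
print('{:.6f} seconds per call'.format(total / runs))
```

Averaging over many runs matters here because a single call takes well under a millisecond, so one-off measurements are dominated by noise.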

Edit 2

Another way:

import re
from collections import OrderedDict

ch = '>AB0012xxxxaaaaaaaaaaaa'\
     '>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb'\
     '>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa'\
     '>QD1547zzzzzzzzjjjiii'\
     '>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa'


print(ch, '\n')


# Group 1 matches a 7-character header, group 2 matches a motif.
regx = re.compile('((?<=>).{7})|(aaabbb|bbbaaa)')

def collect(ch):
    li = []
    dic = OrderedDict()

    gen = ((x.start(), x.group(1), x.group(2)) for x in regx.finditer(ch))
    for st, g1, g2 in gen:
        if g1:
            # New header: store the motifs collected for the previous one.
            if li:
                dic[(stprec, g1prec)] = li
            li, stprec, g1prec = [], st, g1
        elif g2:
            li.append((st, g2))
    if li:
        dic[(stprec, g1prec)] = li
    return dic


dic = collect(ch)

print('\n'.join('{:>10} {}'.format(headstart, head) + '\n' +
                '\n'.join(map('{0[0]:>10}   {0[1]}'.format, li))
                for (headstart, head), li in dic.items()))

Result

>AB0012xxxxaaaaaaaaaaaa>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa>QD1547zzzzzzzzjjjiii>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa 

        24 CD00192
        31   aaabbb
        41   bbbaaa
        52   bbbaaa
        62   aaabbb
        69 ZP01990
        95   aaabbb
       136 SE45789
       148   aaabbb
       172   bbbaaa

This code computes dic in 0.00040 seconds and displays it in 0.0321 seconds.

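Edit 3

To answer your question: you have no choice but to keep the current header ('CD00192', 'ZP01990', 'SE45789', etc.) under a name as you go. (I avoid saying "in a variable" because, strictly speaking, Python binds names to objects rather than storing values in variables; but you can read "under a name" as if I had written "in a variable".)

To do that, you must use finditer().

Here is the code for this solution: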

import re

ch = '>AB0012xxxxaaaaaaaaaaaa'\
     '>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb'\
     '>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa'\
     '>QD1547zzzzzzzzjjjiii'\
     '>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa'


print(ch, '\n')

regx = re.compile('(>.{7})|(aaabbb|bbbaaa)')

matches = []
for mat in regx.finditer(ch):
    g1, g2 = mat.groups()
    if g1:
        head = g1  # remember the most recent header
    else:
        matches.append((head, g2))  # pair the motif with that header

print(matches)

Result

>AB0012xxxxaaaaaaaaaaaa>CD00192aaabbbllyybbbaaayyyuubbbaaaggggaaabbb>ZP0199000012mmmm3m4mmmxxxxaaabbbaaaaaaaaaaaaa>QD1547zzzzzzzzjjjiii>SE457895ffffaaabbbbbbbgjhgjgjhgjhgbbbbbaaa 

[('>CD00192', 'aaabbb'), ('>CD00192', 'bbbaaa'), ('>CD00192', 'bbbaaa'), ('>CD00192', 'aaabbb'), ('>ZP01990', 'aaabbb'), ('>SE45789', 'aaabbb'), ('>SE45789', 'bbbaaa')]

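My earlier versions are more complicated because they capture the positions and collect, in a list, the 'aaabbb' and 'bbbaaa' values found under each header ('CD00192', 'ZP01990', 'SE45789', etc.).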

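Answer 2 (score: 0)

You can use * to match zero or more occurrences of the preceding element, so a* will match "", "a", "aa", and so on. + matches one or more occurrences.

You will probably also want to make the quantifiers lazy by appending ? to them (+? and *?).

See regular-expressions.info for more details.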
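
To make the greedy/lazy distinction concrete, here is a small sketch on a shortened string. The sample text is an assumption chosen for illustration. It also shows why applying *? to the header group, as in the question's edit, produces empty strings.

```python
import re

# Shortened illustrative input: two records, three motifs.
text = ">CD00192aaabbbllyybbbaaa>ZP01990xxaaabbb"

# Greedy: .* runs to the end of the string and backtracks to the LAST motif,
# so the whole input is consumed by a single match.
greedy = re.findall(r"(>.{7}).*(aaabbb|bbbaaa)", text)
print(greedy)           # [('>CD00192', 'aaabbb')]

# Lazy: .*? grows one character at a time and stops at the FIRST motif.
lazy = re.findall(r"(>.{7}).*?(aaabbb|bbbaaa)", text)
print(lazy)             # [('>CD00192', 'aaabbb'), ('>ZP01990', 'aaabbb')]

# Quantifying the header group itself makes the header optional, so a motif
# that does not directly follow a header matches with an empty group 1.
optional_header = re.findall(r"(>.{7})*?(aaabbb|bbbaaa)", text)
print(optional_header)  # [('>CD00192', 'aaabbb'), ('', 'bbbaaa'), ('', 'aaabbb')]
```

In other words, the quantifier has to apply to the filler between the header and the motif (for example .*? or [^>]*?), not to the header group.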

Answer 3 (score: 0)

Try this:

>>> r1 = re.findall(r'(>.{7})[^>]*?(aaabbb)', s)  
>>> r2 = re.findall(r'(>.{7})[^>]*?(bbbaaa)', s)  
>>> r1 + r2
[('>CD00192', 'aaabbb'), ('>ZP01990', 'aaabbb'), ('>CD00192', 'bbbaaa'), ('>ZP01990', 'bbbaaa')]
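
The two passes above return their hits grouped by motif rather than in order of appearance. If ordering matters, one possible variation is to record where each motif starts and sort on that. This is a sketch under that assumption; the shortened string s and the names hits and ordered are illustrative.

```python
import re

# Shortened illustrative input in the same shape as the question's string.
s = ">AB0012xxxxaaaa>CD00192aaabbbllyybbbaaayy>ZP01990xxaaabbbaaaa"

hits = []
for motif in ("aaabbb", "bbbaaa"):
    # One pass per motif, as in the answer; [^>]*? keeps the match inside a record.
    pattern = re.compile(r"(>.{7})[^>]*?(" + motif + ")")
    for m in pattern.finditer(s):
        # m.start(2) is the index at which the motif begins.
        hits.append((m.start(2), m.group(1), m.group(2)))

# Sort by position, then drop the position again.
hits.sort()
ordered = [(header, motif) for _, header, motif in hits]
print(ordered)
# [('>CD00192', 'aaabbb'), ('>CD00192', 'bbbaaa'), ('>ZP01990', 'aaabbb'), ('>ZP01990', 'bbbaaa')]
```

Note that each pass reports at most the first occurrence of its motif per record, and that separate passes also report overlapping motifs, such as the "bbbaaa" inside "aaabbbaaa" in the last record, which a single aaabbb|bbbaaa alternation would skip.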