Match the longest possible character set in Python

Time: 2017-04-27 07:56:54

Tags: python

My strings consist of the characters 0123456789AB. I have a regular expression:

([^1368A]+|[^2479B]+|[^0358A]+|[^1469B]+|[^0257A]+|[^1368B]+|[^02479]+|[^1358A]+|[^2469B]+|[^0357A]+|[^1468B]+|[^02579]+)

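The problem is that it matches the first alternative that succeeds, not the longest one. How can I make it match the longest in Python? I don't expect this to be possible in the regex itself. Edit: I need to find all matches, preferably with the index of the pattern that succeeded. Example input: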

66666A00666160666106606610666610A60661606661606066160660616A00666160666160606610666610A60661606661066066160660616A00666160666160606616066610A60661606661606066106660616A00666106666160606616066610A60661066661606066160660616A0000000000000666606A100666160666160606616066616060661606661606066106666106606610666616060661606661606066106666160606610666610660661066661606066106666160606610666616060661606661606066106666160606616066616060661606661066066160666160606616066610660661606661066066160666106606616066616060661606661066066160666106606616066616060661606661606066160666160606616066610660661066661606066106666160606610666616060661066661606066160666160606616066616060666066616060666066616060666066616060666066616060666066660606666A

Another example:

027027240270272402702724027027240270272402702724027027240270272402702724027027240270272402702724027027240270272402702724027027240270272402702724027027240270272402702724027027240270272402702724027027240270272402702724027027240270272402702724027027240270272402702724027027240270272402702724027027240270272402702724027027240270272427BB232B0738310A5320738310A53202735A8310A53202735A8310A53202735A8310A53202735A8310A532249A540249A540249A540249A540792A54002402702724792A540

Example output:

'470470574704705747047057570570574704705727027B5747047057570570574704705727027B5747047057570570574704705727027B5747047057570570574704705727027B5747047057570570574704705727027B5747047057570570574704705727027B5747047057570570574704705727027B5747047057570570574704705727027B5747047057570570574704705727027B57470470574704705747047057B2727875377AA0577AA0577AA0577AA0577AA0577AA059959959959952257777225'
('1368A','470470574704705747047057570570574704705727027B5747047057570570574704705727027B5747047057570570574704705727027B5747047057570570574704705727027B5747047057570570574704705727027B5747047057570570574704705727027B5747047057570570574704705727027B5747047057570570574704705727027B5747047057570570574704705727027B57470470574704705747047057B2727'),('','8'),('1468B','75377AA0577AA0577AA0577AA0577AA0577AA059959959959952257777225')

Addendum: at the moment I use this code:

import sys,re
from midplay import MidiFile,NoteOn
from collections import deque
notes=("C","C#","D","Eb","E","F","F#","G","G#","A","Bb","B")
noteshex=('0','1','2','3','4','5','6','7','8','9','A','B')
major=lambda x:((x)%12,(x+2)%12,(x+4)%12,(x+5)%12,(x+7)%12,(x+9)%12,(x+11)%12,)
minor=lambda x:((x)%12,(x+2)%12,(x+3)%12,(x+5)%12,(x+7)%12,(x+8)%12,(x+10)%12,)
nomajor=lambda x:{(x+1)%12,(x+3)%12,(x+6)%12,(x+8)%12,(x+10)%12}
nominor=lambda x:{(x+1)%12,(x+4)%12,(x+6)%12,(x+9)%12,(x+11)%12}
nomajortonelist=[re.compile('([^'+''.join([noteshex[note] for note in nomajor(tonality)])+']+)') for tonality in range(12)]
nominortonelist=nomajortonelist[3:]+nomajortonelist[:3]
if len(sys.argv)!=2:
    sys.exit(r'usage: py tonalitydetect.py [C:\path]filename.mid')
midi=MidiFile(sys.argv[1])
for num, track in enumerate(midi):
    print('Track:',num,'messages:',len(track))
    channelnotes=['']*16
    channeltonality=[deque() for _ in range(16)]
    for msg in track:
        if isinstance(msg,NoteOn):
            channelnotes[msg.channel]+=(noteshex[msg.note%12])
    for chnum,channel in enumerate(channelnotes):
        tomatch=[channel]
        matches=[]
        while ''.join(tomatch)!='':
            curchanmaxmatch=deque()
            for string in tomatch:
                for exp in nomajortonelist:
                    curchanmaxmatch.append((exp,max(exp.findall(string)+[''], key=len)))
            matches.append(max(curchanmaxmatch+deque([('','',)]), key=lambda x:len(x[1])))
            newmatch=[]
            found=0
            for x in tomatch:
                if not found:
                    match=x.split(matches[-1][1],1)
                    if len(match)>1:
                        found=1
                    newmatch.extend(match)
                else:
                    newmatch.append(x)
            tomatch=[x for x in newmatch if x!='']
        matches=sorted(matches, key=lambda x:len(x[1]))
        toseek=channel
        while len(matches):
            for num,match in enumerate(matches):
                if not toseek.find(match[1]):
                    channeltonality[chnum].append(match)
                    toseek=toseek[len(match[1]):]
                    del matches[num]
                    break
    for chnum,channel in enumerate(channeltonality):
        print('Channel',chnum,':',[notes[nomajortonelist.index(x[0])]+' major, '+notes[nominortonelist.index(x[0])]+' minor' for x in channel])

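1 answer:

Answer 0: (score: 1)

Edit: see below for a solution that also reports the position of the longest match.

The closest built-in tool for your problem is re.findall(pattern, string, flags=0), which will 'Return all non-overlapping matches of pattern in string, as a list of strings.'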

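One problem in your case is that different matches can overlap, but findall only returns non-overlapping matches. For example, the input string 2B001AA contains two different matches: 2B00 and 001AA. The re.findall function finds and returns the first match, 2B00. It then continues from where it left off, so it returns only 1AA as the next match.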
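The non-overlapping behaviour can be checked directly. The following is a minimal sketch that uses the regex from the question; the variable names combined, combined_result and single_result are only illustrative:

```python
import re

# The combined regex from the question: alternatives are tried left to right,
# and the first one that matches at a position wins.
combined = re.compile(
    r'([^1368A]+|[^2479B]+|[^0358A]+|[^1469B]+|[^0257A]+|[^1368B]+|'
    r'[^02479]+|[^1358A]+|[^2469B]+|[^0357A]+|[^1468B]+|[^02579]+)'
)

# findall consumes '2B00' first, then resumes scanning at the '1'.
combined_result = combined.findall('2B001AA')
print(combined_result)  # ['2B00', '1AA']

# Scanning with the second alternative alone reveals the longer,
# overlapping match that the combined regex never reports.
single_result = re.findall(r'[^2479B]+', '2B001AA')
print(single_result)  # ['001AA']
```

The longer match 001AA overlaps 2B00, which is why it is lost when all alternatives are combined into one pattern.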

You can work around this by breaking the regex into parts that are matched one at a time:

import re
patterns=[
    r'[^1368A]+', r'[^2479B]+', r'[^0358A]+', r'[^1469B]+',
    r'[^0257A]+', r'[^1368B]+', r'[^02479]+', r'[^1358A]+',
    r'[^2469B]+', r'[^0357A]+', r'[^1468B]+', r'[^02579]+'
]

def match_patterns(string):
    for pattern in patterns:
        for match in re.findall(pattern,string):
            yield match

The function match_patterns yields all matches (though not always in order of position). In Python 3, you can write this function more concisely:

def match_patterns(string):
    for pattern in patterns:
        yield from re.findall(pattern,string)

In either case, you can extract the longest match with the built-in function max:

def find_longest_match(string):
    return max(match_patterns(string), key=len)

print(find_longest_match('12A34B32A43')) # prints: A34B3
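One edge case worth noting: for an empty input string no pattern matches, so max receives an empty sequence and raises ValueError. The sketch below is a hypothetical variant (the name find_longest_match_or_empty is not from the answer) that uses the default argument of max, available since Python 3.4, to return an empty string instead:

```python
import re

PATTERNS = [
    r'[^1368A]+', r'[^2479B]+', r'[^0358A]+', r'[^1469B]+',
    r'[^0257A]+', r'[^1368B]+', r'[^02479]+', r'[^1358A]+',
    r'[^2469B]+', r'[^0357A]+', r'[^1468B]+', r'[^02579]+',
]

def find_longest_match_or_empty(string):
    """Return the longest match of any pattern, or '' if nothing matches."""
    candidates = (
        match
        for pattern in PATTERNS
        for match in re.findall(pattern, string)
    )
    # default='' is returned only when there are no candidates at all.
    return max(candidates, key=len, default='')

print(find_longest_match_or_empty('12A34B32A43'))  # A34B3
print(repr(find_longest_match_or_empty('')))       # ''
```

When several matches share the maximum length, max returns the first one it encounters, which here means the one from the earliest pattern in the list.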

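If you also want the position of the longest match, use re.finditer(pattern, string, flags=0), which will 'Return an iterator yielding match objects over all non-overlapping matches for the RE pattern in string.' For each returned match, match.start() gives the start position and match.group(0) gives the matched text.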
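A minimal sketch of that approach follows. It is not necessarily the code the answer originally gave, and the names COMPILED, iter_matches and find_longest_match_with_position are illustrative assumptions. Each pattern is scanned with finditer, and the longest match is kept together with its start position and the index of the pattern that produced it:

```python
import re

PATTERNS = [
    r'[^1368A]+', r'[^2479B]+', r'[^0358A]+', r'[^1469B]+',
    r'[^0257A]+', r'[^1368B]+', r'[^02479]+', r'[^1358A]+',
    r'[^2469B]+', r'[^0357A]+', r'[^1468B]+', r'[^02579]+',
]
COMPILED = [re.compile(pattern) for pattern in PATTERNS]

def iter_matches(string):
    """Yield (start, pattern_index, text) for every match of every pattern."""
    for index, regex in enumerate(COMPILED):
        for match in regex.finditer(string):
            yield match.start(), index, match.group(0)

def find_longest_match_with_position(string):
    """Return (start, pattern_index, text) of the longest match, or None."""
    return max(iter_matches(string), key=lambda item: len(item[2]), default=None)

print(find_longest_match_with_position('12A34B32A43'))  # (2, 11, 'A34B3')
```

Because iter_matches yields every match of every pattern, sorted(iter_matches(string)) lists all matches ordered by start position, which may be closer to what the question's edit asks for.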