Find strings between recurring substrings?

Time: 2019-05-16 13:48:22

Tags: python python-3.x

I have a string like this:

s = "(test1 or (test2 or test3)) and (test4 and (test6)) and (test7 or test8) and test9"

I am trying to extract what is between the outermost ():

['test1 or (test2 or test3)', 'test4 and (test6)', 'test7 or test8']

I have tried:

result = re.search('%s(.*)%s' % ("(", ")"), s).group(1)
result =(s[s.find("(")+1 : s.find(")")])
result = re.search('((.*))', s)

2 answers:

Answer 0 (score: 2)

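You have nested parentheses. That calls for a parser or, if you don't want to go that far, going back to basics: scan the string character by character and cut a group each time the nesting level returns to 0.

Then a small hack removes the `and` token in front of each group, if there is one.

Here is the code I wrote for this. It is not short, but it is not very complex either, and it is self-contained, with no extra libraries: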

s = "(test1 or (test2 or test3)) and (test4 and (test6)) and (test7 or test8) and test9"

nesting_level = 0
previous_group_index = 0

def rework_group(group):
    # not the brightest function but works. Maybe needs tuning
    # that's not the core of the algorithm but simple string operations
    # look for the first opening parenthesis, remove what's before
    idx = group.find("(")
    if idx!=-1:
        group = group[idx:]
    else:
        # no parentheses: split according to blanks, keep last item
        group = group.split()[-1]
    return group

result = []

for i,c in enumerate(s):
    if c=='(':
        nesting_level += 1
    elif c==')':
        nesting_level -= 1
        if nesting_level == 0:
            result.append(rework_group(s[previous_group_index:i+1]))
            previous_group_index = i+1

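# handle whatever follows the last top-level group (here: " and test9");
# skip it when empty, otherwise split()[-1] raises IndexError
tail = s[previous_group_index:]
if tail.strip():
    result.append(rework_group(tail))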

Result:

>>> result
['(test1 or (test2 or test3))',
 '(test4 and (test6))',
 '(test7 or test8)',
 'test9']
>>> 
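
The result above keeps the outer parentheses and the trailing `test9`, whereas the question asks only for the contents of each top-level group. A minimal sketch of the same depth-counting idea wrapped in a function might look like the following. The name `top_level_groups` and the choice to raise `ValueError` on unbalanced input are assumptions made for illustration, not part of the answer.

```python
def top_level_groups(text):
    """Return the contents of each top-level (...) group, outer parentheses stripped."""
    groups = []
    depth = 0
    start = None  # index just after the "(" that opened the current top-level group
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                start = i + 1
            depth += 1
        elif ch == ")":
            if depth == 0:
                raise ValueError(f"unbalanced ')' at index {i}")
            depth -= 1
            if depth == 0:
                # back at nesting level 0: one complete top-level group
                groups.append(text[start:i])
    if depth != 0:
        raise ValueError("unbalanced '(' in input")
    return groups


s = "(test1 or (test2 or test3)) and (test4 and (test6)) and (test7 or test8) and test9"
print(top_level_groups(s))
# ['test1 or (test2 or test3)', 'test4 and (test6)', 'test7 or test8']
```

Text outside any parentheses, such as `and test9`, is simply ignored here, which matches the list requested in the question.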

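Answer 1 (score: 0)

If you really want to write a rough parser for this, it would look something like this.

It uses the `scanner` method of the compiled pattern object to walk through the tokens, and appends to the list whenever it is at level 0, where the level is defined by the left and right brackets encountered so far.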

import re

# Token specification
TEST = r'(?P<TEST>test[0-9]*)'
LEFT_BRACKET = r'(?P<LEFT_BRACKET>\()'
RIGHT_BRACKET = r'(?P<RIGHT_BRACKET>\))'
AND = r'(?P<AND> and )'
OR = r'(?P<OR> or )'

master_pat = re.compile('|'.join([TEST, LEFT_BRACKET, RIGHT_BRACKET, AND, OR]))

s = "(test1 or (test2 or test3)) and (test4 and (test6)) and (test7 or test8) and test9"

def generate_list(pat, text):
    ans = []
    elem = ''
    level = 0
    scanner = pat.scanner(text)
    for m in iter(scanner.match, None):
        # print(m.lastgroup, m.group(), level)
        # keep building elem if nested or not tokens to skip for level=0,1
        if (level > 1 or
          (level == 1 and m.lastgroup != 'RIGHT_BRACKET') or
          (level == 0 and m.lastgroup not in ['LEFT_BRACKET', 'AND'])
        ):
            elem += m.group()
        # if at level 0 we can append
        if level == 0 and elem != '':
            ans.append(elem)
            elem = ''
        # set level
        if m.lastgroup == 'LEFT_BRACKET':
            level += 1
        elif m.lastgroup == 'RIGHT_BRACKET':
            level -= 1
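    # flush a group that ends exactly at the end of the text,
    # e.g. when the input stops right after a closing bracket
    if elem != '':
        ans.append(elem)
    return ans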


generate_list(master_pat, s)
# ['test1 or (test2 or test3)', 'test4 and (test6)', 'test7 or test8', 'test9']
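
The loop in `generate_list` relies on the two-argument form `iter(callable, sentinel)`, which keeps calling `scanner.match` until it returns `None`. A small sketch of that idiom follows. The two-token grammar and the helper name `tokenize` are hypothetical and only for illustration. It also shows a consequence worth knowing: scanning stops at the first character that no token matches, so the rest of the input is dropped without an error.

```python
import re

# Hypothetical two-token grammar, for illustration only
pat = re.compile(r'(?P<WORD>[a-z]+)|(?P<SPACE> +)')


def tokenize(pat, text):
    scanner = pat.scanner(text)
    # iter(callable, sentinel): call scanner.match() repeatedly until it returns None
    return [(m.lastgroup, m.group()) for m in iter(scanner.match, None)]


print(tokenize(pat, "ab cd"))
# [('WORD', 'ab'), ('SPACE', ' '), ('WORD', 'cd')]

# "!" matches no token, so scanning ends there and "!ef" is never seen
print(tokenize(pat, "ab cd!ef"))
# [('WORD', 'ab'), ('SPACE', ' '), ('WORD', 'cd')]
```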

Looking at how `scanner` behaves:

master_pat = re.compile('|'.join([TEST, LEFT_BRACKET, RIGHT_BRACKET, AND, OR]))
s = "(test1 or (test2 or test3)) and (test4 and (test6)) and (test7 or test8) and test9"

scanner = master_pat.scanner(s)
scanner.match()
# <re.Match object; span=(0, 1), match='('>
_.lastgroup, _.group()
# ('LEFT_BRACKET', '(')
scanner.match()
# <re.Match object; span=(1, 6), match='test1'>
_.lastgroup, _.group()
# ('TEST', 'test1')
scanner.match()
# <re.Match object; span=(6, 10), match=' or '>
_.lastgroup, _.group()
# ('OR', ' or ')
scanner.match()
# <re.Match object; span=(10, 11), match='('>
_.lastgroup, _.group()
# ('LEFT_BRACKET', '(')
scanner.match()
# <re.Match object; span=(11, 16), match='test2'>
_.lastgroup, _.group()
# ('TEST', 'test2')
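
As far as I can tell, `Pattern.scanner` is not described in the `re` module documentation. For this input the same token stream can be produced with the documented `finditer` method, as in the sketch below. It swaps in a different call from the one the answer uses, and the behaviour differs on bad input: `finditer` skips over text that matches no token, while `scanner.match` stops there.

```python
import re

# Same token specification as in the answer, written as one pattern
master_pat = re.compile(
    r"(?P<TEST>test[0-9]*)"
    r"|(?P<LEFT_BRACKET>\()"
    r"|(?P<RIGHT_BRACKET>\))"
    r"|(?P<AND> and )"
    r"|(?P<OR> or )"
)

s = "(test1 or (test2 or test3)) and (test4 and (test6)) and (test7 or test8) and test9"

# One (token name, matched text) pair per token, in order of appearance
tokens = [(m.lastgroup, m.group()) for m in master_pat.finditer(s)]

for kind, value in tokens[:5]:
    print(kind, repr(value))
# LEFT_BRACKET '('
# TEST 'test1'
# OR ' or '
# LEFT_BRACKET '('
# TEST 'test2'
```

Joining the matched texts back together and comparing with `s` is a cheap way to check that nothing was skipped.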