Question

I have a regular expression, PROTO\s*\{(\n*\s*\w+,)+\n*\s*\}, that matches a text file like the following:

PROTO {
    product1,
    product2,
    product3,
    product4,
    product5,
    product6,
}

SAVE_LOG: True

SUMMARY: True

How can I use the regular expression above to get the output as a list, like this:

['product1', 'product2', 'product3', 'product4', 'product5', 'product6']

Answer 1

This doesn't need a regular expression; you can get what you want with simple string functions.

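with open('path/to/file.txt', 'r') as fp:
    product_list = []
    append_bool = False  # True while we are inside the PROTO { ... } block
    for line in fp:
        stripped = line.strip()
        if stripped.startswith('PROTO'):
            append_bool = True   # block starts; skip the header line itself
        elif append_bool and '}' in stripped:
            append_bool = False  # the closing brace ends the block
        elif append_bool and stripped:
            product_list.append(stripped.replace(',', ''))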
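
The same idea can be sketched for text that is already in memory, using only str.partition and str.split. The function name extract_products and the sample text are illustrative assumptions, and the sketch assumes a single PROTO block with no nested braces.

```python
def extract_products(text):
    """Return the comma-separated names inside the first PROTO { ... } block."""
    # Everything after the first 'PROTO' keyword.
    _, found, rest = text.partition('PROTO')
    if not found:
        return []
    # Keep only what sits between the first '{' and the next '}'.
    _, brace, body = rest.partition('{')
    if not brace:
        return []
    body, _, _ = body.partition('}')
    # Split on commas and drop empty pieces (e.g. after the trailing comma).
    return [name.strip() for name in body.split(',') if name.strip()]


# Hypothetical sample input, shortened from the question.
sample = """PROTO {
    product1,
    product2,
    product3,
}

SAVE_LOG: True
"""
print(extract_products(sample))
```

Returning an empty list when the keyword or the opening brace is missing keeps the caller from having to catch exceptions for files without a PROTO block.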

Answer 2

This will get you the list you want:

import itertools
import re

# data is assumed to hold the file contents as a single string
protos = re.findall(r'PROTO\s*\{(.*?)\}', data, flags=re.DOTALL)
lines = [re.findall(r'(\w+),', x) for x in protos]
products = list(itertools.chain.from_iterable(lines))
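
Two steps are needed because a repeated capture group keeps only its last iteration, so the pattern from the question cannot return every name through a single group. Here is a self-contained sketch of both behaviours, assuming the file content has already been read into a string (the variable names and the shortened sample text are illustrative):

```python
import itertools
import re

data = """PROTO {
    product1,
    product2,
    product3,
}

SAVE_LOG: True
"""

# The question's pattern: group 1 is overwritten on every repetition,
# so only the last "name," chunk survives.
original = re.search(r'PROTO\s*\{(\n*\s*\w+,)+\n*\s*\}', data)
last_only = original.group(1).strip()

# Step 1: grab the body of every PROTO block.
bodies = re.findall(r'PROTO\s*\{(.*?)\}', data, flags=re.DOTALL)
# Step 2: pull the names out of each body and flatten the result.
products = list(itertools.chain.from_iterable(
    re.findall(r'(\w+),', body) for body in bodies))

print(last_only)
print(products)
```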

Answer 3

If you are able to install the newer regex module (which supports the \G anchor), you could come up with something like this (demo on regex101.com):

(?:^PROTO\s*\{\s+|(?!\A)\G\s*)([^,\n\r]+),

In Python, this would be:

import regex as re

string = """
PROTO {
    product1,
    product2,
    product3,
    product4,
    product5,
    product6,
}

SAVE_LOG: True

SUMMARY: True
"""

rx = re.compile(r"""
        (?:^PROTO\s*\{\s+   # look for PROTO at the beginning of the line,
                            # followed by whitespace and {
            |               # OR
            (?!\A)\G\s*)    # start at the previous match (make sure it's not the start)
        ([^,\n\r]+),        # look for sth. that is not a comma or newline
        """, re.VERBOSE|re.MULTILINE)

matches = rx.findall(string)
print(matches)
# ['product1', 'product2', 'product3', 'product4', 'product5', 'product6']

The advantage is that there is only one regular expression (which is also precompiled), so it may be faster.
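
If installing regex is not an option, a single compiled pattern is still possible with the standard re module by swapping \G for a lookahead. This is a different technique from the one above, and it rests on an assumption: the only braces in the text belong to PROTO blocks, since the pattern checks for a closing brace ahead but never checks for the PROTO keyword itself. The names used here are illustrative.

```python
import re

# A word followed by a comma, provided the next brace character
# in the text is a closing '}' (i.e. we are still inside a block).
NAME_IN_BLOCK = re.compile(r'(\w+),(?=[^{}]*\})')

# Hypothetical sample input, shortened from the question.
text = """PROTO {
    product1,
    product2,
    product3,
}

SAVE_LOG: True

SUMMARY: True
"""

names = NAME_IN_BLOCK.findall(text)
print(names)
```

If the file could contain other brace-delimited blocks, the two-step approach from Answer 2 is the safer choice, because it anchors on the PROTO keyword.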
