Question

I am writing a Python script to parse a file containing data like this:

// some more entries with different structures.
leaf a include-1 {
 type 1;
 description "test1";
}
leaf b include-2 {
 type string;
 description "test2";
}
// some other entries

I want to get all of my leaf names and descriptions, for example:

a test1
b test2

I tried the following:

 regExStr = '^leaf.*{.*include-.*}$'
 compiled =  re.compile(regExStr, re.DOTALL | re.MULTILINE)
 matched = compiled.search(line)

In other words, I want my search to start with leaf, then {, followed by include-, followed by any digits, followed by anything, and to end with }.

Because I used re.DOTALL | re.MULTILINE, my . should also match newlines.

But I cannot get the desired result. What am I missing here?

Answer 1

If the format is always the same, you can use a dict and str.split:

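d = {}
with open("in.txt") as f:
    for line in f:
        if line.startswith("leaf"):
            # "leaf a include-1 {" -> the second field is the leaf name
            key = line.split(None, 2)[1]
            next(f)  # skip the "type" line
            # last field of the description line, minus the trailing ; and the quotes
            val = next(f).split()[-1].strip(';"\n')
            d[key] = val

for k, v in d.items():
    print(k, v)

Output:

a test1
b test2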
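The loop above relies on the description sitting exactly two lines below the leaf line. As a rough sketch of how that assumption could be relaxed, the version below tracks the current leaf and picks up the description wherever it appears inside the block. The function name `leaf_descriptions` and the inline `sample` string are illustrative assumptions, not part of the original answer.

```python
def leaf_descriptions(lines):
    """Map each leaf name to its description (hypothetical helper)."""
    result = {}
    current = None  # name of the leaf block we are currently inside
    for line in lines:
        words = line.split()
        if not words:
            continue
        if words[0] == "leaf" and len(words) > 1:
            current = words[1]
        elif words[0] == "description" and current is not None:
            # Take the text between the first pair of double quotes.
            parts = line.split('"')
            if len(parts) >= 3:
                result[current] = parts[1]
        elif words[0] == "}":
            current = None
    return result


sample = """// some more entries with different structures.
leaf a include-1 {
 type 1;
 description "test1";
}
leaf b include-2 {
 type string;
 description "test2";
}
// some other entries
"""

for name, description in leaf_descriptions(sample.splitlines()).items():
    print(name, description)
```

Because the description is taken from between the quotes, it may also contain spaces, which `split()[-1]` would cut off.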

Answer 2

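With the re.S (= re.DOTALL) modifier, . also matches newlines, so a regex can search across multiple lines. To match at the start of each line, use re.M (multiline mode). The names are a bit confusing, but the two flags can be combined.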
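A small sketch of what each flag changes on its own. The shortened `sample` string and the variable names are assumptions made for illustration only.

```python
import re

sample = 'intro\nleaf a include-1 {\n description "test1";\n}'

# Without re.M, ^ matches only at the very start of the string.
start_default = re.search(r"^leaf", sample)
# With re.M, ^ also matches right after every newline.
start_multiline = re.search(r"^leaf", sample, re.M)

# Without re.S, . stops at a newline, so the match cannot leave its line.
span_default = re.search(r"leaf.*\}", sample)
# With re.S, . matches newlines too, so the match can span lines.
span_dotall = re.search(r"leaf.*\}", sample, re.S)

# Both flags combined with the | operator.
both = re.search(r"^leaf.*\}", sample, re.S | re.M)

print(start_default, start_multiline is not None)
print(span_default, span_dotall is not None)
print(repr(both.group(0)))
```

Note that with re.S a greedy .* runs to the last } in the whole input, which is one reason a pattern built on negated character classes or lazy quantifiers tends to be easier to control.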

You can get the result with an updated regex:

import re
# [^{] and [^}] already match newlines, so re.S is not strictly needed here; re.M lets ^ match at each line start.
p = re.compile(r'^leaf\s+(\S+)[^{]*\{[^}]*\bdescription\s+"([^"]+)"[^}]*}', re.S | re.M)
test_str = "leaf a include-1 {\n type 1;\n description \"test1\";\n}\nleaf b include-2 {\n type string;\n description \"test2\";\n}"
print(["%s %s" % (x.group(1), x.group(2)) for x in p.finditer(test_str)])

Output:

['a test1', 'b test2']

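The regex matches:

^ - the start of a line
leaf - the literal text leaf
\s+(\S+) - one or more whitespace characters, then a captured sequence of one or more non-whitespace characters
[^{]* - zero or more characters other than {
\{ - a literal {
[^}]* - zero or more characters other than }
\bdescription - the word description, preceded by a word boundary
\s+"([^"]+)" - one or more whitespace characters, then ", then a captured run of one or more characters other than a double quote, then the closing double quote
[^}]* - zero or more characters other than }
} - a literal }.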
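The same breakdown can be kept next to the pattern itself by writing it in verbose mode. This is only a sketch: the name `LEAF_RE` and the `sample` string are assumptions, and re.S is left out because the negated character classes already match newlines.

```python
import re

LEAF_RE = re.compile(
    r"""
    ^leaf            # the word leaf at the start of a line (needs re.M)
    \s+(\S+)         # whitespace, then the leaf name (group 1)
    [^{]*            # anything up to the opening brace
    \{               # a literal opening brace
    [^}]*            # block content before the description
    \bdescription    # the word description
    \s+"([^"]+)"     # whitespace, then the quoted text (group 2)
    [^}]*            # the rest of the block
    \}               # a literal closing brace
    """,
    re.M | re.X,
)

sample = """// some more entries with different structures.
leaf a include-1 {
 type 1;
 description "test1";
}
leaf b include-2 {
 type string;
 description "test2";
}
// some other entries
"""

pairs = LEAF_RE.findall(sample)
for name, description in pairs:
    print(name, description)
```

Since [^}]* cannot cross a closing brace, a leaf block without a description is simply skipped rather than borrowing the description of the next block.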

Answer 3

I am using finditer so that you can iterate over multiple matches:

import re

line = """leaf a include-1 {
 type 1;
 description "test1";
}
leaf b include-2 {
 type string;
 description "test2";
}"""

# Raw string; \w+ so that names longer than one character also match.
regExStr = r'^leaf (\w+) include-.*?description "(.*?)";.*?}'
compiled = re.compile(regExStr, re.DOTALL | re.MULTILINE)
for m in compiled.finditer(line):
    print(m.groups())

Prints:

('a', 'test1')
('b', 'test2')

As you can see, each result is a tuple: the first element (m.group(1)) is the leaf name and the second element (m.group(2)) is the description.
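To get from the tuples to the "name description" lines the question asks for, the matches can be collected into a dict. The following is a minimal sketch that assumes a variant of the pattern above using \w+ for the name. The helper `leaf_descriptions`, the constant `LEAF`, and the `sample` text are hypothetical names chosen for illustration.

```python
import re

# Variant of the lazy pattern above; \w+ allows multi-character leaf names.
LEAF = re.compile(
    r'^leaf (\w+) include-.*?description "(.*?)";.*?}',
    re.DOTALL | re.MULTILINE,
)


def leaf_descriptions(text):
    """Return {leaf_name: description} for every leaf block found in text."""
    return {m.group(1): m.group(2) for m in LEAF.finditer(text)}


sample = """// some more entries with different structures.
leaf a include-1 {
 type 1;
 description "test1";
}
leaf b include-2 {
 type string;
 description "test2";
}
// some other entries
"""

for name, description in leaf_descriptions(sample).items():
    print(name, description)
```

One caveat with the lazy .*? under re.DOTALL: if a leaf block has no description, the match keeps going into the following block and pairs the first leaf name with the next description, so this approach fits best when every leaf is known to have one.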

Python regex: searching across multiple lines
