Regex to match a comma-separated list of key=value pairs where the value can contain HTML?

Time: 2017-03-01 10:41:33

Tags: python regex

I'm trying to match a comma-separated list of key=value pairs, where the value can contain pretty much anything.

The pattern I'm using is taken directly from a related question:

split_up_pattern = re.compile(r'([^=]+)=([^=]+)(?:,|$)', re.X|re.M)

But it causes problems when the value contains HTML.

Here is a sample script:

import re

text = '''package_contents=<p>The basic Super&nbsp;1050 machine includes the following:</p>
<p>&nbsp;</p>
<table style="" height: 567px;"" border=""1"">
<tbody>
<tr>
<td style=""width: 200px;"">
<ul>
<li>uper 1150 machine</li>
</ul>
</td>
<td>&nbsp;With dies fitted.
<ul>
<li>The Super 1050</li>
</ul>
</td>
</tr>
</tbody>
<table>,second_attribute=something else'''

split_up_pattern = re.compile(r'([\w_^=]+)=([^=]+)(?:,|$)', re.X|re.M)

matches = split_up_pattern.findall(text)

import ipdb; ipdb.set_trace()

print(matches)

Output:

ipdb> matches[0]
('package_contents', '<p>The basic Super&nbsp;1050 machine includes the following:</p>\n\n<p>&nbsp;</p>\n')
ipdb> matches[1]
('border', '""1"">\n\n<tbody>\n\n<tr>\n')
ipdb> matches[2]
('style', '""width: 200px;"">\n\n<ul>\n\n<li>uper 1150 machine</li>\n\n</ul>\n\n</td>\n\n<td>&nbsp;With dies fitted.\n\n<ul>\n\n<li>The Super 1050</li>\n\n</ul>\n\n</td>\n\n</tr>\n</tbody>\n<table>')
ipdb> matches[3]
('second_attribute', 'something else')

The output I want is:

matches[0]

('package_contents', '<p>The basic Super&nbsp;1050 machine includes the following:</p><p>&nbsp;</p><table style="" height: 567px;"" border=""1""><tbody><tr><td style=""width: 200px;""><ul><li>uper 1150 machine</li></ul></td><td>&nbsp;With dies fitted.<ul><li>The Super 1050</li></ul></td></tr>
</tbody><table>',)

matches[1]

('second_attribute', 'something else')

1 Answer:

Answer 0 (score: 1)

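Rather than parsing narrowly on the delimiters (comma or equals sign), you can take advantage of the fact that the next key=value pair always starts like this: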

,WORD=

Here is a sketch of that idea:

import re

text = '''...your example...'''

# Start of the string or our ,WORD= pattern.
rgx_spans = re.compile(r'(\A|,)\w+=')

# Get the start-end positions of all matches.
spans = [m.span() for m in rgx_spans.finditer(text)]

# Use those positions to break up the string into parsable chunks.
for i, s1 in enumerate(spans):
    try:
        s2 = spans[i + 1]
    except IndexError:
        s2 = (None, None)

    start = s1[0]
    end = s2[0]
    key, val = text[start:end].lstrip(',').split('=', 1)

    print()
    print(s1, s2)
    print((key, val))
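
The same ,WORD= idea can also be written as a single split on a lookahead, which avoids the span bookkeeping. This is only a minimal sketch: the names PAIR_BOUNDARY and parse_pairs and the shortened sample string are made up for illustration, and it assumes that no value ever contains a comma immediately followed by word characters and an equals sign.

```python
import re

# Hypothetical names, for illustration only.
# Split at each comma that is immediately followed by WORD=,
# i.e. at the start of the next key=value pair. The lookahead keeps
# the key itself out of the separator, so it stays in the next chunk.
PAIR_BOUNDARY = re.compile(r',(?=\w+=)')


def parse_pairs(text):
    """Return a list of (key, value) tuples from a ,WORD= delimited string."""
    pairs = []
    for chunk in PAIR_BOUNDARY.split(text):
        # Split on the first '=' only, so '=' inside HTML attributes survives.
        key, sep, val = chunk.partition('=')
        if not sep:
            # No '=' in this chunk (for example, empty input): skip it.
            continue
        pairs.append((key, val))
    return pairs


# Shortened stand-in for the HTML value in the question.
sample = 'package_contents=<table border=""1"">\n<td>a, b</td>,second_attribute=something else'

for pair in parse_pairs(sample):
    print(pair)
```

The comma in "a, b" and the border= attribute are left alone because neither forms a comma directly followed by WORD=. If a value can legitimately contain such a sequence, this approach will split it in the wrong place, and the data would need real quoting or escaping instead.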