Convert <ul>, <li>, and untagged items into a list using a Python regular expression

Time: 2014-03-13 18:42:27

Tags: python regex html-lists

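I want to extract the text from the HTML below and collect it into a list of objects, as shown in the desired output. I know this can be done with regular expressions somehow. Please assist.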

Starting HTML text:

peanut butter1
<ul id="ul0002" list-style="none">peanut butter2
    <li id="ul0002-0001" num="0000">2.0 to 6.0 mg of 17&#x3b2;-estradiol and</li>
    <li id="ul0002-0002" num="0000">0.020 mg of ethinylestradiol;</li>
    <br>
    <li id="ul0002-0003" num="0000">0.25 to 0.30 mg of drospirenone and</li>peanut butter3
</ul>peanut butter4

Desired output:

list = [
    ['peanut butter1', 'no tag'],
    ['peanut butter2', 'ul'],
    ['2.0 to 6.0 mg of 17&#x3b2;-estradiol and', 'li'],
    ['0.020 mg of ethinylestradiol;', 'li'],
    ['<br>', 'no tag'],
    ['0.25 to 0.30 mg of drospirenone and', 'li'],
    ['peanut butter3', 'no tag'],
    ['peanut butter4', 'no tag'],
]

1 answer:

Answer 0 (score: 1)

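I agree with the earlier comments about parsing HTML. However, for fun, and assuming the input can be parsed line by line, you could try something like the following: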

ss="""
peanut butter1
<ul id="ul0002" list-style="none">peanut butter2
    <li id="ul0002-0001" num="0000">2.0 to 6.0 mg of 17&#x3b2;-estradiol and</li>
    <li id="ul0002-0002" num="0000">0.020 mg of ethinylestradiol;</li>
    <br>
    <li id="ul0002-0003" num="0000">0.25 to 0.30 mg of drospirenone and</li>peanut butter3
</ul>peanut butter4
"""
import re

# Complete element with attributes on one line, e.g. <li ...>text</li>
tags = re.compile(r".*?<([^/]\w*?) .*?>(.*?)</\1>")
# Opening tag with attributes, followed by whatever text remains on the line
start = re.compile(r".*?<([^/]\w*?) .*?>(.*)")
# Any closing tag
end = re.compile(r"</.*?>")

r = []
for s in ss.split("\n"):
    if not s.strip():
        continue  # skip blank lines
    st = start.match(s)
    if st:  # an opening tag with attributes exists
        m = tags.match(s)
        if m:  # element is opened and closed on this line
            r.append(list(reversed(m.groups())))
            extra = s[m.end():].strip()
            if extra:  # text after the closing tag
                r.append([extra, "no tag"])
        else:  # opening tag only
            r.append(list(reversed(st.groups())))
    else:  # plain text, an attribute-less tag such as <br>, or a closing tag
        s = end.sub("", s)  # remove closing tags
        r.append([s.strip(), "no tag"])
print("\n".join(str(s) for s in r))
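
For comparison, the parser-based route that the earlier comments recommend could look roughly like the sketch below. It relies only on the standard library's html.parser module. The names ListCollector and html_to_items are made up for this illustration and do not come from the question. The labelling rule is an assumption inferred from the desired output: text directly after an opening tag takes that tag's name, text after a closing tag is 'no tag', and <br> is emitted literally. Character references are kept unconverted so that '&#x3b2;' survives as written.

```python
from html.parser import HTMLParser


class ListCollector(HTMLParser):
    """Collect [text, label] pairs (hypothetical helper for this example)."""

    def __init__(self):
        # convert_charrefs=False keeps references such as &#x3b2; unconverted
        super().__init__(convert_charrefs=False)
        self.items = []
        self._label = "no tag"  # label for the text currently being buffered
        self._buf = []

    def _flush(self):
        # Emit the buffered text, if any, under the current label
        text = "".join(self._buf).strip()
        if text:
            self.items.append([text, self._label])
        self._buf = []

    def handle_starttag(self, tag, attrs):
        self._flush()
        if tag == "br":
            # Assumption taken from the desired output: <br> is kept literally
            self.items.append(["<br>", "no tag"])
            self._label = "no tag"
        else:
            self._label = tag

    def handle_endtag(self, tag):
        self._flush()
        self._label = "no tag"

    def handle_data(self, data):
        self._buf.append(data)

    def handle_charref(self, name):
        # Rebuild numeric references, e.g. name "x3b2" becomes "&#x3b2;"
        self._buf.append("&#%s;" % name)

    def handle_entityref(self, name):
        # Rebuild named references, e.g. name "amp" becomes "&amp;"
        self._buf.append("&%s;" % name)

    def close(self):
        super().close()
        self._flush()  # emit any text after the last tag


def html_to_items(text):
    parser = ListCollector()
    parser.feed(text)
    parser.close()
    return parser.items


for item in html_to_items('a<ul id="x">b<li>c</li>d</ul>e'):
    print(item)
```

Unlike the line-by-line regex version, this sketch does not depend on where the line breaks fall or on tags carrying attributes, though it has only been reasoned through for input shaped like the sample in the question.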

Hope this helps!