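I want to extract the text from the HTML below and collect it into a list of [text, tag] pairs, as shown. I know this can be done with regular expressions somehow. Please help.
Starting HTML text: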
peanut butter1
<ul id="ul0002" list-style="none">peanut butter2
<li id="ul0002-0001" num="0000">2.0 to 6.0 mg of 17β-estradiol and</li>
<li id="ul0002-0002" num="0000">0.020 mg of ethinylestradiol;</li>
<br>
<li id="ul0002-0003" num="0000">0.25 to 0.30 mg of drospirenone and</li>peanut butter3
</ul>peanut butter4
Desired output:
list = [
['peanut butter1', 'no tag'],
['peanut butter2', 'ul'],
['2.0 to 6.0 mg of 17β-estradiol and', 'li'],
['0.020 mg of ethinylestradiol;', 'li'],
['<br>', 'no tag'],
['0.25 to 0.30 mg of drospirenone and', 'li'],
['peanut butter3', 'no tag'],
['peanut butter4', 'no tag'],
]
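Answer 0 (score: 1)
I agree with the earlier comments about parsing HTML. However, for fun, and assuming the input can be parsed line by line, you could try the following: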
ss="""
peanut butter1
<ul id="ul0002" list-style="none">peanut butter2
<li id="ul0002-0001" num="0000">2.0 to 6.0 mg of 17β-estradiol and</li>
<li id="ul0002-0002" num="0000">0.020 mg of ethinylestradiol;</li>
<br>
<li id="ul0002-0003" num="0000">0.25 to 0.30 mg of drospirenone and</li>peanut butter3
</ul>peanut butter4
"""
import re

tags = re.compile(r".*?<([^/]\w*?) .*?>(.*?)</\1>")  # full tag on one line, e.g. <li ...>...</li>
start = re.compile(r".*?<([^/]\w*?) .*?>(.*)")       # opening tag with attributes, then trailing text
end = re.compile(r"</.*?>")                          # any closing tag

r = []
for s in ss.split("\n"):
    if not s.strip():
        continue  # skip blank lines
    st = start.match(s)
    if st:  # an opening tag with attributes exists
        m = tags.match(s)
        if m:  # the tag is opened and closed on this line
            r.append(list(reversed(m.groups())))  # groups are (tag, text); store [text, tag]
            extra = s[m.end():].strip()
            if extra:  # text after the closing tag
                r.append([extra, "no tag"])
        else:  # opening tag only, closed on a later line
            r.append(list(reversed(st.groups())))
    else:  # no opening tag with attributes (plain text, <br>, or a closing tag)
        s = end.sub("", s)  # remove closing tags
        r.append([s.strip(), "no tag"])

print("\n".join(str(s) for s in r))  # print() form works on Python 3
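Since the comments recommend a real parser, here is a minimal sketch of the same idea using `html.parser` from the Python 3 standard library. It is not part of the original answer: the class name `TagTextCollector` is hypothetical, and the labelling rules (text right after an opening tag gets that tag, text after a closing tag is "no tag", `<br>` is listed literally) are assumptions chosen to match the desired output in the question.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Hypothetical helper: collects [text, tag] pairs from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.current = "no tag"  # tag opened directly before the next text

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            # Assumption: mirror the desired output, which lists <br> literally.
            self.rows.append(["<br>", "no tag"])
            self.current = "no tag"
        else:
            self.current = tag

    def handle_endtag(self, tag):
        # Assumption: text that follows a closing tag counts as "no tag".
        self.current = "no tag"

    def handle_data(self, data):
        text = data.strip()
        if text:  # ignore whitespace-only runs between tags
            self.rows.append([text, self.current])


sample = """
peanut butter1
<ul id="ul0002" list-style="none">peanut butter2
<li id="ul0002-0001" num="0000">2.0 to 6.0 mg of 17β-estradiol and</li>
<li id="ul0002-0002" num="0000">0.020 mg of ethinylestradiol;</li>
<br>
<li id="ul0002-0003" num="0000">0.25 to 0.30 mg of drospirenone and</li>peanut butter3
</ul>peanut butter4
"""

parser = TagTextCollector()
parser.feed(sample)
parser.close()  # flush any text still buffered at the end of the input
rows = parser.rows
for row in rows:
    print(row)
```

Unlike the regex version, this does not depend on each tag sitting on its own line, though the labelling rules above would need adjusting for more deeply nested markup.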
Hope this helps!