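I want to extract the text from the HTML below and collect it into a list of [text, tag] pairs, as shown further down. I know this can be done with BeautifulSoup.
Starting HTML text: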
input_string = """peanut butter1
<ul id="ul0002" list-style="none">peanut butter2
<li id="ul0002-0001" num="0000">2.0 to 6.0 mg of 17β-estradiol and</li>
<li id="ul0002-0002" num="0000">0.020 mg of ethinylestradiol;</li>
<br>
<li id="ul0002-0003" num="0000">0.25 to 0.30 mg of drospirenone and</li>peanut butter3
</ul>peanut butter4"""
Desired output:
list1 = [
['peanut butter1', 'no tag'],
['peanut butter2', 'ul'],
['2.0 to 6.0 mg of 17β-estradiol and', 'li'],
['0.020 mg of ethinylestradiol;', 'li'],
['<br>', 'no tag'],
['0.25 to 0.30 mg of drospirenone and', 'li'],
['peanut butter3', 'no tag'],
['peanut butter4', 'no tag'],
]
The following does not produce the output I want:
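from bs4 import BeautifulSoup
# Assumption: the string is parsed first; a plain str has no findAll().
soup = BeautifulSoup(input_string, 'html.parser')
x = soup.findAll()
list1 = []
for y in x:
    # findAll() with no arguments returns tags only, so top-level text is lost
    list1.append([y.renderContents(), y.name])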
Answer 0 (score: 1)
The idea is to iterate over the BeautifulSoup text items and check the parent of each one:
from pprint import pprint
from bs4 import BeautifulSoup
input_string = """peanut butter1
<ul id="ul0002" list-style="none">peanut butter2
<li id="ul0002-0001" num="0000">2.0 to 6.0 mg of 17β-estradiol and</li>
<li id="ul0002-0002" num="0000">0.020 mg of ethinylestradiol;</li>
<br>
<li id="ul0002-0003" num="0000">0.25 to 0.30 mg of drospirenone and</li>peanut butter3
</ul>peanut butter4"""
soup = BeautifulSoup(input_string, 'html.parser')
result = []
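# text=True matches text nodes only (newer bs4 releases call this argument string=True)
for item in soup.find_all(text=True):
    value = item.strip()
    if value:  # skip whitespace-only nodes between tags
        # Top-level text hangs directly off the soup object, whose own parent is None
        parent = 'no tag' if item.parent.parent is None else item.parent.name
        result.append([parent, value])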
pprint(result)
Prints:
[['no tag', u'peanut butter1'],
[u'ul', u'peanut butter2'],
[u'li', u'2.0 to 6.0 mg of 17\u03b2-estradiol and'],
[u'li', u'0.020 mg of ethinylestradiol;'],
[u'li', u'0.25 to 0.30 mg of drospirenone and'],
[u'br', u'peanut butter3'],
['no tag', u'peanut butter4']]
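The output above puts the tag name first, leaves out the `<br>` entry, and does not label 'peanut butter3' as 'no tag', so it is not quite the list asked for. As a rough sketch of one way to get closer, here is a variant that swaps in the standard library's `html.parser.HTMLParser` instead of BeautifulSoup. It rests on one reading of the desired output that the question does not spell out: text counts as belonging to a tag only when it comes right after that tag opens, and any text that trails a closed child is 'no tag'. The names `TextWithTagCollector`, `extract_text_with_tags` and `VOID_TAGS` are made up for this illustration.

```python
from html.parser import HTMLParser
from pprint import pprint

# Assumption: tags that never have a closing tag are reported as literal
# markup with the label 'no tag', as the desired output does for <br>.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TextWithTagCollector(HTMLParser):
    """Collects [text, tag] rows while the parser walks the markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []
        self.open_tags = []   # stack of currently open (non-void) tags
        self.fresh = False    # True only between a start tag and the next tag

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            self.rows.append(["<%s>" % tag, "no tag"])
            self.fresh = False
            return
        self.open_tags.append(tag)
        self.fresh = True

    def handle_endtag(self, tag):
        # Pop back to the matching start tag; ignore stray end tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass
        self.fresh = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # whitespace between tags
        if self.open_tags and self.fresh:
            label = self.open_tags[-1]
        else:
            label = "no tag"
        self.rows.append([text, label])


def extract_text_with_tags(html):
    parser = TextWithTagCollector()
    parser.feed(html)
    parser.close()  # flush any text still buffered at the end of the input
    return parser.rows


input_string = """peanut butter1
<ul id="ul0002" list-style="none">peanut butter2
<li id="ul0002-0001" num="0000">2.0 to 6.0 mg of 17β-estradiol and</li>
<li id="ul0002-0002" num="0000">0.020 mg of ethinylestradiol;</li>
<br>
<li id="ul0002-0003" num="0000">0.25 to 0.30 mg of drospirenone and</li>peanut butter3
</ul>peanut butter4"""

result = extract_text_with_tags(input_string)
pprint(result)
```

On the sample input this should give the eight rows from the question in [text, tag] order. If trailing text such as 'peanut butter3' should instead be attributed to the enclosing `ul`, dropping the `self.fresh` check in `handle_data` would do that.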
Hope that helps.