Using BeautifulSoup to convert <ul>, <li>, and untagged items into a list

Date: 2014-03-13 19:33:02

Tags: python xml xml-parsing beautifulsoup

I want to extract the text below and collect it into a list of [text, tag] pairs, as shown further down. I know this can be done with BeautifulSoup.

Starting HTML text:

input_string = """peanut butter1
<ul id="ul0002" list-style="none">peanut butter2
    <li id="ul0002-0001" num="0000">2.0 to 6.0 mg of 17&#x3b2;-estradiol and</li>
    <li id="ul0002-0002" num="0000">0.020 mg of ethinylestradiol;</li>
    <br>
    <li id="ul0002-0003" num="0000">0.25 to 0.30 mg of drospirenone and</li>peanut butter3
</ul>peanut butter4"""

Desired output:

list1 = [
    ['peanut butter1', 'no tag'],
    ['peanut butter2', 'ul'],
    ['2.0 to 6.0 mg of 17&#x3b2;-estradiol and', 'li'],
    ['0.020 mg of ethinylestradiol;', 'li'],
    ['<br>', 'no tag'],
    ['0.25 to 0.30 mg of drospirenone and', 'li'],
    ['peanut butter3', 'no tag'],
    ['peanut butter4', 'no tag'],
]

The following does not produce the output I want:

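from bs4 import BeautifulSoup

# findAll() is a method of a parsed soup, not of the raw string
soup = BeautifulSoup(input_string, 'html.parser')
x = soup.findAll()  # returns Tag objects only, so the loose text is missed
list1 = []
for y in x:
    list1.append([y.renderContents(), y.name])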

1 answer:

Answer 0 (score: 1)

The idea is to iterate over the text nodes in the BeautifulSoup tree and check the parent of each one:

from pprint import pprint
from bs4 import BeautifulSoup


input_string = """peanut butter1
<ul id="ul0002" list-style="none">peanut butter2
    <li id="ul0002-0001" num="0000">2.0 to 6.0 mg of 17&#x3b2;-estradiol and</li>
    <li id="ul0002-0002" num="0000">0.020 mg of ethinylestradiol;</li>
    <br>
    <li id="ul0002-0003" num="0000">0.25 to 0.30 mg of drospirenone and</li>peanut butter3
</ul>peanut butter4"""

soup = BeautifulSoup(input_string, 'html.parser')

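result = []
# text=True matches every text node (newer bs4 releases also accept string=True)
for item in soup.find_all(text=True):
    value = item.strip()
    if value:  # skip whitespace-only nodes
        # top-level text hangs off the BeautifulSoup object, whose own parent is None
        parent = 'no tag' if item.parent.parent is None else item.parent.name
        result.append([parent, value])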

pprint(result)

Prints:

[['no tag', u'peanut butter1'],
 [u'ul', u'peanut butter2'],
 [u'li', u'2.0 to 6.0 mg of 17\u03b2-estradiol and'],
 [u'li', u'0.020 mg of ethinylestradiol;'],
 [u'li', u'0.25 to 0.30 mg of drospirenone and'],
 [u'br', u'peanut butter3'],
 ['no tag', u'peanut butter4']]
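
The `item.parent.parent is None` test in the code above works because text that sits outside every tag is attached directly to the root `BeautifulSoup` object, and that root has no parent of its own. The following is a minimal sketch of that behaviour on a made-up miniature input (the `markup` string and the `rows` list are illustrative names, not part of the question):

```python
from bs4 import BeautifulSoup

# Hypothetical miniature markup, not the question's input.
markup = "loose text<ul>inside ul<li>inside li</li></ul>"
soup = BeautifulSoup(markup, "html.parser")

# The root BeautifulSoup object is the only node without a parent.
print(soup.name, soup.parent)

rows = []
for node in soup.find_all(string=True):
    # A text node whose parent is the root sits outside every tag.
    is_top_level = node.parent.parent is None
    rows.append((str(node), node.parent.name, is_top_level))

for row in rows:
    print(row)
```

Only the first text node should be reported as top-level; the other two have `ul` and `li` as their parents.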

Hope that helps.
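
The printed result above puts the tag name first, has no entry for `<br>`, and does not label `peanut butter3` as `'no tag'`, so it is not quite the list the question asks for. A possible variant is sketched below. It rests on an assumed reading of the desired output: text is labelled with a tag name only when it is the first child of that tag, and any text that follows a sibling element counts as `'no tag'`. The helper name `flatten` is hypothetical. Note that the parser decodes `&#x3b2;` to `β`, so the entity does not appear verbatim in the result.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

input_string = """peanut butter1
<ul id="ul0002" list-style="none">peanut butter2
    <li id="ul0002-0001" num="0000">2.0 to 6.0 mg of 17&#x3b2;-estradiol and</li>
    <li id="ul0002-0002" num="0000">0.020 mg of ethinylestradiol;</li>
    <br>
    <li id="ul0002-0003" num="0000">0.25 to 0.30 mg of drospirenone and</li>peanut butter3
</ul>peanut butter4"""


def flatten(markup):
    """Return [text, label] pairs in document order (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    for node in soup.descendants:
        if isinstance(node, NavigableString):
            text = node.strip()
            if not text:
                continue  # skip whitespace-only nodes between tags
            # Assumed rule: only text that opens an element takes its name.
            leads_element = node.parent is not soup and node.previous_sibling is None
            rows.append([text, node.parent.name if leads_element else "no tag"])
        elif isinstance(node, Tag) and node.name == "br":
            # The desired output lists the line break as its own entry.
            rows.append(["<br>", "no tag"])
    return rows


result = flatten(input_string)
for row in result:
    print(row)
```

Walking `soup.descendants` rather than only the text nodes is what lets the `<br>` element show up in its original position.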