Python BeautifulSoup: how to get the values between tags?

Asked: 2014-07-31 20:21:29

Tags: python html beautifulsoup

My HTML structure is:

<div class="layout4-background">
    <h6 class="game">Game1. How to get all listings below and assign to class"game"?</h6>
    <ul>
        <li class="listing">
    </ul>
    <ul>
        <li class="listing">
    </ul>
    <ul>
        <li class="listing">
    </ul>
    <h6 class="game">Game2. How to get all listings below and assign to class"game?</h6>
    <ul>
        <li class="listing">
    </ul>
    <h6 class="game">Game3. How to get all listings below and assign to class"game?</h6>
    <ul>
        <li class="listing">
    </ul>
</div>

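This is a single div block. Basically, I need to build a list of listings for each h6 header: the first h6 has 3 listings, the second h6 has 1 listing, and the third h6 has 1 listing. Is there a way to do this with BeautifulSoup? Thanks.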

1 answer:

Answer 0: (score: 0)

You can iterate over the result of .find_next_siblings() and keep the <ul> elements that follow each header:

from itertools import takewhile

# Assumes `soup` is a BeautifulSoup object parsed from the HTML above.
div = soup.find('div', class_='layout4-background')
for header in div.find_all('h6'):
    print(header.get_text())
    # Keep the following sibling tags only while they are <ul> elements.
    listings = takewhile(lambda t: t.name == 'ul',
                         header.find_next_siblings(text=False))
    for listing in listings:
        pass  # do something with listing

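The find_next_siblings(text=False) search returns every following sibling that is not a text node, which skips the whitespace between the tags. itertools.takewhile() then yields those siblings only for as long as they are <ul> tags and stops at the first one that is not (here, the next <h6>).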
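
To see the stopping behaviour of takewhile() in isolation, here is a minimal sketch that uses a plain list of tag names as a stand-in for the sibling tags. The list is made up for illustration and is not BeautifulSoup output.

```python
from itertools import takewhile

# Hypothetical names of the tags that follow the first <h6> in the question's HTML.
sibling_names = ['ul', 'ul', 'ul', 'h6', 'ul', 'h6', 'ul']

# takewhile() stops at the first item that fails the test,
# so the 'ul' entries after the first 'h6' are never reached.
leading_uls = list(takewhile(lambda name: name == 'ul', sibling_names))
print(leading_uls)  # ['ul', 'ul', 'ul']
```

This is why takewhile() fits here better than a plain filter: a filter would also pick up the <ul> blocks that belong to later headers.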

Demo:

>>> from bs4 import BeautifulSoup
>>> from itertools import takewhile
>>> soup = BeautifulSoup('''\
... <div class="layout4-background">
...     <h6 class="game">Game1. How to get all listings below and assign to class"game"?</h6>
...     <ul>
...         <li class="listing">
...     </ul>
...     <ul>
...         <li class="listing">
...     </ul>
...     <ul>
...         <li class="listing">
...     </ul>
...     <h6 class="game">Game2. How to get all listings below and assign to class"game?</h6>
...     <ul>
...         <li class="listing">
...     </ul>
...     <h6 class="game">Game3. How to get all listings below and assign to class"game?</h6>
...     <ul>
...         <li class="listing">
...     </ul>
... </div>
... ''')
>>> div = soup.find('div', class_='layout4-background')
>>> for header in div.find_all('h6'):
...     print(header.get_text())
...     listings = takewhile(lambda t: t.name == 'ul',
...                          header.find_next_siblings(text=False))
...     print('Listings found:', len(list(listings)))
... 
Game1. How to get all listings below and assign to class"game"?
Listings found: 3
Game2. How to get all listings below and assign to class"game?
Listings found: 1
Game3. How to get all listings below and assign to class"game?
Listings found: 1
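
If the goal is an actual list per header rather than a count, the same idea can be collected into a dictionary. The following is only a sketch under some assumptions: the helper name listings_by_header and the shortened sample HTML are invented for illustration, the bundled 'html.parser' is chosen explicitly, and find_next_siblings(True) is used as another way to get tag-only siblings (True matches every tag and no text nodes).

```python
from itertools import takewhile

from bs4 import BeautifulSoup

# Shortened, well-formed stand-in for the HTML in the question (assumed sample data).
html = """
<div class="layout4-background">
    <h6 class="game">Game1</h6>
    <ul><li class="listing">a</li></ul>
    <ul><li class="listing">b</li></ul>
    <h6 class="game">Game2</h6>
    <ul><li class="listing">c</li></ul>
</div>
"""


def listings_by_header(soup):
    """Map each <h6> text to the <li class="listing"> tags that follow it."""
    div = soup.find('div', class_='layout4-background')
    grouped = {}
    for header in div.find_all('h6'):
        # Tag-only siblings, cut off at the first one that is not a <ul>.
        uls = takewhile(lambda t: t.name == 'ul',
                        header.find_next_siblings(True))
        grouped[header.get_text()] = [
            li for ul in uls for li in ul.find_all('li', class_='listing')
        ]
    return grouped


soup = BeautifulSoup(html, 'html.parser')
groups = listings_by_header(soup)
for title, items in groups.items():
    print(title, len(items))
```

Each dictionary value holds the <li> Tag objects themselves, so their text or attributes can be read afterwards.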