Iterating over the DOM with BeautifulSoup / Python

Date: 2014-03-19 04:44:14

Tags: python html parsing html-parsing beautifulsoup

I have this DOM:

<h2>Main Section</h2>
<p>Bla bla bla<p>
<h3>Subsection</h3>
<p>Some more info</p>

<h3>Subsection 2</h3>
<p>Even more info!</p>


<h2>Main Section 2</h2>
<p>bla</p>
<h3>Subsection</h3>
<p>Some more info</p>

<h3>Subsection 2</h3>
<p>Even more info!</p>

I want to produce an iterator that returns 'Main Section', 'Bla bla bla', 'Subsection', and so on. Is there a way to do this with BeautifulSoup?

1 answer:

Answer 0 (score: 3)

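Here is one way to do it. The idea is to iterate over the main sections (the h2 tags) and, for each h2 tag, iterate over its following siblings until the next h2 tag is reached: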

from bs4 import BeautifulSoup, Tag


data = """<h2>Main Section</h2>
<p>Bla bla bla<p>
<h3>Subsection</h3>
<p>Some more info</p>

<h3>Subsection 2</h3>
<p>Even more info!</p>


<h2>Main Section 2</h2>
<p>bla</p>
<h3>Subsection</h3>
<p>Some more info</p>

<h3>Subsection 2</h3>
<p>Even more info!</p>"""


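# No parser is named here, so bs4 uses the best one installed; because of
# the unclosed <p> in the data, the output below may vary with the parser.
soup = BeautifulSoup(data)
for main_section in soup.find_all('h2'):
    # Walk the siblings that follow this <h2>
    for sibling in main_section.next_siblings:
        if not isinstance(sibling, Tag):
            continue  # skip bare strings such as newlines between tags
        if sibling.name == 'h2':
            break  # reached the next main section
        print(sibling.text)
    print("-------")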

Prints:

Bla bla bla


Subsection
Some more info
Subsection 2
Even more info!
-------
bla
Subsection
Some more info
Subsection 2
Even more info!
-------
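
Since the question asks for an iterator rather than printed output, the same sibling walk can be wrapped in a generator. This is only a minimal sketch: the names `iter_texts`, `parser`, and `sample` are made up for illustration, it assumes the h2 tags and the tags between them are siblings at the same level, and the sample uses properly closed <p> tags so the result does not depend on how a given parser repairs the markup.

```python
from bs4 import BeautifulSoup, Tag


def iter_texts(html, parser="html.parser"):
    """Yield each <h2> text, then the text of every tag sibling up to the next <h2>."""
    soup = BeautifulSoup(html, parser)
    for main_section in soup.find_all("h2"):
        yield main_section.get_text(strip=True)
        for sibling in main_section.next_siblings:
            if not isinstance(sibling, Tag):
                continue  # skip bare strings such as newlines between tags
            if sibling.name == "h2":
                break  # the next main section starts here
            text = sibling.get_text(strip=True)
            if text:  # drop tags that contain only whitespace
                yield text


# Hypothetical, well-formed sample input
sample = (
    "<h2>Main Section</h2><p>Bla bla bla</p>"
    "<h3>Subsection</h3><p>Some more info</p>"
    "<h2>Main Section 2</h2><p>bla</p>"
)
texts = list(iter_texts(sample))
print(texts)
```

Because `iter_texts` is a generator, the values are produced lazily; `list(...)` is used here only to show them all at once.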

Hope that helps.