Iterating over the "class" attribute in HTML with Python?

Asked: 2013-01-10 07:14:11

Tags: python html beautifulsoup

I have an HTML string from a website. Here is part of it.

<p class="news-body">
<a href="/ci/content/player/45568.html" target="new">Paul Harris,</a> the South African spinner, is to retire at the end of the season, bringing to an end a 14-year first-class career.
</p>
<p class="news-body">
 Harris played 37 Tests for South Africa with his slow-left arm but nearly turned his back on international cricket after a stint as a Kolpak with Warwickshire in 2006. The retirement of Nicky Boje prompted Harris' eventual call-up and he went on to take 103 wickets at 37.87.
</p>
<p class="news-body">
His last Test was in Cape Town against India in January 2011 after which he was dropped for legspinner Imran Tahir. As recently as the start of this season he indicated his intention to compete for a Test place once again.
</p>  </div>
   <!-- body area ends here  -->

I want to extract all of the text above, i.e. the text inside every <p class="news-body">.

I tried Beautiful Soup.

from BeautifulSoup import BeautifulSoup
html = "..."  # the HTML string shown above
parsed_html = BeautifulSoup(html)
print parsed_html.body.find('p', attrs={'class':'news-body'}).text

Unfortunately, this returns only the first paragraph, namely:

Paul Harris,the South African spinner, is to retire at the end of the season, bringing to an end a 14-year first-class career.

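I want it to return all of the text. Can someone help me?

1 answer:

Answer 0: (score: 1)

find returns only the first matching element. You want findAll, which returns a list of all matching elements.

You can join their text like this: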

text = '\n'.join(element.text for element in soup.findAll('p', ...))

Also, I suggest you upgrade to the latest version of BeautifulSoup.
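
For reference, here is a rough sketch of what the same idea might look like after upgrading to the bs4 package (BeautifulSoup 4), where findAll is spelled find_all and a class filter can be passed as class_. The sample HTML is a shortened stand-in for the snippet in the question, the helper name extract_news_body is made up for illustration, and the "html.parser" backend is simply one choice that needs no extra install.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the HTML in the question (assumption: same structure).
html = """
<div>
<p class="news-body">
<a href="/ci/content/player/45568.html" target="new">Paul Harris,</a> the South African spinner, is to retire at the end of the season.
</p>
<p class="news-body">
Harris played 37 Tests for South Africa.
</p>
<p class="other">Not part of the article body.</p>
</div>
"""


def extract_news_body(html_text):
    """Return the text of every <p class="news-body">, one paragraph per line.

    Hypothetical helper, not part of BeautifulSoup itself.
    """
    soup = BeautifulSoup(html_text, "html.parser")
    # find_all returns every match; find would stop at the first one.
    paragraphs = soup.find_all("p", class_="news-body")
    # get_text(" ", strip=True) strips each text fragment and joins them
    # with a single space, so link text and the rest of the sentence
    # do not run together.
    return "\n".join(p.get_text(" ", strip=True) for p in paragraphs)


text = extract_news_body(html)
print(text)
```

Because the function searches from the soup object rather than from parsed_html.body, it should also work on fragments like the one in the question that have no <body> tag.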