How do I extract text from HTML with beautifulsoup4?

Asked: 2015-10-14 06:41:07

Tags: python beautifulsoup

The HTML looks like this:

<td class='Thistd'><a ><img /></a>Here is some text.</td>

I only want the string inside the <td>. I don't need the <a>...</a>. How can I do this?

My code:

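from bs4 import BeautifulSoup
html = """<td class='Thistd'><a><img /></a>Here is some text.</td>"""

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')
tds = soup.find_all('td', {'class': 'Thistd'})
for td in tds:
    print(td)  # prints the whole tag, markup included
    print('=============')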

What I get is <td class='Thistd'><a ><img /></a>Here is some text.</td>

But I only need Here is some text.

2 Answers:

Answer 0 (score: 5)

Code:

from bs4 import BeautifulSoup
html = """<td class='Thistd'><a ><img /></a>Here is some text.</td>"""

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')
tds = soup.find_all('td', {'class': 'Thistd'})
for td in tds:
    print(td.text)  # the only change needed: print the text, not the tag
    print('=============')

Output:

Here is some text.
=============

Note:

.text returns only the text content of the given bs4 object, which in this case is the td tag. For details, see the official site.
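
As a minimal sketch of what .text returns here, assuming Python 3 and the built-in 'html.parser' backend (neither is stated in the question): the <a> element holds only an <img />, so it contributes no text, and the only string left in the cell is the sentence itself.

```python
from bs4 import BeautifulSoup

# Same markup as in the question.
html = "<td class='Thistd'><a><img /></a>Here is some text.</td>"

# Assumption: the standard-library parser, which keeps a bare <td> as-is.
soup = BeautifulSoup(html, "html.parser")
td = soup.find("td", class_="Thistd")

# .text concatenates every string found inside the tag, at any depth.
text = td.text
print(text)
```

Because .text walks all descendants, any text inside the <a> would also show up in the result.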

Answer 1 (score: 3)

Use td.getText() to get the plain text from the element.

For example:

for td in tds:
    print(td.getText())  # plain text of the tag, markup stripped
    print('=============')

Output:

Here is some text.
=============
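
getText() is the older spelling of get_text(), and both accept a strip argument. A short sketch, assuming Python 3, the 'html.parser' backend, and a hypothetical variant of the markup with stray whitespace around the text:

```python
from bs4 import BeautifulSoup

# Hypothetical variant of the question's markup with extra whitespace.
html = "<td class='Thistd'> <a><img /></a> Here is some text. </td>"

soup = BeautifulSoup(html, "html.parser")
td = soup.find("td", class_="Thistd")

# Without arguments, whitespace in the source is preserved.
raw = td.get_text()

# strip=True trims each string and drops the ones that end up empty.
cleaned = td.get_text(strip=True)
print(repr(raw))
print(repr(cleaned))
```

This can be handy when the cell comes from pretty-printed HTML, where indentation would otherwise leak into the result.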

Edit

You can remove the <a> element and then print what is left. The .extract method removes that specific tag from the bs4 object.

For example:

for td in tds:
    td.a.extract()  # removes the <a> tag from the tree, in place
    print(td)

Output:

<td class="Thistd">Here is some<b>here is a b tag </b></td>
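
The difference matters once the <a> itself contains text. The sketch below assumes Python 3, the 'html.parser' backend, and a hypothetical cell whose link has a label. It shows two ways to keep only the cell's own text: extract the link first, or collect just the strings that are direct children of the <td>.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: unlike the question's HTML, the <a> contains text.
html = "<td class='Thistd'><a href='#'>link text</a>Here is some text.</td>"

# Option 1: remove the <a> from the tree, then read the remaining text.
soup = BeautifulSoup(html, "html.parser")
td = soup.find("td", class_="Thistd")
everything = td.get_text()      # still includes the link label
removed = td.a.extract()        # modifies the tree and returns the removed tag
remaining = td.get_text()

# Option 2: leave the tree untouched and take only direct child strings.
soup2 = BeautifulSoup(html, "html.parser")
td2 = soup2.find("td", class_="Thistd")
direct_text = "".join(td2.find_all(string=True, recursive=False))

print(remaining)
print(direct_text)
```

Option 1 changes the parsed document, so it fits best when the link is not needed afterwards. Option 2 is non-destructive but skips text nested in any child tag, not only in <a>.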