The HTML looks like this:
<td class='Thistd'><a ><img /></a>Here is some text.</td>
I only want the string that sits directly inside the <td>. I don't need the <a>...</a> part.
How can I do that?
My code:
from bs4 import BeautifulSoup
html = """<td class='Thistd'><a><img /></a>Here is some text.</td>"""
soup = BeautifulSoup(html)
tds = soup.findAll('td', {'class': 'Thistd'})
for td in tds:
    print(td)
    print('=============')
What I get is <td class='Thistd'><a ><img /></a>Here is some text.</td>
but I only need Here is some text.
Answer 0 (score: 5)
Code:
from bs4 import BeautifulSoup
html = """<td class='Thistd'><a ><img /></a>Here is some text.</td>"""
soup = BeautifulSoup(html)
tds = soup.findAll('td', {'class': 'Thistd'})
for td in tds:
    print(td.text)  # the only change you need to make
    print('=============')
Output:
Here is some text.
=============
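Note:
.text returns only the text of the given bs4 object, in this case the td tag: the markup is dropped and the strings of the element and its descendants are concatenated. For details, see the official Beautiful Soup documentation.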
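As a small sketch of what that means in practice: .text gives the same string as get_text(), and it also picks up text from nested tags. The second snippet of HTML below (with the word "link" inside the <a>) is a made-up variant, not something from the question.

```python
from bs4 import BeautifulSoup

html = "<td class='Thistd'><a><img /></a>Here is some text.</td>"
soup = BeautifulSoup(html, "html.parser")
td = soup.find("td", {"class": "Thistd"})

# .text is a property; get_text() is the method behind it
text_via_property = td.text
text_via_method = td.get_text()

# Hypothetical variant: the <a> now contains text of its own,
# and .text includes it because it walks all descendants
nested = BeautifulSoup("<td><a>link</a>Here is some text.</td>", "html.parser")
nested_text = nested.td.text

print(text_via_property)  # Here is some text.
print(nested_text)        # linkHere is some text.
```

So .text is enough for the HTML in the question only because the <a> there holds an image and no text.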
Answer 1 (score: 3)
Use td.getText() to get the plain text from the element.
For example:
for td in tds:
    print(td.getText())
    print('=============')
Output:
Here is some text.
=============
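If the cell contains stray whitespace or several pieces of text, get_text() also accepts a separator and a strip flag. A minimal sketch, assuming a made-up cell with padding and a text link (not the HTML from the question):

```python
from bs4 import BeautifulSoup

# Hypothetical input with extra whitespace and text inside the <a>
html = "<td class='Thistd'> <a>link</a> Here is some text. </td>"
soup = BeautifulSoup(html, "html.parser")
td = soup.find("td", {"class": "Thistd"})

# Default: strings are concatenated exactly as they appear
raw_text = td.get_text()

# strip=True trims each string and skips empty ones;
# the first argument is the separator placed between them
clean_text = td.get_text(" ", strip=True)

print(repr(raw_text))
print(repr(clean_text))
```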
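Edit:
You can remove the <a> element and then print what is left. The .extract method removes that particular tag from the bs4 object it belongs to.
For example: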
for td in tds:
    td.a.extract()
    print(td)
Output:
<td class="Thistd">Here is some<b>here is a b tag </b></td>
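Two things to watch with extract(): it modifies the tree, and td.a is None when a cell has no <a>, so calling .extract() on it would fail. Here is a rough sketch of a guarded version, plus an alternative that leaves the tree alone by joining only the direct string children of the <td>. The "link text" inside the <a> is an assumption added to make the difference visible.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical input: same shape as the question, but the <a> has text
html = "<td class='Thistd'><a>link text<img /></a>Here is some text.</td>"
soup = BeautifulSoup(html, "html.parser")
td = soup.find("td", {"class": "Thistd"})

# Option 1: leave the tree untouched and keep only the strings
# that are direct children of the <td>
own_text = "".join(
    child for child in td.children if isinstance(child, NavigableString)
)

# Option 2: remove the <a> first (guarding against cells without one),
# then read whatever text remains
link = td.a
if link is not None:
    link.extract()
remaining_text = td.get_text()

print(own_text)
print(remaining_text)
```

Option 1 is probably the safer choice if the soup is reused later, since nothing gets removed from it.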