How can I elegantly get the top-level text of an HTML td with BeautifulSoup4?

Time: 2015-04-16 07:52:27

Tags: python html beautifulsoup bs4

Below is a simple HTML snippet parsed with beautifulsoup4. I want to extract only the top-level raw text, hello:

mysoup = BeautifulSoup('<td>hello<script type="text/javascript">world</script></td>')

I tried several intuitive approaches, but none gave the expected result:

mysoup.text            # u'helloworld'
mysoup.contents        # [<html><body><td>hello<script type="text/javascript">world</script></td></body></html>]
list(mysoup.strings)   # [u'hello', u'world']

So how can this be achieved?

1 answer:

Answer 0 (score: 0)

First, get a reference to the td node. Then iterate over its direct children and keep only the ones that are strings:

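from bs4 import BeautifulSoup, NavigableString

# Name the parser explicitly so the result does not depend on what is installed.
mysoup = BeautifulSoup('<td>hello<script type="text/javascript">world</script></td>', 'html.parser')
td = mysoup.find('td')
# Keep only the direct children that are text nodes, skipping tags such as <script>.
print([s for s in td.children if isinstance(s, NavigableString)])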
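
The same idea can be wrapped in a small reusable function. The following is only a sketch: the helper name `top_level_text` is hypothetical, and the `'html.parser'` choice is an assumption (any installed parser should behave the same way for this snippet). It also shows `find_all(string=True, recursive=False)`, which should give the same direct text nodes without an explicit loop.

```python
from bs4 import BeautifulSoup, NavigableString


def top_level_text(tag):
    """Join the text nodes that are direct children of `tag` (hypothetical helper)."""
    return "".join(
        child for child in tag.children if isinstance(child, NavigableString)
    ).strip()


html = '<td>hello<script type="text/javascript">world</script></td>'
# 'html.parser' is an assumption; it ships with Python, so no extra install is needed.
soup = BeautifulSoup(html, "html.parser")
td = soup.find("td")

# Text that belongs to the td itself, ignoring text nested inside child tags.
direct_text = top_level_text(td)
# Alternative: ask find_all for strings only, restricted to direct children.
direct_strings = td.find_all(string=True, recursive=False)

print(direct_text)
print(direct_strings)
```

Joining the pieces matters when the td has text both before and after a nested tag, since each piece is then a separate child string.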