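I want to use BeautifulSoup to extract the text of a heading inside a div, and the text inside its <strong> tag.
I can get the heading with soup.h1, but I want the h1 that sits inside <div class="site-content">.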
HTML:
<div class="site-content"><h1>Here is the title<strong>( And a bit more! )</strong></h1></div>
So I would like to get Here is the title
and ( And a bit more! )
Can anyone help?
Thanks!
Answer 0 (score: 1)
You can use the attrs parameter of find, for example:
soup.find('div', attrs={'class': 'site-content'}).h1
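A minimal, self-contained sketch of that lookup, assuming the markup is held in a string. The extra leading <h1> and the names markup, first_heading and heading are illustrative additions, not part of the question:

```python
from bs4 import BeautifulSoup

# Markup from the question, preceded by an unrelated <h1> (an assumption,
# added only to show why scoping the search to the div matters).
markup = (
    '<h1>Some other heading</h1>'
    '<div class="site-content"><h1>Here is the title'
    '<strong>( And a bit more! )</strong></h1></div>'
)

soup = BeautifulSoup(markup, "html.parser")

# soup.h1 returns the first <h1> in the document, wherever it is.
first_heading = soup.h1

# Restricting the search to the div returns the <h1> inside it.
heading = soup.find('div', attrs={'class': 'site-content'}).h1

print(first_heading.get_text())  # Some other heading
print(heading.get_text())        # Here is the title( And a bit more! )
```

Note that get_text() on the heading still includes the text of the nested <strong> tag.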
Edit: to get only the direct text:
import bs4

for div in soup.find_all('div', attrs={'class': 'site-content'}):
    # Keep only the strings that are direct children of <h1>.
    print(''.join(x for x in div.h1.contents
                  if isinstance(x, bs4.element.NavigableString)))
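Put together as a runnable sketch, the direct-text filter and a lookup of the nested <strong> tag give the two strings separately. The variable names markup, title and extra are assumptions for illustration:

```python
import bs4
from bs4 import BeautifulSoup

markup = ('<div class="site-content"><h1>Here is the title'
          '<strong>( And a bit more! )</strong></h1></div>')
soup = BeautifulSoup(markup, "html.parser")
h1 = soup.find('div', attrs={'class': 'site-content'}).h1

# Direct text: only the NavigableString children of <h1>, not nested tags.
title = ''.join(
    child for child in h1.contents
    if isinstance(child, bs4.element.NavigableString)
)

# Text of the nested <strong> tag.
extra = h1.strong.get_text()

print(title)  # Here is the title
print(extra)  # ( And a bit more! )
```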
Life is easier with lxml and XPath:
>>> from lxml import html
>>> root = html.parse('x.html')
>>> print(root.xpath('//div[@class="site-content"]/h1/text()'))
['Here is the title']
>>> print(root.xpath('//div[@class="site-content"]/h1//text()'))
['Here is the title', '( And a bit more! )']
>>> print(root.xpath('//div[@class="site-content"]/h1/strong/text()'))
['( And a bit more! )']
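The session above reads the markup from a file named x.html. As a sketch that runs without that file, the same XPath expressions can be applied to a string parsed with lxml.html.document_fromstring. The names markup, title and extra are assumptions:

```python
from lxml import html

markup = ('<div class="site-content"><h1>Here is the title'
          '<strong>( And a bit more! )</strong></h1></div>')

# document_fromstring parses markup held in memory and returns the <html> root.
root = html.document_fromstring(markup)

# text() selects only the text nodes that are direct children of <h1>.
title = root.xpath('//div[@class="site-content"]/h1/text()')

# The nested <strong> tag is addressed by its own path.
extra = root.xpath('//div[@class="site-content"]/h1/strong/text()')

print(title)  # ['Here is the title']
print(extra)  # ['( And a bit more! )']
```

Each xpath call returns a list, so index it (for example title[0]) to get a single string.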
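Answer 1 (score: 0)
The following code uses BeautifulSoup to extract the text of the heading inside the div, including the text of the nested tag, as a single string: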
>>> from bs4 import BeautifulSoup
>>> data = """<div class="site-content"><h1>Here is the title<strong>( And a bit more! )</strong></h1></div>"""
>>> soup = BeautifulSoup(data, "html.parser")
>>> reqText = soup.find('div', {"class":"site-content"}).text
>>> reqText
'Here is the title( And a bit more! )'
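The .text attribute joins both pieces into one string. If the two strings are wanted separately, one option is the stripped_strings generator, sketched below. The names div, parts and joined are assumptions:

```python
from bs4 import BeautifulSoup

data = ('<div class="site-content"><h1>Here is the title'
        '<strong>( And a bit more! )</strong></h1></div>')
soup = BeautifulSoup(data, "html.parser")
div = soup.find('div', {"class": "site-content"})

# stripped_strings yields each text node under the div, whitespace-trimmed.
parts = list(div.stripped_strings)
print(parts)  # ['Here is the title', '( And a bit more! )']

# get_text accepts a separator that is placed between the text nodes.
joined = div.get_text(separator=' | ')
print(joined)  # Here is the title | ( And a bit more! )
```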