Extracting a heading and a strong tag with Beautiful Soup

Time: 2014-01-28 20:33:07

Tags: python html web-scraping beautifulsoup

I want to use BeautifulSoup to extract the text string from a heading inside a div, including the text inside its <strong> tag.

I can get a heading with soup.h1, but I want the h1 that sits specifically inside <div class="site-content">.

HTML:

<div class="site-content"><h1>Here is the title<strong>( And a bit more! )</strong></h1></div>

So I want to get Here is the title( And a bit more! ). Can anyone help?

Thanks!

2 answers:

Answer 0: (score: 1)

You can use the attrs parameter of find, for example:

soup.find('div', attrs={'class': 'site-content'}).h1
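To make the one-liner above concrete, here is a minimal sketch of the scoped lookup as a small function. The function name heading_text, the css_class parameter, and the extra "Page banner" h1 are assumptions added for illustration; they are not in the question. The banner is only there to show that soup.h1 alone would pick the wrong heading.

```python
from bs4 import BeautifulSoup

# The leading <h1>Page banner</h1> is a hypothetical extra heading, added to
# show why scoping the search to the div matters.
html_doc = ('<h1>Page banner</h1>'
            '<div class="site-content"><h1>Here is the title'
            '<strong>( And a bit more! )</strong></h1></div>')


def heading_text(markup, css_class="site-content"):
    """Return all text of the first h1 inside the first matching div, or None."""
    soup = BeautifulSoup(markup, "html.parser")
    div = soup.find('div', attrs={'class': css_class})
    # find() returns None when nothing matches, so guard before using .h1
    if div is None or div.h1 is None:
        return None
    # get_text() joins the h1's own text with the text of nested tags
    return div.h1.get_text()


print(heading_text(html_doc))
```

Returning None when the div or h1 is missing is one possible design choice; raising an exception would work just as well if a missing heading should be treated as an error.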

Edit: to get only the direct text

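import bs4  # needed for bs4.element.NavigableString

for div in soup.find_all('div', attrs={'class': 'site-content'}):
    # keep only the h1's own text nodes, skipping child tags such as <strong>
    print(''.join(x for x in div.h1.contents
                  if isinstance(x, bs4.element.NavigableString)))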
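As a rough, self-contained version of the direct-text idea, the sketch below separates the heading's own text from the text inside <strong>. The helper name direct_text and the variable names are assumptions introduced here, not part of the answer.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

html_doc = ('<div class="site-content"><h1>Here is the title'
            '<strong>( And a bit more! )</strong></h1></div>')
soup = BeautifulSoup(html_doc, "html.parser")


def direct_text(tag):
    """Join only the text nodes that are immediate children of `tag`."""
    return ''.join(child for child in tag.contents
                   if isinstance(child, NavigableString))


h1 = soup.find('div', attrs={'class': 'site-content'}).h1
heading_only = direct_text(h1)        # text outside <strong>
strong_only = h1.strong.get_text()    # text inside <strong>

print(heading_only)
print(strong_only)
```

Concatenating heading_only and strong_only gives back the full string the question asks for, so this approach is mainly useful when the two parts need to be handled separately.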

Life is easier with lxml and XPath:

>>> from lxml import html
>>> root = html.parse('x.html')
>>> print(root.xpath('//div[@class="site-content"]/h1/text()'))
['Here is the title']
>>> print(root.xpath('//div[@class="site-content"]/h1//text()'))
['Here is the title', '( And a bit more! )']
>>> print(root.xpath('//div[@class="site-content"]/h1/strong/text()'))
['( And a bit more! )']
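The XPath queries above return lists of text nodes, while the question asks for a single string. A minimal sketch of that last step follows, assuming lxml is installed and parsing the snippet from a string rather than from x.html so that it runs on its own. The variable names are illustrative only.

```python
from lxml import html

snippet = ('<div class="site-content"><h1>Here is the title'
           '<strong>( And a bit more! )</strong></h1></div>')

# document_fromstring wraps the fragment in <html><body> and returns the root
root = html.document_fromstring(snippet)
h1 = root.xpath('//div[@class="site-content"]/h1')[0]

# Option 1: text_content() returns all nested text as one string
full_title = h1.text_content()

# Option 2: join the text nodes selected with XPath
joined = ''.join(h1.xpath('.//text()'))

print(full_title)
print(joined)
```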

Answer 1: (score: 0)

Code that uses BeautifulSoup to extract the text string from the heading inside the div, including the text inside the <strong> tag.

>>> from bs4 import BeautifulSoup
>>> data = """<div class="site-content"><h1>Here is the title<strong>( And a bit more! )</strong></h1>"""
>>> soup = BeautifulSoup(data, "html.parser")
>>> reqText = soup.find('div', {"class":"site-content"}).text
>>> reqText
'Here is the title( And a bit more! )'
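One caveat: .text on the div returns the text of everything inside the div, not just the h1. That works for the markup in the question, but it may pick up extra text if the div holds more than the heading. The sketch below illustrates the difference; the <p>Body copy</p> element is a hypothetical addition, and the CSS selector is one possible way to narrow the match to the h1.

```python
from bs4 import BeautifulSoup

# The <p>Body copy</p> element is hypothetical; it is not in the question's HTML.
data = ('<div class="site-content"><h1>Here is the title'
        '<strong>( And a bit more! )</strong></h1>'
        '<p>Body copy</p></div>')
soup = BeautifulSoup(data, "html.parser")

# Text of the whole div, including the paragraph
whole_div = soup.find('div', {"class": "site-content"}).text

# Text of only the h1 that is a direct child of the div
h1_only = soup.select_one('div.site-content > h1').get_text()

print(whole_div)
print(h1_only)
```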