Question

I want to use BeautifulSoup to extract the text of a heading inside a div, and also the text inside its <strong> tag.

I can get the heading with soup.h1, but I want the h1 that sits specifically inside <div class="site-content">.

HTML:

<div class="site-content"><h1>Here is the title<strong>( And a bit more! )</strong></h1></div>

So I want to get "Here is the title" and "( And a bit more! )". Can anyone help?

Thanks!

Answer 1

You can use the attrs parameter of find, for example:

soup.find('div', attrs={'class': 'site-content'}).h1
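For reference, here is a small self-contained sketch of that lookup, run against the HTML from the question. The class_ keyword and the CSS selector are alternatives I am adding for comparison, not part of the original answer, and the variable names are just for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
HTML = ('<div class="site-content"><h1>Here is the title'
        '<strong>( And a bit more! )</strong></h1></div>')

soup = BeautifulSoup(HTML, "html.parser")

# Lookup as in the answer: match the div by its class, then take its h1.
h1_attrs = soup.find("div", attrs={"class": "site-content"}).h1

# Alternatives (assumed equivalent for this markup): the class_ keyword
# and a CSS selector via select_one.
h1_kw = soup.find("div", class_="site-content").h1
h1_css = soup.select_one("div.site-content h1")

print(h1_attrs)
```

All three should point at the same h1 here. Note that find returns None when nothing matches, so real code would want to check for that before reaching for .h1.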

Edit: to get only the direct text

import bs4

for div in soup.find_all('div', attrs={'class': 'site-content'}):
    # keep only the h1's own text nodes, skipping child tags such as <strong>
    print(''.join(x for x in div.h1.contents
                  if isinstance(x, bs4.element.NavigableString)))
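Since the question asks for both pieces, here is a rough sketch that returns the heading's direct text and the <strong> text separately. The helper name split_heading is made up for this example, and it assumes there is at most one matching div.

```python
from bs4 import BeautifulSoup, NavigableString

# Sample markup copied from the question.
HTML = ('<div class="site-content"><h1>Here is the title'
        '<strong>( And a bit more! )</strong></h1></div>')


def split_heading(markup):
    """Return (direct h1 text, <strong> text), or None if there is no match.

    Hypothetical helper, written only to illustrate the approach.
    """
    soup = BeautifulSoup(markup, "html.parser")
    div = soup.find("div", attrs={"class": "site-content"})
    if div is None or div.h1 is None:
        return None
    h1 = div.h1
    # Direct children that are plain strings, i.e. not nested tags.
    title = "".join(
        node for node in h1.contents if isinstance(node, NavigableString)
    ).strip()
    # Text of the nested <strong>, if there is one.
    extra = h1.strong.get_text(strip=True) if h1.strong else ""
    return title, extra


result = split_heading(HTML)
print(result)  # ('Here is the title', '( And a bit more! )')
```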

Life is easier with lxml and XPath:

>>> from lxml import html
>>> root = html.parse('x.html')
>>> print(root.xpath('//div[@class="site-content"]/h1/text()'))
['Here is the title']
>>> print(root.xpath('//div[@class="site-content"]/h1//text()'))
['Here is the title', '( And a bit more! )']
>>> print(root.xpath('//div[@class="site-content"]/h1/strong/text()'))
['( And a bit more! )']
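If you do not have the markup saved as x.html, the same XPath queries can be run on a string. This is a sketch that assumes lxml is installed; it parses with html.document_fromstring rather than the html.parse call used above.

```python
from lxml import html

# Sample markup copied from the question.
HTML = ('<div class="site-content"><h1>Here is the title'
        '<strong>( And a bit more! )</strong></h1></div>')

# Parse the string into a full document tree (html/body are added as needed).
root = html.document_fromstring(HTML)

# text() selects only the h1's own text nodes.
title = root.xpath('//div[@class="site-content"]/h1/text()')
# strong/text() selects the text inside the nested <strong>.
extra = root.xpath('//div[@class="site-content"]/h1/strong/text()')

print(title)  # ['Here is the title']
print(extra)  # ['( And a bit more! )']
```

One thing to keep in mind: @class="site-content" is an exact string match, so it would miss a div whose class attribute lists several classes.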

Answer 2

Code that uses BeautifulSoup to extract the text of the heading inside the div together with the text inside its nested tag:

>>> from bs4 import BeautifulSoup
>>> data = """<div class="site-content"><h1>Here is the title<strong>( And a bit more! )</strong></h1>"""
>>> soup = BeautifulSoup(data, "html.parser")
>>> reqText = soup.find('div', {"class":"site-content"}).text
>>> reqText
'Here is the title( And a bit more! )'
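Note that .text glues the two strings together with nothing in between. If you want them separated, something like the following may be closer to what the question asks for. It is only a sketch; the separator choice and variable names are mine.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
HTML = ('<div class="site-content"><h1>Here is the title'
        '<strong>( And a bit more! )</strong></h1></div>')

soup = BeautifulSoup(HTML, "html.parser")
div = soup.find("div", {"class": "site-content"})

# Join the text fragments with a space instead of concatenating them directly.
joined = div.get_text(separator=" ", strip=True)

# Or keep the fragments as separate, whitespace-stripped strings.
parts = list(div.stripped_strings)

print(joined)  # Here is the title ( And a bit more! )
print(parts)   # ['Here is the title', '( And a bit more! )']
```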

Extracting a heading and its strong tag with Beautiful Soup
