Question

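I have a problem where soup.find_all works for the h2 tag, but soup.find fails when I also specify the text.

I need to find the h2 tags with various texts such as Introduction, Results, and so on, as shown in the attached image.

Can anyone offer some advice? Thanks.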

print(soup.find_all('h2'))
[<h2 class="Heading">Abstract</h2>, 
<h2 class="Heading" data-role="collapsible-handle" tabindex="-1">Introduction<span class="section-icon"></span></h2>, 
<h2 class="Heading" data-role="collapsible-handle" tabindex="-1">Patients and methods<span class="section-icon"></span></h2>, 
<h2 class="Heading" data-role="collapsible-handle" tabindex="-1">Results<span class="section-icon"></span></h2>, 
<h2 class="Heading" data-role="collapsible-handle" tabindex="-1">Discussion<span class="section-icon"></span></h2>, 
<h2 class="Heading" data-role="collapsible-handle" id="copyrightInformation" tabindex="-1">Copyright information<span class="section-icon"></span></h2>, 
<h2 class="Heading" data-role="collapsible-handle" id="aboutarticle" tabindex="-1">About this article<span class="section-icon"></span></h2>, 
<h2 class="u-isVisuallyHidden">Article actions</h2>, <h2 class="u-h4 u-jsIsVisuallyHidden">Article contents</h2>, 
<h2 class="u-isVisuallyHidden">Cookies</h2>]

print(soup.find('h2', text='Introduction'))
None

Answer 1

Try this:

soup.find(lambda el: el.name == "h2" and "Introduction" in el.text)
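For reference, here is a minimal self-contained sketch of that approach. The markup is a trimmed-down stand-in for the page in the question, and the built-in html.parser backend is used only to avoid extra dependencies; both are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the page in the question (assumed for illustration).
html = (
    '<h2 class="Heading">Abstract</h2>'
    '<h2 class="Heading" tabindex="-1">Introduction'
    '<span class="section-icon"></span></h2>'
)
soup = BeautifulSoup(html, "html.parser")

# The function is called once per tag. el.text joins all text found under
# the tag, so the empty <span> child does not get in the way.
match = soup.find(lambda el: el.name == "h2" and "Introduction" in el.text)
print(match)
```

Since this is a substring test (`in`), it would also match a heading such as "Introduction and background"; compare with `==` if an exact match is needed.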

Answer 2

When text/string is used as a filter together with a tag name, BeautifulSoup takes tag.string and compares it with the filter value. In this case:

import bs4

html = '''<h2 class="Heading" data-role="collapsible-handle" tabindex="-1">Introduction<span class="section-icon"></span></h2>'''
soup = bs4.BeautifulSoup(html,'lxml')
print(soup.h2.string)

Output:

None

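Why .string returns None:

If a tag contains more than one thing, then it's not clear what .string should refer to, so .string is defined to be None.

Here the h2 tag has two children, the text Introduction and an empty span tag, so .string is ambiguous and returns None.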
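The rule can be checked directly by comparing a heading with a single child against one with two children. This is only a sketch: the id attributes are hypothetical and were added here just to tell the two headings apart.

```python
from bs4 import BeautifulSoup

# The id attributes are hypothetical; they only make the two tags easy to select.
soup = BeautifulSoup(
    '<h2 id="a">Abstract</h2>'
    '<h2 id="b">Introduction<span class="section-icon"></span></h2>',
    "html.parser",
)
single_child = soup.find("h2", id="a")
two_children = soup.find("h2", id="b")

# One child (a text node): .string is that text.
print(len(single_child.contents), repr(single_child.string))
# Two children (a text node plus <span>): .string is None.
print(len(two_children.contents), repr(two_children.string))
```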

@Thomas Lehoux's answer is the right way to do it.

This is the BS3 API:

findNextSiblings(name, attrs, text, limit, **kwargs)

This is the BS4 API:

find_next_siblings(name, attrs, string, limit, **kwargs)

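You will notice that the old version uses text and the current version uses string, but they are the same thing: both compare against tag.string, and BS4 accepts either name. BS4 simply keeps the old spelling for backward compatibility, that's all.

I could not find tag.text documented in the API of either version, but it behaves like tag.get_text(): it concatenates all the text under the tag.

In your case: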

soup.h2.string    >>>  None
soup.h2.text      >>>  Introduction
soup.h2.get_text()>>>  Introduction

In short:

text in filter is tag.string
text in tag itself is tag.text

I think you are better off using find(string=' ') in practice; it is less confusing.
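To make the difference concrete, the sketch below runs the string filter against a heading with and without a span child, then falls back to get_text() to pick headings by name. The markup is a shortened copy of the question's output and the `wanted` set is a hypothetical example, not something from the original page.

```python
from bs4 import BeautifulSoup

# Shortened copy of the headings from the question (assumed for illustration).
html = (
    '<h2 class="Heading">Abstract</h2>'
    '<h2 class="Heading">Introduction<span class="section-icon"></span></h2>'
    '<h2 class="Heading">Results<span class="section-icon"></span></h2>'
)
soup = BeautifulSoup(html, "html.parser")

# The filter is compared with tag.string, so only the span-free heading matches.
plain = soup.find("h2", string="Abstract")          # found
with_span = soup.find("h2", string="Introduction")  # None: .string is None

# Comparing against get_text() works for both kinds of heading.
wanted = {"Introduction", "Results"}  # hypothetical list of section names
sections = [h2 for h2 in soup.find_all("h2") if h2.get_text(strip=True) in wanted]
print(plain, with_span, sections)
```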

Answer 3

text='Introduction' searches for navigable strings, not tags.

From the docs:

text is an argument that lets you search for NavigableString objects instead of Tags.

You should try:

print(soup.find(text='Introduction').parent)
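A runnable sketch of this suggestion, assuming a single heading copied from the question and the html.parser backend. It uses string=, the current spelling of the text= argument.

```python
from bs4 import BeautifulSoup

# One heading copied from the question (assumed sample input).
html = (
    '<h2 class="Heading" tabindex="-1">Introduction'
    '<span class="section-icon"></span></h2>'
)
soup = BeautifulSoup(html, "html.parser")

# With no tag name given, find() searches the text nodes themselves.
text_node = soup.find(string="Introduction")

# find() returns None when nothing matches, so guard before using .parent.
heading = text_node.parent if text_node is not None else None
print(heading)
```

The comparison is exact, so a text node with surrounding whitespace or a line break would not match; a compiled regular expression passed as string= may be more forgiving in that case.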

BeautifulSoup tags: find_all succeeds but find fails
