Question

I have the following BeautifulSoup code in test.py.

#!/usr/bin/env python
# vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1:

from bs4 import BeautifulSoup

import sys
soup = BeautifulSoup(sys.stdin.read(), 'html.parser', from_encoding='utf-8')

import re
from pprint import pprint
pprint(soup.find('div', text=re.compile(r'Scientific')))

Here are the two HTML files:

test1.html

<div class="heading4">Scientific/Research Contact(s)</div>

test2.html

<div class="heading4"><a name="_Scientific/Research_Contact(s)"></a>Scientific/Research Contact(s)</div>

Here are the search results.

$ ./test.py < test1.html
<div class="heading4">Scientific/Research Contact(s)</div>
$ ./test.py < test2.html
None

Does anyone know why the second one is not found?

Answer 1

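When you search for an element by both name and text, BeautifulSoup checks the element's .string against the desired text. This confusing behavior is actually covered in the documentation:

If you pass one of the find* methods both string and a tag-specific argument like name, Beautiful Soup will search for tags that match your tag-specific criteria and whose Tag.string matches your value for string. It will not find strings themselves. Previously, Beautiful Soup ignored the tag-specific arguments and looked for strings.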
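As a quick check of the rule quoted above, the following sketch prints .string for the two div elements from the question. The variable names (html1, html2, div1, div2) are made up for illustration, and the markup is inlined instead of being read from stdin.

```python
from bs4 import BeautifulSoup

# Markup copied from test1.html and test2.html in the question.
html1 = '<div class="heading4">Scientific/Research Contact(s)</div>'
html2 = ('<div class="heading4"><a name="_Scientific/Research_Contact(s)"></a>'
         'Scientific/Research Contact(s)</div>')

div1 = BeautifulSoup(html1, 'html.parser').find('div')
div2 = BeautifulSoup(html2, 'html.parser').find('div')

# div1 has a single text child, so .string is that text.
print(repr(div1.string))

# div2 has two children (the empty <a> and the text node), so .string is None.
print(len(div2.contents))
print(repr(div2.string))
```

Because div2.string is None, there is nothing for the regular expression to match when the search is restricted to div tags.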

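In the second case, the div's .string is None, because the div has two children (the empty <a> element and the text node) and .string is only set when a tag has exactly one child. That is why you get no results. Instead, find the text node directly: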

soup.find(text=re.compile(r"Scientific"))

And if you need the actual parent element, you can get it from .parent:

soup.find(text=re.compile(r"Scientific")).parent
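Putting the pieces together, here is a minimal self-contained sketch run against the test2.html markup. It makes two assumptions that are not in the original answer: it uses the string= keyword, which is the name Beautiful Soup 4.4.0 and later use for the older text= argument, and it adds an alternative that passes a function to find(). The helper name is_matching_div is hypothetical.

```python
import re
from bs4 import BeautifulSoup

# Markup copied from test2.html in the question.
html = ('<div class="heading4"><a name="_Scientific/Research_Contact(s)"></a>'
        'Scientific/Research Contact(s)</div>')
soup = BeautifulSoup(html, 'html.parser')
pattern = re.compile(r'Scientific')

# Approach from the answer: find the text node, then step up to its parent.
text_node = soup.find(string=pattern)
div_via_parent = text_node.parent

# Alternative (an assumption, not from the answer): match on the tag's
# combined text, so extra child elements no longer matter.
def is_matching_div(tag):
    return tag.name == 'div' and pattern.search(tag.get_text()) is not None

div_via_filter = soup.find(is_matching_div)

print(div_via_parent)
print(div_via_filter)
```

Both lookups should return the same div here. Note that the function-based filter matches any div whose descendants contain the text, so with nested div elements it returns the outermost one first.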

How can I use BeautifulSoup to match text in a <div></div> that has an embedded <a></a>?