Question

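I'm trying to do two things with Beautiful Soup:

Find and print divs with a certain class
Find and print links that contain certain text

The first part is working. The second part returns an empty list, i.e. []. While trying to troubleshoot this, I created the following, which works as expected: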

from bs4 import BeautifulSoup

def my_funct():
    content = "<div class=\"class1 class2\">some text</div> \
        <a href='#' title='Text blah5454' onclick='blahblahblah'>Text blah5454</a>"
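    # Name the parser explicitly so the result does not depend on which parsers are installed.
    soup = BeautifulSoup(content, "html.parser")
    thing1 = soup("div", "class1 class2")
    thing2 = soup("a", text="Text")
    print(thing1)
    print(thing2)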

my_funct()

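I wrote that snippet after viewing the source of the original content (from my actual implementation) in the SciTE editor. One difference, however, is that in the real link text there is an LF and four -> markers on a new line between Text and blah5454, for example: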

[Image: the link text as displayed in SciTE; no description provided]

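So I believe that is why I'm getting an empty [].

My questions are:

Could this be the reason?
If so, is "stripping" these characters the best solution, and what is the best way to do it?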

Answer 1

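When given a plain string, the text argument only matches the whole text content. You need to use a regular expression instead: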

import re

thing2 = soup("a", text=re.compile(r"\bText\b"))

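The \b word boundary anchors make sure you only match whole words, not partial words. Note the r'' raw string literal used here: \b means something different when interpreted in a normal string literal, so if you don't use a raw string literal you have to double the backslashes.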
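To make the raw-string point concrete, here is a minimal sketch that uses only the standard library. The sample string is made up to mirror the link text from the question, and the variable names are purely illustrative:

```python
import re

# Made-up sample that mirrors the link text in the question.
sample = "Text blah5454"

# In a normal string literal, "\b" is a single backspace character (0x08).
backspace_len = len("\b")
# In a raw string literal, r"\b" is two characters: a backslash and "b".
raw_len = len(r"\b")

raw_pattern = re.compile(r"\bText\b")       # word boundary anchors
escaped_pattern = re.compile("\\bText\\b")  # same pattern, backslashes doubled
plain_pattern = re.compile("\bText\b")      # literal backspaces, not anchors

raw_found = bool(raw_pattern.search(sample))
escaped_found = bool(escaped_pattern.search(sample))
plain_found = bool(plain_pattern.search(sample))
# The anchors also reject partial words such as "Textbook".
partial_found = bool(raw_pattern.search("Textbook"))

print(backspace_len, raw_len)
print(raw_found, escaped_found, plain_found, partial_found)
```

Only the raw and the double-backslash forms behave as word-boundary patterns. The third pattern compiles without error, which is why this mistake is easy to miss.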

Demo:

>>> import re
>>> from bs4 import BeautifulSoup
>>> content = "<div class=\"class1 class2\">some text</div> \
...         <a href='#' title='wooh!' onclick='blahblahblah'>Text blah5454</a>"
>>> soup = BeautifulSoup(content)
>>> soup("a", text='Text')
[]
>>> soup("a", text=re.compile(r"\bText\b"))
[<a href="#" onclick="blahblahblah" title="wooh!">Text blah5454</a>]
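As for the line feed and the other whitespace mentioned in the question, the sketch below suggests they do not need to be stripped before matching. It assumes the real link text is "Text", a line feed, four tab characters, then "blah5454". That string is a guess based on the description in the question (reading SciTE's -> markers as tabs), not the asker's actual data. It uses only the standard library and models the comparison on plain strings:

```python
import re

# Assumed link text: "Text", a line feed, four tabs, then "blah5454".
link_text = "Text\n\t\t\t\tblah5454"

# A whole-string comparison fails, with or without the extra whitespace.
exact_match = (link_text == "Text")

# A regex search only needs the word to appear somewhere in the string.
pattern = re.compile(r"\bText\b")
regex_match = bool(pattern.search(link_text))

# If the cleaned-up text is needed afterwards, collapse each run of
# whitespace (spaces, tabs, newlines) into a single space.
normalized = " ".join(link_text.split())

print(exact_match)
print(regex_match)
print(repr(normalized))
```

Beautiful Soup filters against a compiled pattern using its search() method, so the same pattern object can be passed as the text argument as in the demo above. The whitespace normalization is then only needed when printing or comparing the extracted text.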

How to remove characters that interfere with Beautiful Soup returning links with specific text?
