Question

I am trying to get the elements in an HTML document that contain the following pattern of text: #\S{11}

<h2> this is cool #12345678901 </h2>

The element above would be matched using:

soup('h2',text=re.compile(r' #\S{11}'))

And the results would be something like:

[u'blahblah #223409823523', u'thisisinteresting #293845023984']

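I am able to get all the text that matches (see the line above). But I want the parent element of the matching text, so I can use it as a starting point for traversing the document tree. In this case, I want all the h2 elements returned, not the text matches.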

Ideas?

Answer 1

from BeautifulSoup import BeautifulSoup
import re

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text)


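# Each match is the text node itself; .parent is the tag that encloses it.
for elem in soup(text=re.compile(r' #\S{11}')):
    # The search covers every text node, so skip matches outside h2 (e.g. the h1).
    if elem.parent.name == 'h2':
        print elem.parent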

Prints:

<h2>this is cool #12345678901</h2>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
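
The same approach can be sketched for Python 3 and Beautiful Soup 4. This is a minimal sketch, not the answer's original code: it assumes bs4 4.4.0 or later (where the string= argument is available) and the built-in html.parser, and the names pattern and h2_tags are illustrative.

```python
import re
from bs4 import BeautifulSoup

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

pattern = re.compile(r" #\S{11}")
soup = BeautifulSoup(html_text, "html.parser")

# find_all(string=...) returns the matching text nodes;
# .parent on each node is the tag that encloses it.
h2_tags = [
    node.parent
    for node in soup.find_all(string=pattern)
    if node.parent.name == "h2"  # assumption: only h2 parents are wanted
]

for tag in h2_tags:
    print(tag)
```

Each entry in h2_tags is a Tag, so it can serve as the starting point for further traversal (for example tag.find_next_sibling()).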

Answer 2

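BeautifulSoup search operations return a list of BeautifulSoup.NavigableString objects when text= is used as a criterion, as opposed to BeautifulSoup.Tag objects in other cases. Check the object's __dict__ to see the attributes available to you. Of these attributes, parent is favored over previous because of changes in BS4.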
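
The difference between the two result types can be checked directly. The following is a small sketch under bs4 (assuming version 4.4.0 or later for string=, and the built-in html.parser); the variable names text_hits and tag_hits are illustrative only.

```python
import re
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

soup = BeautifulSoup("<h2>this is cool #12345678901</h2>", "html.parser")
pattern = re.compile(r" #\S{11}")

# Searching by text alone returns text nodes (NavigableString).
text_hits = soup.find_all(string=pattern)

# Searching by tag name returns Tag objects.
tag_hits = soup.find_all("h2")

print(isinstance(text_hits[0], NavigableString))
print(isinstance(tag_hits[0], Tag))

# The text node's .parent is the enclosing h2 tag itself.
print(text_hits[0].parent is tag_hits[0])
```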

Answer 3

With bs4 (Beautiful Soup 4), the OP's attempt works exactly as expected:

import re
from bs4 import BeautifulSoup
soup = BeautifulSoup("<h2> this is cool #12345678901 </h2>")
soup('h2',text=re.compile(r' #\S{11}'))

This returns [<h2> this is cool #12345678901 </h2>].

Using BeautifulSoup to find an HTML tag that contains certain text
