BeautifulSoup: find HTML tags with specific text

Question

I am using BeautifulSoup with Python for web scraping.

For example, I have the following HTML:

<body>
    <h5 class="h-bar">
        <b class="caret"></b>
        Model 11111
        Set Item
    </h5>
</body>

Now I am trying to find any tag whose text contains the phrase "Set Item".

I tried the following:

soup.find_all('h5', text="Set Item")

I expected to get this:

    <h5 class="h-bar">
        <b class="caret"></b>
        Model 11111
        Set Item
    </h5>

However, this returns nothing, and I don't know why Beautiful Soup can't find the match. How can I detect a tag that has "Set Item" in its text?

Answer 1

I'm a BeautifulSoup newbie too. There must be a better way, but this one seems to work:

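from bs4 import BeautifulSoup
import re

# Compile the pattern once instead of on every call to predicate()
PATTERN = re.compile(r'Set Item')

def predicate(element):
    # True for <h5> tags that have a descendant string matching the pattern.
    # string= is the current name of the older text= argument (bs4 4.4+).
    return element.name == 'h5' and element.find(string=PATTERN) is not None

if __name__ == '__main__':
    soup = BeautifulSoup(open('index.html').read(), 'html.parser')
    found = soup.find_all(predicate)  # found: a list of elements
    print('Found:', found)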

Please excuse the open().read() chain. I was just being lazy.

Output:

Found: [<h5 class="h-bar">
<b class="caret"></b>
        Model 11111
        Set Item
    </h5>]
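
As for why the original call finds nothing: as far as I understand it, when a text filter is combined with a tag name, Beautiful Soup compares the filter against each tag's .string, and .string is None for a tag with more than one child. The sketch below checks this on the HTML from the question. The names html and h5 are just for illustration, and html.parser is my choice of parser.

```python
from bs4 import BeautifulSoup

# HTML copied from the question
html = """<body>
    <h5 class="h-bar">
        <b class="caret"></b>
        Model 11111
        Set Item
    </h5>
</body>"""

soup = BeautifulSoup(html, "html.parser")
h5 = soup.find("h5")

# The <h5> has several children (whitespace, <b>, a text node),
# so .string is None and a string filter has nothing to compare against.
print(h5.string)

# The exact-match search from the question therefore finds no tags.
print(soup.find_all("h5", string="Set Item"))

# get_text() joins all descendant strings, so the phrase is visible there.
print("Set Item" in h5.get_text())
```

This is why the predicate looks at the text inside the tag instead of relying on the text argument of find_all.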

Update

The predicate doesn't need to use a regular expression:

def predicate(e):
    return e and e.name == u'h5' and 'Set Item' in e.text
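
To try this without an index.html file, here is a minimal self-contained sketch that runs a variant of the predicate against the HTML from the question. The helper find_tags_containing is a hypothetical name of mine, not part of BeautifulSoup, and collapsing whitespace before the comparison is an extra assumption so that a phrase split across lines would still match.

```python
from bs4 import BeautifulSoup

# HTML copied from the question
HTML = """<body>
    <h5 class="h-bar">
        <b class="caret"></b>
        Model 11111
        Set Item
    </h5>
</body>"""

def find_tags_containing(soup, name, phrase):
    """Hypothetical helper: tags called `name` whose text contains `phrase`."""
    def predicate(tag):
        # Collapse runs of whitespace so line breaks inside the text do not matter
        text = " ".join(tag.get_text().split())
        return tag.name == name and phrase in text
    return soup.find_all(predicate)

soup = BeautifulSoup(HTML, "html.parser")
found = find_tags_containing(soup, "h5", "Set Item")
print("Found:", found)
```

Passing a function to find_all is the same mechanism the answer uses. The helper only wraps it so the tag name and phrase are parameters.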
