Question

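I'm working on a project where I need to do some scraping. The project runs on Google App Engine, and we're currently on Python 2.5. Ideally we would use PyQuery, but because we're running on App Engine and Python 2.5, that isn't an option.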

I've seen questions like "finding an HTML tag with certain text", but they don't quite cover what I need.

I have some HTML that looks like this:

<div class="post">
    <div class="description">
        This post is about <a href="http://www.wikipedia.org">Wikipedia.org</a>
    </div>
</div>
<!-- More posts of similar format -->

In PyQuery, I could do something like this (as far as I know):

s = pq(html)
s(".post:contains('This post is about Wikipedia.org')")
# returns all posts containing that text

Naively, I tried something similar in BeautifulSoup:

soup = BeautifulSoup(html)
soup.findAll(True, "post", text=("This post is about Google.com"))
# []

However, this returned no results. I changed my query to use a regular expression and got a bit further, but still no luck:

soup.findAll(True, "post", text=re.compile(".*This post is about.*Google.com.*"))
# []

If I leave out Google.com it works, but then I have to do all the filtering by hand. Is there any way to simulate :contains using BeautifulSoup?

Alternatively, is there a PyQuery-like library that works on App Engine (on Python 2.5)?

Answer 1

From the BeautifulSoup documentation (emphasis mine):

"text is an argument that lets you search for NavigableString objects instead of Tags"

That is to say, your code:

soup.findAll(True, "post", text=re.compile(".*This post is about.*Google.com.*"))

is not the same as:

regex = re.compile('.*This post is about.*Google.com.*')
[post for post in soup.findAll(True, 'post') if regex.match(post.text)]

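The reason you had to remove Google.com is that the BeautifulSoup tree contains one NavigableString object for "This post is about" and another for "Google.com", and they sit under different elements. The text argument is tested against each string separately, so no single string can ever match a pattern that spans both.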
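
To make that concrete, here is a minimal sketch of what the tree looks like. It assumes the modern bs4 package and its `find_all(string=...)` spelling rather than the BeautifulSoup 3 used in the question (where the equivalent call is `findAll(text=...)`), and the HTML is trimmed down to a single post that links to Google.com so that it matches the queries above.

```python
import re
from bs4 import BeautifulSoup  # assumption: bs4, not the BeautifulSoup 3 from the question

# Trimmed-down post (illustrative only), with whitespace removed for clarity.
html = ('<div class="post"><div class="description">'
        'This post is about <a href="http://www.google.com">Google.com</a>'
        '</div></div>')
soup = BeautifulSoup(html, "html.parser")

# Each text node is its own NavigableString, attached to its own parent tag.
strings = [(s.parent.name, str(s)) for s in soup.find_all(string=True)]
print(strings)   # [('div', 'This post is about '), ('a', 'Google.com')]

# No single string holds both halves, so a pattern spanning them matches nothing.
spanning = soup.find_all(string=re.compile("This post is about.*Google.com"))
print(spanning)  # []
```

The first half lives directly under the description div and the second half under the a tag, which is why the search only starts matching once Google.com is dropped from the pattern.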

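Incidentally, post.text exists but is not documented, so I wouldn't rely on it either; I wrote that code by accident! Use some other means to gather all the text under post.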
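
One way to gather the text without post.text might look like the sketch below. It is only an illustration under a few assumptions: it uses bs4 (in BeautifulSoup 3 the string search is spelled `findAll(text=True)`), the helper names `full_text` and `posts_containing` are made up for this example, and the second post linking to Google.com is added purely so the filter has something to tell apart. Whitespace is collapsed before comparing, which the real :contains selector does not do, so treat it as a rough equivalent.

```python
from bs4 import BeautifulSoup  # assumption: bs4, not the BeautifulSoup 3 from the question

# The post from the question plus a second, made-up post about Google.com.
HTML = """
<div class="post">
    <div class="description">
        This post is about <a href="http://www.wikipedia.org">Wikipedia.org</a>
    </div>
</div>
<div class="post">
    <div class="description">
        This post is about <a href="http://www.google.com">Google.com</a>
    </div>
</div>
"""

def full_text(tag):
    """Join every string under `tag` and collapse runs of whitespace."""
    pieces = tag.find_all(string=True)
    return " ".join("".join(pieces).split())

def posts_containing(soup, needle):
    """Hypothetical helper: rough stand-in for .post:contains(needle)."""
    return [post for post in soup.find_all("div", {"class": "post"})
            if needle in full_text(post)]

soup = BeautifulSoup(HTML, "html.parser")
matches = posts_containing(soup, "This post is about Google.com")
print(len(matches))          # 1
print(matches[0].a["href"])  # http://www.google.com
```

Because the comparison runs on the joined text of the whole post, it no longer matters that the sentence is split between the div and the link.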

How can I simulate ":contains" using BeautifulSoup?
