Question

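I will be parsing many websites with different HTML structures, and I am trying to use BeautifulSoup to find all lines that contain a specific piece of text (inside the HTML).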

r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")               
for text in soup.find_all():
    if "price" in text:
        print text

This approach does not work (even though "price" is mentioned more than 40 times in the HTML). Is there perhaps a better way to do this?

Answer 1

Why not let BeautifulSoup find the nodes that contain the desired text:

for node in soup.find_all(text=lambda x: x and "price" in x):
    print(node)
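
As a minimal offline sketch of the same idea: the loop in the question fails because `"price" in tag` checks whether "price" is one of the tag's direct children, not whether it is a substring of the tag's text. The example below parses a made-up HTML snippet (the markup and the helper name `find_text_nodes` are assumptions for illustration) with the built-in `html.parser`, and uses `string=`, the newer spelling of the `text=` argument.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used here instead of a live page.
html = """
<html><body>
<div class="item"><span>Ticket price: 10</span></div>
<p>No match here</p>
<p>Best price online</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")


def find_text_nodes(soup, needle):
    """Return every text node whose content contains `needle`."""
    return soup.find_all(string=lambda s: s is not None and needle in s)


matches = find_text_nodes(soup, "price")

# Each match is a text node; .parent gives the tag that holds it.
parents = [node.parent.name for node in matches]

for node, parent in zip(matches, parents):
    print(parent, "->", node.strip())
```

Going through `.parent` is useful when you need the enclosing tag (for its attributes or siblings) rather than just the matching string.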

Answer 2

To extract all of the text from a given URL, you can use the following:

r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")               

for element in soup.findAll(['script', 'style']):
    element.extract()

text = soup.get_text()

This also removes any possibly unwanted text inside the script and style sections. You can then search the resulting text for the text you need.
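
The final search step might look something like the sketch below. It runs on a made-up inline HTML string rather than a live URL, uses `html.parser` so that lxml is not required, and passes `separator="\n"` to `get_text()` so that each text node ends up on its own line; all of these are assumptions for the example, not part of the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; note that "price" also appears in style and script.
html = (
    "<html><head><style>.price{color:red}</style>"
    "<script>var price = 1;</script></head>"
    "<body><p>Adult price: 20</p><p>Opening hours</p>"
    "<p>Child price: 10</p></body></html>"
)

soup = BeautifulSoup(html, "html.parser")

# Drop script and style elements so their contents are not searched.
for element in soup.find_all(["script", "style"]):
    element.decompose()

# Put each text node on its own line, then keep the lines that match.
text = soup.get_text(separator="\n")
lines = [line.strip() for line in text.splitlines() if "price" in line]

print(lines)
```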

Answer 3

You do not have to use Beautiful Soup to find specific text in the HTML; you can use requests on its own instead. For example:

r = requests.get(url)
# r.text is the decoded str; on Python 3, r.content is bytes and cannot be searched with a str
if 'specific text' in r.text:
    print(r.text)
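
Since the question asks for the matching lines rather than the whole page, a plain-string approach could be sketched as follows. The helper `lines_containing` and the sample markup are hypothetical; in practice the input would be the decoded response body (`r.text`). Note that this searches the raw source, so it also matches text inside tags, scripts, and attributes.

```python
def lines_containing(html_text, needle, ignore_case=True):
    """Return (line_number, stripped_line) pairs for lines containing `needle`."""
    target = needle.lower() if ignore_case else needle
    hits = []
    for number, line in enumerate(html_text.splitlines(), start=1):
        haystack = line.lower() if ignore_case else line
        if target in haystack:
            hits.append((number, line.strip()))
    return hits


# Hypothetical sample standing in for r.text.
sample = (
    "<html>\n"
    "<body>\n"
    "  <p>Ticket Price: 10</p>\n"
    "  <p>Opening hours</p>\n"
    "  <script>var price = 1;</script>\n"
    "</body>\n"
    "</html>"
)

hits = lines_containing(sample, "price")
for number, line in hits:
    print(number, line)
```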

Answer 4

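In bs4 4.7.1, you can use the :contains pseudo-class together with * to consider all elements. There will obviously be some duplication, because a parent may contain a child with the same text. Here, I search for price.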

import requests
from bs4 import BeautifulSoup

url = 'https://www.visitsealife.com/brighton/tickets/'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
items = soup.select('*:contains(price)')
print(items)
print(len(items))
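
One possible way to reduce the duplication mentioned above is to keep only the innermost matches, that is, elements with no matching descendant. The sketch below is an illustration under assumptions, not part of the original answer: it uses made-up inline HTML, and it uses the `:-soup-contains()` spelling, which newer soupsieve releases (2.1 and later, as far as I know) prefer over the deprecated `:contains()`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used here instead of a live page.
html = (
    "<html><body><div><p>Ticket price: 10</p></div>"
    "<span>Free entry</span></body></html>"
)

soup = BeautifulSoup(html, "html.parser")

# Every element whose text (including descendants) contains "price".
items = soup.select('*:-soup-contains("price")')

# Compare by identity, since Tag equality compares content rather than position.
matched_ids = {id(el) for el in items}

# Keep an element only if none of its descendant tags also matched.
innermost = [
    el for el in items
    if not any(id(child) in matched_ids for child in el.find_all(True))
]

print([el.name for el in items])
print([el.name for el in innermost])
```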

BeautifulSoup: find all occurrences of specific text
