Question

As the title says, I am trying to get all of the text data from multiple websites. If I use the following, I can get the text data (<p>):

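import requests
import lxml.html

url = "https://Somerandomwebsite.com"
response = requests.get(url, timeout=5)
tree = lxml.html.fromstring(response.text)
# find_class() only matches elements that carry this site-specific class name
things = tree.find_class("the class that contains <p>")
data = [thing.text_content() for thing in things]
print(data)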

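However, this only works for one website, because it grabs the class above and then grabs the text inside it. I don't want to go into every website and find the class that the text data belongs to. Is there a way to search only for the text (<p>) on every website and then get all of that data?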

Any help would be greatly appreciated.

Answer 1

I think beautifulsoup4 makes this much easier:

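 import requests
 from bs4 import BeautifulSoup

 url = 'https://www.crummy.com/software/BeautifulSoup/bs4/doc/'
 response = requests.get(url, timeout=5)
 # Name the parser explicitly; without it bs4 picks one itself and emits a warning
 soup = BeautifulSoup(response.text, 'html.parser')
 # find_all('p') matches every <p> on the page, whatever class wraps it
 things = soup.find_all('p')
 data = [p.get_text() for p in things]
 print(data)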

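You can also write a loop that visits each website and extracts all of its <p> elements, then append them to a single list or process them however you need.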
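A rough sketch of that loop might look like the following. The function names (`fetch_html`, `extract_paragraphs`, `collect_paragraphs`), the `fetch` parameter, the 5-second timeout, and the choice to record an empty list for a site that fails are all my own assumptions for illustration, not something from the question.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its HTML, raising on HTTP errors."""
    response = requests.get(url, timeout=5)
    response.raise_for_status()
    return response.text


def extract_paragraphs(html):
    """Return the stripped text of every non-empty <p> element in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    texts = (p.get_text().strip() for p in soup.find_all("p"))
    return [text for text in texts if text]


def collect_paragraphs(urls, fetch=fetch_html):
    """Map each URL to its list of paragraph texts.

    A site that cannot be fetched is reported and mapped to an empty list,
    so one bad URL does not stop the whole run.
    """
    results = {}
    for url in urls:
        try:
            results[url] = extract_paragraphs(fetch(url))
        except requests.RequestException as exc:
            print(f"Skipping {url}: {exc}")
            results[url] = []
    return results
```

Calling `collect_paragraphs([...])` with your list of URLs returns a dict keyed by URL, so you can still tell which site each paragraph came from. If you would rather have one flat list, extend a single list inside the loop instead.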

Trying to access <p> from multiple websites
