Question

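I have been trying to do some web scraping. The main idea of the code below is to collect the URLs from a site's search results. I am running into a problem with this code: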

import cloudscraper
from bs4 import BeautifulSoup

URL_WEB_URB = "https://adondevivir.com"

scraper = cloudscraper.create_scraper()
web = scraper.get("https://www.adondevivir.com/departamentos-en-alquiler-en-jesus-maria-ordenado-por-fechaonline-descendente-pagina-3.html")

depa_info = BeautifulSoup(web.text, "lxml")
publicaciones = depa_info.select(".postingCard")

pub_links = [URL_WEB_URB + ref["data-to-posting"] for ref in publicaciones]

print(pub_links)

I get the following error:

ProxyError: HTTPSConnectionPool(host='www.adondevivir.com', port=443): Max retries exceeded with url: /departamentos-en-alquiler-en-jesus-maria-ordenado-por-fechaonline-descendente-pagina-3.html (Caused by ProxyError('Cannot connect to proxy.', OSError('Tunnel connection failed: 407 authentication')))

I have traced the error to this line:

web = scraper.get("https://www.adondevivir.com/departamentos-en-alquiler-en-jesus-maria-ordenado-por-fechaonline-descendente-pagina-3.html")

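but I can't seem to fix it. I tried changing the URL (https to http), but that made no difference. I have searched for answers but found nothing that covers this kind of code.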

Answer 1

After a while I found the solution. I had to configure a proxy for the scraper so that the network would stop blocking it.
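For anyone hitting the same error: HTTP 407 is the "Proxy Authentication Required" status, so the request was going through a proxy that wanted credentials. Below is a minimal sketch of passing a proxy to the scraper, not necessarily the exact setup I used. The proxy host, port, user name and password are placeholders, and `build_proxies` and `fetch_posting_links` are hypothetical helper names introduced only for this example. It relies on the scraper returned by `cloudscraper.create_scraper()` accepting the same `proxies` argument as a `requests` session.

```python
from urllib.parse import quote


def build_proxies(host, port, username=None, password=None):
    """Return a requests-style proxies mapping for one HTTP proxy.

    Every argument is a placeholder; substitute your own network's values.
    """
    credentials = ""
    if username is not None:
        # Percent-encode the credentials so characters such as '@' or ':'
        # cannot break the structure of the proxy URL.
        credentials = quote(username, safe="")
        if password is not None:
            credentials += ":" + quote(password, safe="")
        credentials += "@"
    proxy_url = f"http://{credentials}{host}:{port}"
    # Use the same proxy for plain HTTP and for HTTPS (which is tunnelled
    # through the proxy).
    return {"http": proxy_url, "https": proxy_url}


def fetch_posting_links(url, proxies, base_url="https://adondevivir.com"):
    """Download one results page through the proxy and return posting URLs."""
    # Third-party packages are imported here so that build_proxies can be
    # used (and tested) without them installed.
    import cloudscraper
    from bs4 import BeautifulSoup

    scraper = cloudscraper.create_scraper()
    web = scraper.get(url, proxies=proxies, timeout=30)
    web.raise_for_status()  # fail loudly on HTTP error statuses

    soup = BeautifulSoup(web.text, "lxml")
    # Same selector and attribute as in the question.
    return [base_url + card["data-to-posting"] for card in soup.select(".postingCard")]
```

With that in place, a call such as `fetch_posting_links(page_url, build_proxies("proxy.example.com", 8080, "user", "secret"))` sends the request through the proxy, where `page_url` is the search URL from the question. If the proxy needs no login, leave out the user name and password.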
