Question

We use Scrapy + Splash for crawling, and we want to use multiple proxies. However, a Splash proxy profile supports only a single proxy: https://splash.readthedocs.io/en/stable/api.html#proxy-profiles

[proxy]

; required
host=proxy.crawlera.com
port=8010

; optional, default is no auth
username=username
password=password

; optional, default is HTTP. Allowed values are HTTP and SOCKS5
type=HTTP

How can we use multiple proxies when crawling with Scrapy + Splash?

Answer 1

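There are several options:

Use multiple proxy profiles (as Rafael Almeida suggested in the comments);
Pass a different proxy URL with each request (see http://splash.readthedocs.io/en/stable/api.html#arg-proxy);
Write a Splash Lua script and call request:set_proxy inside a splash:on_request callback; the documentation has an example. This lets you set a different proxy for each request that the page initiates, rather than a single proxy per rendered page. I don't know how to do this in other browser automation tools such as PhantomJS or Selenium.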
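
The second and third options could be sketched roughly as follows. This is only an illustration under assumptions, not a tested setup: the proxy URLs and the helper names (next_proxy, render_args, execute_args) are hypothetical, and the Lua script is adapted from the pattern in the Splash documentation. The helpers only build the argument dictionaries that would be sent to Splash, rotating through the proxy list on each call.

```python
import itertools
from urllib.parse import urlsplit

# Hypothetical proxy URLs (assumption); replace with your own.
PROXIES = [
    "http://user:pass@proxy1.example.com:8010",
    "http://user:pass@proxy2.example.com:8010",
]
_proxy_cycle = itertools.cycle(PROXIES)


def next_proxy():
    """Return the next proxy URL in round-robin order."""
    return next(_proxy_cycle)


def render_args():
    """Option 2: arguments for render.html using the `proxy` argument.

    Splash accepts a proxy URL here instead of a profile name.
    """
    return {"proxy": next_proxy()}


# Option 3: a Lua script that sets the proxy per outgoing request.
# The values are read from splash.args, so they can differ per call.
LUA_SOURCE = """
function main(splash)
    splash:on_request(function(request)
        request:set_proxy{
            host = splash.args.proxy_host,
            port = splash.args.proxy_port,
            username = splash.args.proxy_username,
            password = splash.args.proxy_password,
        }
    end)
    assert(splash:go(splash.args.url))
    return splash:html()
end
"""


def execute_args():
    """Option 3: arguments for the execute endpoint."""
    parts = urlsplit(next_proxy())
    return {
        "lua_source": LUA_SOURCE,
        "proxy_host": parts.hostname,
        "proxy_port": parts.port,
        "proxy_username": parts.username,
        "proxy_password": parts.password,
    }
```

With scrapy-splash, these dictionaries would presumably be passed as SplashRequest(url, callback, args=render_args()) for the default render.html endpoint, or as SplashRequest(url, callback, endpoint="execute", args=execute_args()) for the Lua script. In the Lua variant, the on_request callback could also pick a proxy per request instead of one per page.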
