scrapy https proxy 403 error - works in curl

Date: 2017-08-15 18:03:19

Tags: python http https proxy scrapy

I have a Scrapy 1.4.0 project on Linux with HttpProxyMiddleware enabled, i.e. my settings.py includes:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 10,
}

When I run my spider (named sslproxies) with the following commands, I get an error:

export https_proxy=https://123.123.123.123:3128
scrapy crawl sslproxies -o output/data.csv

The relevant error:

2017-08-15 18:57:20 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.sslproxies.org/> (referer: None)
2017-08-15 18:57:20 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.sslproxies.org/>: HTTP status code is not handled or not allowed
2017-08-15 18:57:20 [scrapy.core.engine] INFO: Closing spider (finished)

A 403 means the request was forbidden. However, if I test the proxy server with curl:

curl -vx https://123.123.123.123:3128 https://httpbin.org/headers

I get a valid response, and the proxy server is used:

* Establish HTTP proxy tunnel to httpbin.org:443
> CONNECT httpbin.org:443 HTTP/1.1
> Host: httpbin.org:443
> User-Agent: curl/7.47.0
> Proxy-Connection: Keep-Alive
> 
< HTTP/1.1 200 Connection established

If I bypass the proxy by unsetting the https_proxy environment variable, the spider works fine. Am I missing something in the Scrapy HTTP proxy middleware configuration?
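For reference, Scrapy's HttpProxyMiddleware picks up proxies from the standard environment variables (as far as I can tell it uses urllib's getproxies() under the hood), so a quick way to confirm what Python sees in the environment:

from urllib.request import getproxies  # Python 3; in Python 2 it lives in urllib

# Prints the proxy mapping derived from the environment variables,
# e.g. {'https': 'https://123.123.123.123:3128'}
print(getproxies())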

1 answer:

Answer 0 (score: 0)

2017-08-15 18:57:20 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.sslproxies.org/> (referer: None)

shows that your spider is making the request to https://www.sslproxies.org/ and the site is returning a 403. To set the proxy explicitly on each request instead of relying on the environment variable, create another middleware like this:

class CustomProxyMiddleware(object):

    def process_request(self, request, spider):
        # Set the proxy explicitly on every outgoing request;
        # HttpProxyMiddleware keeps request.meta['proxy'] if it is already set.
        request.meta['proxy'] = "https://123.123.123.123:3128"

This means every request the spider makes will use the proxy.
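Note that the new middleware only runs if it is also registered in DOWNLOADER_MIDDLEWARES. A minimal sketch, assuming the class is defined in myproject/middlewares.py (the module path is hypothetical; adjust it to your project layout):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    # 'myproject.middlewares' is a placeholder module path
    'myproject.middlewares.CustomProxyMiddleware': 350,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 400,
}

Giving the custom middleware a lower priority number than HttpProxyMiddleware makes its process_request run first, so request.meta['proxy'] is already set when the built-in middleware looks for it.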