I can scrape pages without using a proxy. However, when I add a proxy, Scrapy fails with Error downloading: Connection was refused by other side: 61: Connection refused
or [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.>]
, or it raises a timeout exception. The proxies are all HTTP proxies.
Here is what I added to settings.py:
PROXIES = [{'ip_port': '213.136.90.232:8080', 'user_pass': ''},]
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 100,
    'judgeinfo.middleware.RotateUserAgentMiddleware': 1,
    'judgeinfo.middleware.ProxyMiddleware': 100,
}
This is my middleware.py:
import random
import base64
from settings import PROXIES

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        proxy = random.choice(PROXIES)
        if proxy['user_pass'] is not None:
            request.meta['proxy'] = "http://%s" % proxy['ip_port']
            encoded_user_pass = base64.encodestring(proxy['user_pass'])
            request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
            print "**************ProxyMiddleware have pass************" + proxy['ip_port']
        else:
            print "**************ProxyMiddleware no pass************" + proxy['ip_port']
            request.meta['proxy'] = "http://%s" % proxy['ip_port']
I tested the proxy with curl and got a correct response:
curl -L 'http://IP:port' -v "http://www.stackoverflow.com"
I also added a randomly chosen USER_AGENT and set DOWNLOAD_DELAY = 3.
Answer 0: (score: 1)
Just for completeness... (I hate finding answers in comments.)
middleware.py should be changed to:
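A minimal sketch of one likely correction, assuming the culprit is the `is not None` check: the configured `user_pass` is the empty string `''`, which is not `None`, so the middleware in the question sends a `Proxy-Authorization` header with empty credentials to a proxy that expects none. The version below uses a truthiness test instead and is written for Python 3 (`base64.b64encode` rather than `base64.encodestring`). Defining `PROXIES` inline is an assumption made to keep the sketch self-contained; in the project it would still be imported from settings.py.

```python
import base64
import random

# Assumption: same shape as the PROXIES setting in the question; an empty
# 'user_pass' means the proxy needs no authentication.
PROXIES = [{'ip_port': '213.136.90.232:8080', 'user_pass': ''}]


class ProxyMiddleware(object):
    def process_request(self, request, spider):
        proxy = random.choice(PROXIES)
        # The proxy URL is set in every case.
        request.meta['proxy'] = "http://%s" % proxy['ip_port']
        # Truthiness check: both '' and None skip the auth header.
        if proxy['user_pass']:
            # b64encode adds no trailing newline, unlike encodestring.
            encoded = base64.b64encode(proxy['user_pass'].encode('utf-8'))
            request.headers['Proxy-Authorization'] = b'Basic ' + encoded
```

With the settings from the question, a request passing through this middleware gets `request.meta['proxy']` set and no `Proxy-Authorization` header. The header is only added for entries whose `user_pass` is a non-empty `user:password` string.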