我的互联网连接是通过代理进行身份验证,当我尝试运行scraoy库来制作更简单的示例时,例如:
scrapy shell http://stackoverflow.com
在您使用XPath选择器请求某些内容之前,所有这一切都可以,但响应是下一个:
>>> hxs.select('//title')
[<HtmlXPathSelector xpath='//title' data=u'<title>ERROR: Cache Access Denied</title'>]
或者,如果您尝试运行在项目中创建的任何蜘蛛,则会出现以下错误:
C:\Users\Victor\Desktop\test\test>scrapy crawl test
2012-08-11 17:38:02-0400 [scrapy] INFO: Scrapy 0.16.5 started (bot: test)
2012-08-11 17:38:02-0400 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetCon
sole, CloseSpider, WebService, CoreStats, SpiderState
2012-08-11 17:38:02-0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAut
hMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, De
faultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpProxyMiddlewa
re, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-08-11 17:38:02-0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMi
ddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddle
ware
2012-08-11 17:38:02-0400 [scrapy] DEBUG: Enabled item pipelines:
2012-08-11 17:38:02-0400 [test] INFO: Spider opened
2012-08-11 17:38:02-0400 [test] INFO: Crawled 0 pages (at 0 pages/min), scraped
0 items (at 0 items/min)
2012-08-11 17:38:02-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:602
4
2012-08-11 17:38:02-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6081
2012-08-11 17:38:47-0400 [test] DEBUG: Retrying <GET http://automation.whatismyi
p.com/n09230945.asp> (failed 1 times): TCP connection timed out: 10060: Se produ
jo un error durante el intento de conexi¾n ya que la parte conectada no respondi
¾ adecuadamente tras un periodo de tiempo, o bien se produjo un error en la cone
xi¾n establecida ya que el host conectado no ha podido responder..
2012-08-11 17:39:02-0400 [test] INFO: Crawled 0 pages (at 0 pages/min), scraped
0 items (at 0 items/min)
...
2012-08-11 17:39:29-0400 [test] INFO: Closing spider (finished)
2012-08-11 17:39:29-0400 [test] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
'downloader/exception_type_count/twisted.internet.error.TCPTimedOutError': 3,
'downloader/request_bytes': 732,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2012, 8, 11, 21, 39, 29, 908000),
'log_count/DEBUG': 9,
'log_count/ERROR': 1,
'log_count/INFO': 5,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2012, 8, 11, 21, 38, 2, 876000)}
2012-08-11 17:39:29-0400 [test] INFO: Spider closed (finished)
看起来我的代理问题了。如果有人知道使用scrapy和身份验证代理的方法,请告诉我。
答案 0 :(得分:4)
Scrapy使用HttpProxyMiddleware支持代理:
此中间件通过设置将HTTP代理设置为用于请求 Request对象的代理元值。像Python标准一样 库模块urllib和urllib2,它遵循以下环境 变量:
- http_proxy
- https_proxy
- NO_PROXY
另见:
答案 1 :(得分:0)
Mahmoud M. Abdel-Fattah 重复答案,因为该页面现在不可用。归功于他,但是,我做了一些修改。
如果middlewares.py
已经存在,则将以下代码添加到其中。
class ProxyMiddleware(object):
# overwrite process request
def process_request(self, request, spider):
# Set the location of the proxy
request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"
# Use the following lines if your proxy requires authentication
proxy_user_pass = "USERNAME:PASSWORD"
# setup basic authentication for the proxy
encoded_user_pass = base64.encodestring(proxy_user_pass.encode())
#encoded_user_pass = base64.encodestring(proxy_user_pass)
request.headers['Proxy-Authorization'] = 'Basic ' + \
str(encoded_user_pass)
在settings.py文件中,添加以下代码
DOWNLOADER_MIDDLEWARES = {
'project_name.middlewares.ProxyMiddleware': 100,
}
这应该通过设置http_proxy
来起作用。但是,就我而言,我正在尝试使用HTTPS协议访问URL,需要设置https_proxy
,我仍在调查中。任何线索都将有很大帮助。