I am trying to use the proxy middleware scrapy_proxies together with the request/response/download-time technique from the StackOverflow question "Scrapy request+response+download time".
I have been able to get a proxy list working with scrapy_proxies, and I can get the request times to output on their own. However, when I combine the start/stop process_request timing with scrapy_proxies, I get KeyError: '__end_time'.
Ultimately, I want to use proxies and output each request's start and end time to a CSV.
Update #1: The solution to Problem 1 was to place the scrapy_proxies section of settings.py before the timing section.
Problem 1: How do I resolve the KeyError? (Solved: in settings.py, move the scrapy_proxies section above the time-estimate section.)
Problem 2: Is it possible to expose the proxy as a parseable item that can be exported to CSV?
Here is my code:
settings.py
# Retry many times since proxies often fail
RETRY_TIMES = 10
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

PROXY_LIST = '/Users/guest/Documents/peel/peel/proxy/list.txt'
PROXY_MODE = 0

# https://stackoverflow.com/questions/15831955/scrapy-requestresponsedownload-time
DOWNLOADER_MIDDLEWARES = {
    'project.middlewares.DownloadTimer': 0,
}
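Note that settings.py assigns DOWNLOADER_MIDDLEWARES twice, and in Python the second dict silently replaces the first, so only one group of middlewares is actually enabled at a time. A minimal sketch of a single merged dict that keeps both the proxy middlewares and the timer active (assuming the module path 'project.middlewares.DownloadTimer' from the question):

```python
# One merged DOWNLOADER_MIDDLEWARES dict, so neither assignment is lost.
# DownloadTimer at priority 0 runs first on requests and last on responses,
# so it brackets the whole download, including the proxy handling.
DOWNLOADER_MIDDLEWARES = {
    'project.middlewares.DownloadTimer': 0,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}
```

With a merged dict, the order of the sections in settings.py no longer matters, because no assignment overwrites another.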
test_1.py (the spider)
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class Test1Spider(CrawlSpider):
    name = 'test_1'
    allowed_domains = ['toscrape.com']
    start_urls = ['http://toscrape.com/']
    rules = [
        Rule(LinkExtractor(allow=['.*']),
             callback='parse_item',
             follow=True),
    ]
    def parse_item(self, response):
        end_time = response.meta['__end_time']
        start_time = response.meta['__start_time']
        total_time = end_time - start_time
        self.logger.info('Download time: %.2f - %.2f = %.2f'
                         % (end_time, start_time, total_time))
        url = response.url
        response_meta = response.meta
        # proxy_ip = response_meta['proxy']
        print(end_time)
        print(start_time)
        print(total_time)
        print(response_meta)
        print(url)
        # print(proxy_ip)
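For Problem 2, the timing values and the proxy can be yielded from parse_item as a plain dict and exported with Scrapy's built-in CSV feed exporter. The helper below is a hypothetical sketch (build_timing_item and its field names are my assumptions, not part of scrapy_proxies): it flattens response.meta into one CSV-friendly row, reading the 'proxy' key that the proxy middleware stores in meta when a proxy is used.

```python
def build_timing_item(meta, url):
    """Build a flat, CSV-friendly record from a response's meta dict.

    `meta` is expected to carry the keys set by DownloadTimer
    ('__start_time', '__end_time') and, when a proxy was used,
    the 'proxy' key set by the proxy middleware.
    """
    start = meta['__start_time']
    end = meta['__end_time']
    return {
        'url': url,
        'start_time': start,
        'end_time': end,
        'download_time': end - start,
        'proxy': meta.get('proxy', ''),  # empty when no proxy was used
    }
```

In parse_item the spider would then `yield build_timing_item(response.meta, response.url)` and be run with `scrapy crawl test_1 -o times.csv` to write the rows to CSV.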
middlewares.py
from time import time

from scrapy.http import Response


# https://stackoverflow.com/questions/15831955/scrapy-requestresponsedownload-time
class DownloadTimer(object):

    def process_request(self, request, spider):
        request.meta['__start_time'] = time()
        # Returning None lets middlewares with a greater priority number run.
        return None

    def process_response(self, request, response, spider):
        request.meta['__end_time'] = time()
        return response  # must return the response so it keeps propagating

    def process_exception(self, request, exception, spider):
        request.meta['__end_time'] = time()
        # Return a dummy response with a non-standard status code so the
        # failure still reaches the spider with the timing data attached.
        return Response(
            url=request.url,
            status=110,
            request=request)
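To see why the callback can read the timestamps at all: Scrapy hands the callback a response whose meta is the originating request's meta, so keys written in process_request/process_response are visible in parse_item. A minimal stand-alone simulation of the DownloadTimer flow (plain stand-in objects instead of real Scrapy requests; an illustration, not Scrapy itself):

```python
from time import time


class FakeRequest:
    """Stand-in for scrapy.Request: only the meta dict matters here."""
    def __init__(self):
        self.meta = {}


class DownloadTimer:
    def process_request(self, request, spider):
        request.meta['__start_time'] = time()

    def process_response(self, request, response, spider):
        request.meta['__end_time'] = time()
        return response


timer = DownloadTimer()
req = FakeRequest()
timer.process_request(req, spider=None)                   # before the download
timer.process_response(req, response=None, spider=None)   # after the download

# The callback reads response.meta, which is the same dict as request.meta.
elapsed = req.meta['__end_time'] - req.meta['__start_time']
print('elapsed: %.6f s' % elapsed)
```

If a later middleware (such as the retry middleware) replaces the request, the fresh request goes back through process_request, so both keys are set again; the KeyError appears only when DownloadTimer is not registered at all, as happens when its DOWNLOADER_MIDDLEWARES entry is overwritten.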