Scrapy crawler: request start and stop times and the proxy used during the request

Time: 2018-07-09 09:30:30

Tags: python scrapy scrapy-spider

I am trying to record request, response, and download times while using the proxy middleware Scrapy_Proxies, following the StackOverflow question "Scrapy request+response+download time".

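I have been able to get the proxy list working with Scrapy_Proxies and, separately, to output the request times. However, when I combine the start/stop timing in process_request with scrapy_proxies, I get KeyError: '__end_time'.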

Ultimately, I want to use proxies and output each request's start and end time to a CSV file.

Update #1: The solution to Question 1 is to place the scrapy_proxies settings before the timing settings in settings.py.

Question 1: How do I resolve the KeyError? In settings.py, place the scrapy_proxies section above the timing section.

Question 2: Is it possible to make the proxy used for each request a parsable item field that can be exported to CSV?

Here is my code:

settings.py

# Retry many times since proxies often fail
RETRY_TIMES = 10
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

PROXY_LIST = '/Users/guest/Documents/peel/peel/proxy/list.txt'
PROXY_MODE = 0

#https://stackoverflow.com/questions/15831955/scrapy-requestresponsedownload-time
DOWNLOADER_MIDDLEWARES = {
    'project.middlewares.DownloadTimer': 0,
}
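
A note on the two DOWNLOADER_MIDDLEWARES assignments above: in Python the second assignment replaces the first, so only the dictionary that comes last takes effect. That is a likely explanation for the KeyError (with the proxy dictionary last, DownloadTimer is never enabled), and it also suggests that with the timer dictionary last, scrapy_proxies.RandomProxy is no longer registered. A minimal sketch of a single merged dictionary, assuming the same module paths and priorities as in the question:

```python
# Sketch only: one dictionary holding both the proxy and the timing middlewares.
# The module paths and order numbers are copied from the question, not verified.
DOWNLOADER_MIDDLEWARES = {
    # Lowest number: sees the request first and the response last,
    # so it measures the whole download including proxy handling.
    'project.middlewares.DownloadTimer': 0,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}
```

With one dictionary, the order of the sections in settings.py should no longer matter.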

test_1.py (spider)

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class Test1Spider(CrawlSpider):
    name = 'test_1'
    allowed_domains = ['toscrape.com']
    start_urls = ['http://toscrape.com/']

    rules = [
        Rule(LinkExtractor(allow=['.*']),
             callback='parse_item',
             follow=True)]

    def parse_item(self, response):
        # Both keys are set by the DownloadTimer middleware; a KeyError here
        # means that middleware did not run for this request.
        end_time = response.meta['__end_time']
        start_time = response.meta['__start_time']
        total_time = end_time - start_time
        # logger.info() returns None, so its result is not stored or printed.
        self.logger.info('Download time: %.2f - %.2f = %.2f',
                         end_time, start_time, total_time)
        url = response.url
        response_meta = response.meta
        # proxy_ip = response_meta['proxy']

        print(end_time)
        print(start_time)
        print(total_time)
        print(response_meta)
        print(url)
        # print(proxy_ip)
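
For Question 2, one possible approach is to have parse_item yield a flat dictionary instead of printing, so that Scrapy's feed export can write it to CSV. The sketch below is a starting point, not a tested solution: the helper name build_timing_item and the field names are hypothetical, and it assumes the proxy is stored under meta['proxy'] as in the commented-out line above. It uses .get() so a missing key yields None rather than a KeyError.

```python
import csv
import io

# Hypothetical column names for the CSV output.
FIELDS = ['url', 'proxy', 'start_time', 'end_time', 'total_time']


def build_timing_item(url, meta):
    """Build a flat, CSV-friendly dict from a response URL and its meta dict."""
    start = meta.get('__start_time')
    end = meta.get('__end_time')
    # Only compute the duration when both timestamps are present.
    total = end - start if start is not None and end is not None else None
    return {
        'url': url,
        'proxy': meta.get('proxy'),  # assumed key, as in the question's comment
        'start_time': start,
        'end_time': end,
        'total_time': total,
    }


# Inside the spider, parse_item could then be reduced to:
#     def parse_item(self, response):
#         yield build_timing_item(response.url, response.meta)

# Stand-alone demonstration with made-up values (no Scrapy needed).
sample_meta = {'__start_time': 10.0, '__end_time': 12.5,
               'proxy': 'http://127.0.0.1:8080'}
item = build_timing_item('http://toscrape.com/', sample_meta)

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerow(item)
csv_text = buffer.getvalue()
print(csv_text)
```

If the spider yields such dictionaries, running `scrapy crawl test_1 -o timings.csv` (the file name is only an example) should export them through Scrapy's built-in CSV feed exporter, so the manual csv.DictWriter step is needed only outside Scrapy.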

middlewares.py

from time import time
from scrapy.http import Response

#https://stackoverflow.com/questions/15831955/scrapy-requestresponsedownload-time

class DownloadTimer(object):
    def process_request(self, request, spider):
        request.meta['__start_time'] = time()
        # Returning None lets middlewares with a higher order number
        # continue processing this request.
        return None

    def process_response(self, request, response, spider):
        request.meta['__end_time'] = time()
        return response  # process_response must return a Response (or a Request)

    def process_exception(self, request, exception, spider):
        request.meta['__end_time'] = time()
        # Return a placeholder Response with status 110 instead of
        # letting the exception propagate.
        return Response(
            url=request.url,
            status=110,
            request=request)
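
Why the order number matters for this timer: Scrapy calls process_request in increasing order of the numbers in DOWNLOADER_MIDDLEWARES and process_response in decreasing order. The snippet below is a simplified model of that rule, not Scrapy code; the short labels and the call_order helper are illustrative, and the built-in middlewares that Scrapy merges in from DOWNLOADER_MIDDLEWARES_BASE are left out.

```python
# Simplified model of downloader middleware ordering (not Scrapy code).
# Labels are shortened; priorities are the ones used in the question.
MIDDLEWARE_ORDER = {
    'DownloadTimer': 0,
    'RetryMiddleware': 90,
    'RandomProxy': 100,
    'HttpProxyMiddleware': 110,
}


def call_order(middlewares):
    """Return (request_order, response_order) for a name -> priority mapping."""
    by_priority = sorted(middlewares, key=middlewares.get)
    # Requests pass through in increasing order, responses in decreasing order.
    return by_priority, list(reversed(by_priority))


request_order, response_order = call_order(MIDDLEWARE_ORDER)
print('process_request order: ', request_order)
print('process_response order:', response_order)
```

Under this model, a timer at position 0 stamps the start time before any other listed middleware touches the request and stamps the end time after all of them have handled the response, so the measured interval includes the proxy handling.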

0 answers:

No answers yet.