Override scrapy logging, esp. from middleware

Time: 2017-10-03 13:51:54

Tags: logging scrapy robots.txt scrapy-middleware

I am using Scrapy in a project, and I have my own JSON log format.

I want to avoid any multi-line stack traces from Scrapy, especially from the robots.txt middleware. I would like either a proper single-line error, or the whole stack trace bundled into a single message.

How can I disable or override this logging behavior? Here is an example stack trace I get from the robots.txt downloader middleware:

2017-10-03 19:08:57 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET http://www.somedomain.com/robots.txt>: DNS lookup failed: no results for hostname lookup: www.somedomain.com.
Traceback (most recent call last):
  File "/Users/auser/.virtualenvs/myenv/lib/python3.5/site-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/Users/auser/.virtualenvs/myenv/lib/python3.5/site-packages/twisted/python/failure.py", line 393, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/Users/auser/.virtualenvs/myenv/lib/python3.5/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
  File "/Users/auser/.virtualenvs/myenv/lib/python3.5/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Users/auser/.virtualenvs/myenv/lib/python3.5/site-packages/twisted/internet/endpoints.py", line 954, in startConnectionAttempts
    "no results for hostname lookup: {}".format(self._hostStr)
twisted.internet.error.DNSLookupError: DNS lookup failed: no results for hostname lookup: www.somedomain.com.

1 Answer:

Answer 0 (score: 1)

I do not know why you dislike the error message being multi-line (it is simply the printout of the exception traceback). In any case, we can customize the format of Scrapy's logging. Assume you run your crawl from the Scrapy command line, e.g. `scrapy crawl` or `scrapy runspider`. Below is sample code (Python 3) showing how to use your own formatter.

import logging

import scrapy


class OneLineFormatter(logging.Formatter):
    """Render each log record, then collapse it onto a single line."""

    def format(self, record):
        # The base class appends the full traceback (if any) to the message;
        # replacing the newlines afterwards folds everything into one line.
        formatted = super(OneLineFormatter, self).format(record)
        return formatted.replace('\n', ' ')


class TestSpider(scrapy.Spider):
    name = "test"
    start_urls = [
        'http://www.somenxdomain.com/robots.txt',
    ]

    def __init__(self, fmt, datefmt, *args, **kwargs):
        # Install the custom formatter on every handler attached to the
        # root logger, so all of Scrapy's log records pass through it.
        my_formatter = OneLineFormatter(fmt=fmt, datefmt=datefmt)
        root = logging.getLogger()
        for h in root.handlers:
            h.setFormatter(my_formatter)
        super(TestSpider, self).__init__(*args, **kwargs)

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # Pull the format strings from the project settings so the output
        # keeps the same layout as Scrapy's defaults, and forward any
        # remaining spider arguments.
        settings = crawler.settings
        return cls(settings.get('LOG_FORMAT'), settings.get('LOG_DATEFORMAT'),
                   *args, **kwargs)

    def parse(self, response):
        pass
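
If you save this as, say, test_spider.py (the file name is arbitrary), you can run it directly from the Scrapy command line:

    scrapy runspider test_spider.py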

Here are some explanations.

  1. Python logging workflow. Scrapy itself uses Python's built-in logging system, so you will need some basic knowledge of Python logging, especially the relationship between the `Logger`, `Handler`, `Filter` and `Formatter` classes. I strongly recommend studying the [logging flow](https://docs.python.org/2/howto/logging.html#logging-flow) of Python logging (a minimal sketch of these objects is given at the end of this answer).

  2. Scrapy logging and settings. If your spider is run from the Scrapy command line, e.g. `scrapy crawl` or `scrapy runspider`, the Scrapy function [configure_logging](https://doc.scrapy.org/en/latest/topics/logging.html#scrapy.utils.log.configure_logging) is called to initialize logging. The [Scrapy logging](https://doc.scrapy.org/en/latest/topics/logging.html) documentation explains how to customize logging, and [Scrapy settings](https://doc.scrapy.org/en/latest/topics/settings.html) explains how to access your settings.

  3. How the example code works. The basic workflow is:

    • First, define your own formatter class to customize the logging format.
    • Second, in your spider, access the format settings in order to initialize your formatter.
    • Finally, in your spider, get the root logger and set your formatter on all of the root logger's handlers.
  4. If you write your own script and use Scrapy as an API, see [Run Scrapy from a script](https://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script); in that case you need to configure the logging yourself (a sketch is given at the end of this answer).

    The formatter above has no effect until the spider has been initialized. Here is some sample output:

    2017-10-03 11:59:39 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapybot)
    2017-10-03 11:59:39 [scrapy.utils.log] INFO: Overridden settings: {'SPIDER_LOADER_WARN_ONLY': True}
    2017-10-03 11:59:39 [scrapy.middleware] INFO: Enabled extensions:
    ['scrapy.extensions.corestats.CoreStats',
     'scrapy.extensions.telnet.TelnetConsole',
     'scrapy.extensions.logstats.LogStats']
    2017-10-03 11:59:39 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',  'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',  'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',  'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',  'scrapy.downloadermiddlewares.retry.RetryMiddleware',  'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',  'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',  'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',  'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',  'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2017-10-03 11:59:39 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',  'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',  'scrapy.spidermiddlewares.referer.RefererMiddleware',  'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',  'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2017-10-03 11:59:39 [scrapy.middleware] INFO: Enabled item pipelines: []
    2017-10-03 11:59:39 [scrapy.core.engine] INFO: Spider opened
    2017-10-03 11:59:39 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2017-10-03 11:59:39 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
    2017-10-03 11:59:39 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.somenxdomain.com/robots.txt> (failed 1 times): DNS lookup failed: no results for hostname lookup: www.somenxdomain.com.
    2017-10-03 11:59:39 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.somenxdomain.com/robots.txt> (failed 2 times): DNS lookup failed: no results for hostname lookup: www.somenxdomain.com.
    2017-10-03 11:59:39 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.somenxdomain.com/robots.txt> (failed 3 times): DNS lookup failed: no results for hostname lookup: www.somenxdomain.com.
    2017-10-03 11:59:39 [scrapy.core.scraper] ERROR: Error downloading <GET http://www.somenxdomain.com/robots.txt> Traceback (most recent call last):   File "/Users/xxx/anaconda/envs/p3/lib/python3.6/site-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks     result = result.throwExceptionIntoGenerator(g)   File "/Users/xxx/anaconda/envs/p3/lib/python3.6/site-packages/twisted/python/failure.py", line 393, in throwExceptionIntoGenerator     return g.throw(self.type, self.value, self.tb)   File "/Users/xxx/anaconda/envs/p3/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request     defer.returnValue((yield download_func(request=request,spider=spider)))   File "/Users/xxx/anaconda/envs/p3/lib/python3.6/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks     current.result = callback(current.result, *args, **kw)   File "/Users/xxx/anaconda/envs/p3/lib/python3.6/site-packages/twisted/internet/endpoints.py", line 954, in startConnectionAttempts     "no results for hostname lookup: {}".format(self._hostStr) twisted.internet.error.DNSLookupError: DNS lookup failed: no results for hostname lookup: www.somenxdomain.com.
    2017-10-03 11:59:40 [scrapy.core.engine] INFO: Closing spider (finished)
    2017-10-03 11:59:40 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/exception_count': 3,  'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 3,  'downloader/request_bytes': 684,  'downloader/request_count': 3,  'downloader/request_method_count/GET': 3,  'finish_reason': 'finished',  'finish_time': datetime.datetime(2017, 10, 3, 15, 59, 40, 46636),  'log_count/DEBUG': 4,  'log_count/ERROR': 1,  'log_count/INFO': 7,  'scheduler/dequeued': 3,  'scheduler/dequeued/memory': 3,  'scheduler/enqueued': 3,  'scheduler/enqueued/memory': 3,  'start_time': datetime.datetime(2017, 10, 3, 15, 59, 39, 793795)}
    2017-10-03 11:59:40 [scrapy.core.engine] INFO: Spider closed (finished)
    

    You can see that once the spider is running, all messages are formatted onto a single line (each '\n' is replaced with a space).
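
As mentioned in point 1, here is a minimal sketch, independent of Scrapy, of how the Python logging classes fit together; the logger name and format string are made up for illustration:

import logging

# A Logger produces records, Handlers send them somewhere, a Formatter
# renders each record to text, and Filters can drop records along the way.
logger = logging.getLogger('demo')
handler = logging.StreamHandler()  # writes to stderr
handler.setFormatter(logging.Formatter('%(levelname)s %(name)s: %(message)s'))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info('this record flows Logger -> Handler -> Formatter')

And for point 4, here is a sketch of what configuring the logging yourself could look like when you use Scrapy as an API. It assumes the OneLineFormatter and TestSpider defined above are importable; the module name myspiders is hypothetical:

import logging

from scrapy.crawler import CrawlerProcess
from scrapy.utils.log import configure_logging

from myspiders import OneLineFormatter, TestSpider  # hypothetical module

# Ask Scrapy not to install its default root handler, then attach our own
# handler with the one-line formatter to the root logger.
configure_logging(install_root_handler=False)
handler = logging.StreamHandler()
handler.setFormatter(OneLineFormatter(
    fmt='%(asctime)s [%(name)s] %(levelname)s: %(message)s',  # Scrapy's default LOG_FORMAT
    datefmt='%Y-%m-%d %H:%M:%S'))                             # ... and LOG_DATEFORMAT
root = logging.getLogger()
root.addHandler(handler)
root.setLevel(logging.INFO)

process = CrawlerProcess(install_root_handler=False)
process.crawl(TestSpider)
process.start()  # blocks until the crawl finishes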

    Hope this helps.