I'm using Scrapy in a project, and I have my own JSON log format.
I want to avoid any multi-line stack traces in Scrapy's output, especially those coming from the robots.txt middleware. I'd like either a proper single-line error or the whole stack trace bundled into a single message.
How can I disable or override this logging behavior? Here is an example stack trace I get from the robots.txt downloader middleware:
```
2017-10-03 19:08:57 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET http://www.somedomain.com/robots.txt>: DNS lookup failed: no results for hostname lookup: www.somedomain.com.
Traceback (most recent call last):
  File "/Users/auser/.virtualenvs/myenv/lib/python3.5/site-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/Users/auser/.virtualenvs/myenv/lib/python3.5/site-packages/twisted/python/failure.py", line 393, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/Users/auser/.virtualenvs/myenv/lib/python3.5/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request, spider=spider)))
  File "/Users/auser/.virtualenvs/myenv/lib/python3.5/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Users/auser/.virtualenvs/myenv/lib/python3.5/site-packages/twisted/internet/endpoints.py", line 954, in startConnectionAttempts
    "no results for hostname lookup: {}".format(self._hostStr)
twisted.internet.error.DNSLookupError: DNS lookup failed: no results for hostname lookup: www.somedomain.com.
```
Answer 0 (score: 1)
I'm not sure why you dislike the error message being multi-line (it's simply the printed exception traceback). In any case, you can customize the format of Scrapy's logging. Assume you run your crawl from the `scrapy` command line, e.g. `scrapy crawl` or `scrapy runspider`. Below is sample code (Python 3) showing how to plug in your own formatter:
```python
import logging

import scrapy


class OneLineFormatter(logging.Formatter):
    """Collapse multi-line log records (e.g. tracebacks) into a single line."""

    def __init__(self, *args, **kwargs):
        super(OneLineFormatter, self).__init__(*args, **kwargs)

    def format(self, record):
        formatted = super(OneLineFormatter, self).format(record)
        return formatted.replace('\n', ' ')


class TestSpider(scrapy.Spider):
    name = "test"
    start_urls = [
        'http://www.somenxdomain.com/robots.txt',
    ]

    def __init__(self, fmt, datefmt, *args, **kwargs):
        # Install the one-line formatter on every handler of the root logger.
        my_formatter = OneLineFormatter(fmt=fmt, datefmt=datefmt)
        root = logging.getLogger()
        for h in root.handlers:
            h.setFormatter(my_formatter)
        super(TestSpider, self).__init__(*args, **kwargs)

    @classmethod
    def from_crawler(cls, crawler):
        # Reuse the project's log format settings for the custom formatter.
        settings = crawler.settings
        return cls(settings.get('LOG_FORMAT'), settings.get('LOG_DATEFORMAT'))

    def parse(self, response):
        pass
```
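To see the core trick in isolation, here is a minimal standalone demo, independent of Scrapy (the logger name `demo` and the format string are arbitrary choices for illustration):

```python
import logging


class OneLineFormatter(logging.Formatter):
    def format(self, record):
        # logging appends the traceback to the message; flatten it afterwards.
        return super(OneLineFormatter, self).format(record).replace('\n', ' ')


handler = logging.StreamHandler()
handler.setFormatter(OneLineFormatter(
    fmt='%(asctime)s [%(name)s] %(levelname)s: %(message)s'))
logger = logging.getLogger('demo')
logger.addHandler(handler)
logger.setLevel(logging.DEBUG)

try:
    raise ValueError('boom')
except ValueError:
    logger.exception('Something failed')  # printed as a single line
```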
Some explanations follow.
Python logging workflow. Scrapy itself uses Python's built-in logging system, so you need some basic knowledge of Python logging, especially how the `Logger`, `Handler`, `Filter` and `Formatter` classes relate to each other. I strongly recommend the [logging flow](https://docs.python.org/2/howto/logging.html#logging-flow) section of the Python logging HOWTO; a minimal sketch of the relationship follows below.
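As a quick orientation, here is how those four classes fit together (the logger name and format string are arbitrary examples):

```python
import logging

# Logger: the entry point that produces log records.
logger = logging.getLogger('myapp')
# Handler: ships records to a destination (here: stderr).
handler = logging.StreamHandler()
# Formatter: renders each record to text.
handler.setFormatter(logging.Formatter('%(asctime)s %(levelname)s %(message)s'))
# Filter: lets through only records from the 'myapp' logger hierarchy.
handler.addFilter(logging.Filter('myapp'))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info('hello')  # the record passes the filter, then the formatter renders it
```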
Scrapy logging and settings. If your spider is run by the `scrapy` command line, e.g. `scrapy crawl` or `scrapy runspider`, the Scrapy function [configure_logging](https://doc.scrapy.org/en/latest/topics/logging.html#scrapy.utils.log.configure_logging) is called to initialize logging. The [Scrapy logging documentation](https://doc.scrapy.org/en/latest/topics/logging.html) explains how to customize logging, and the [settings documentation](https://doc.scrapy.org/en/latest/topics/settings.html) covers how to access your settings; a short sketch of calling `configure_logging` yourself is shown below.
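For instance, when you don't go through the `scrapy` command line, you can initialize logging yourself (a minimal sketch; the format string is just an example):

```python
from scrapy.utils.log import configure_logging

# Install Scrapy's root log handler with a custom format.
configure_logging({'LOG_FORMAT': '%(asctime)s [%(name)s] %(levelname)s: %(message)s'})
```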
How the example code works. The basic workflow is: `from_crawler` reads `LOG_FORMAT` and `LOG_DATEFORMAT` from the crawler settings and passes them to `__init__`, which builds a `OneLineFormatter`, gets the `root` logger, and sets the formatter on all of `root`'s handlers. If you write your own script and use Scrapy as an API, see [Run Scrapy from a script](https://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script); in that case you need to configure logging yourself, as in the sketch below.
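A sketch of that script scenario, assuming the `TestSpider` class from the example above is in scope (the settings values here are illustrative):

```python
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    'LOG_FORMAT': '%(asctime)s [%(name)s] %(levelname)s: %(message)s',
    'LOG_DATEFORMAT': '%Y-%m-%d %H:%M:%S',
})
process.crawl(TestSpider)  # TestSpider defined in the example above
process.start()            # blocks until the crawl finishes
```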
Note that the formatter takes effect only once the spider has been initialized; messages logged before that keep the default multi-line format. Here is some sample output:
```
2017-10-03 11:59:39 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapybot)
2017-10-03 11:59:39 [scrapy.utils.log] INFO: Overridden settings: {'SPIDER_LOADER_WARN_ONLY': True}
2017-10-03 11:59:39 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2017-10-03 11:59:39 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-10-03 11:59:39 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-10-03 11:59:39 [scrapy.middleware] INFO: Enabled item pipelines: []
2017-10-03 11:59:39 [scrapy.core.engine] INFO: Spider opened
2017-10-03 11:59:39 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-10-03 11:59:39 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2017-10-03 11:59:39 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.somenxdomain.com/robots.txt> (failed 1 times): DNS lookup failed: no results for hostname lookup: www.somenxdomain.com.
2017-10-03 11:59:39 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.somenxdomain.com/robots.txt> (failed 2 times): DNS lookup failed: no results for hostname lookup: www.somenxdomain.com.
2017-10-03 11:59:39 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.somenxdomain.com/robots.txt> (failed 3 times): DNS lookup failed: no results for hostname lookup: www.somenxdomain.com.
2017-10-03 11:59:39 [scrapy.core.scraper] ERROR: Error downloading <GET http://www.somenxdomain.com/robots.txt> Traceback (most recent call last): File "/Users/xxx/anaconda/envs/p3/lib/python3.6/site-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks result = result.throwExceptionIntoGenerator(g) File "/Users/xxx/anaconda/envs/p3/lib/python3.6/site-packages/twisted/python/failure.py", line 393, in throwExceptionIntoGenerator return g.throw(self.type, self.value, self.tb) File "/Users/xxx/anaconda/envs/p3/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request defer.returnValue((yield download_func(request=request,spider=spider))) File "/Users/xxx/anaconda/envs/p3/lib/python3.6/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks current.result = callback(current.result, *args, **kw) File "/Users/xxx/anaconda/envs/p3/lib/python3.6/site-packages/twisted/internet/endpoints.py", line 954, in startConnectionAttempts "no results for hostname lookup: {}".format(self._hostStr) twisted.internet.error.DNSLookupError: DNS lookup failed: no results for hostname lookup: www.somenxdomain.com.
2017-10-03 11:59:40 [scrapy.core.engine] INFO: Closing spider (finished)
2017-10-03 11:59:40 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/exception_count': 3, 'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 3, 'downloader/request_bytes': 684, 'downloader/request_count': 3, 'downloader/request_method_count/GET': 3, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2017, 10, 3, 15, 59, 40, 46636), 'log_count/DEBUG': 4, 'log_count/ERROR': 1, 'log_count/INFO': 7, 'scheduler/dequeued': 3, 'scheduler/dequeued/memory': 3, 'scheduler/enqueued': 3, 'scheduler/enqueued/memory': 3, 'start_time': datetime.datetime(2017, 10, 3, 15, 59, 39, 793795)}
2017-10-03 11:59:40 [scrapy.core.engine] INFO: Spider closed (finished)
```
You can see that once the spider is running, every message is formatted as a single line (the `'\n'` characters are replaced).
Hope this helps.