ERROR: Error downloading <GET URL_HERE>: User timeout caused connection failure.
我在使用刮刀时偶尔会遇到这个问题。有没有办法可以解决这个问题并在它发生时运行一个函数?我无法在任何地方找到如何做到这一点。
答案 0 :(得分:26)
您可以做的是在Request
个实例中定义errback
:
errback(callable) - 在处理请求时引发任何异常时将调用的函数。这包括因404 HTTP错误而失败的页面等。它接收a Twisted Failure instance作为第一个参数。
以下是您可以使用的一些示例代码(对于scrapy 1.0):
# -*- coding: utf-8 -*-
# errbacks.py
import scrapy
# from scrapy.contrib.spidermiddleware.httperror import HttpError
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError
class ErrbackSpider(scrapy.Spider):
name = "errbacks"
start_urls = [
"http://www.httpbin.org/", # HTTP 200 expected
"http://www.httpbin.org/status/404", # Not found error
"http://www.httpbin.org/status/500", # server issue
"http://www.httpbin.org:12345/", # non-responding host, timeout expected
"http://www.httphttpbinbin.org/", # DNS error expected
]
def start_requests(self):
for u in self.start_urls:
yield scrapy.Request(u, callback=self.parse_httpbin,
errback=self.errback_httpbin,
dont_filter=True)
def parse_httpbin(self, response):
self.logger.error('Got successful response from {}'.format(response.url))
# do something useful now
def errback_httpbin(self, failure):
# log all errback failures,
# in case you want to do something special for some errors,
# you may need the failure's type
self.logger.error(repr(failure))
#if isinstance(failure.value, HttpError):
if failure.check(HttpError):
# you can get the response
response = failure.value.response
self.logger.error('HttpError on %s', response.url)
#elif isinstance(failure.value, DNSLookupError):
elif failure.check(DNSLookupError):
# this is the original request
request = failure.request
self.logger.error('DNSLookupError on %s', request.url)
#elif isinstance(failure.value, TimeoutError):
elif failure.check(TimeoutError):
request = failure.request
self.logger.error('TimeoutError on %s', request.url)
scrapy shell中的输出(只有1次重试和5次下载超时):
$ scrapy runspider errbacks.py --set DOWNLOAD_TIMEOUT=5 --set RETRY_TIMES=1
2015-06-30 23:45:55 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-06-30 23:45:55 [scrapy] INFO: Optional features available: ssl, http11
2015-06-30 23:45:55 [scrapy] INFO: Overridden settings: {'DOWNLOAD_TIMEOUT': '5', 'RETRY_TIMES': '1'}
2015-06-30 23:45:56 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-06-30 23:45:56 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-06-30 23:45:56 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-06-30 23:45:56 [scrapy] INFO: Enabled item pipelines:
2015-06-30 23:45:56 [scrapy] INFO: Spider opened
2015-06-30 23:45:56 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-06-30 23:45:56 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-06-30 23:45:56 [scrapy] DEBUG: Retrying <GET http://www.httphttpbinbin.org/> (failed 1 times): DNS lookup failed: address 'www.httphttpbinbin.org' not found: [Errno -5] No address associated with hostname.
2015-06-30 23:45:56 [scrapy] DEBUG: Gave up retrying <GET http://www.httphttpbinbin.org/> (failed 2 times): DNS lookup failed: address 'www.httphttpbinbin.org' not found: [Errno -5] No address associated with hostname.
2015-06-30 23:45:56 [errbacks] ERROR: <twisted.python.failure.Failure <class 'twisted.internet.error.DNSLookupError'>>
2015-06-30 23:45:56 [errbacks] ERROR: DNSLookupError on http://www.httphttpbinbin.org/
2015-06-30 23:45:56 [scrapy] DEBUG: Crawled (200) <GET http://www.httpbin.org/> (referer: None)
2015-06-30 23:45:56 [scrapy] DEBUG: Crawled (404) <GET http://www.httpbin.org/status/404> (referer: None)
2015-06-30 23:45:56 [errbacks] ERROR: Got successful response from http://www.httpbin.org/
2015-06-30 23:45:56 [errbacks] ERROR: <twisted.python.failure.Failure <class 'scrapy.spidermiddlewares.httperror.HttpError'>>
2015-06-30 23:45:56 [errbacks] ERROR: HttpError on http://www.httpbin.org/status/404
2015-06-30 23:45:56 [scrapy] DEBUG: Retrying <GET http://www.httpbin.org/status/500> (failed 1 times): 500 Internal Server Error
2015-06-30 23:45:57 [scrapy] DEBUG: Gave up retrying <GET http://www.httpbin.org/status/500> (failed 2 times): 500 Internal Server Error
2015-06-30 23:45:57 [scrapy] DEBUG: Crawled (500) <GET http://www.httpbin.org/status/500> (referer: None)
2015-06-30 23:45:57 [errbacks] ERROR: <twisted.python.failure.Failure <class 'scrapy.spidermiddlewares.httperror.HttpError'>>
2015-06-30 23:45:57 [errbacks] ERROR: HttpError on http://www.httpbin.org/status/500
2015-06-30 23:46:01 [scrapy] DEBUG: Retrying <GET http://www.httpbin.org:12345/> (failed 1 times): User timeout caused connection failure.
2015-06-30 23:46:06 [scrapy] DEBUG: Gave up retrying <GET http://www.httpbin.org:12345/> (failed 2 times): User timeout caused connection failure.
2015-06-30 23:46:06 [errbacks] ERROR: <twisted.python.failure.Failure <class 'twisted.internet.error.TimeoutError'>>
2015-06-30 23:46:06 [errbacks] ERROR: TimeoutError on http://www.httpbin.org:12345/
2015-06-30 23:46:06 [scrapy] INFO: Closing spider (finished)
2015-06-30 23:46:06 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 4,
'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 2,
'downloader/exception_type_count/twisted.internet.error.TimeoutError': 2,
'downloader/request_bytes': 1748,
'downloader/request_count': 8,
'downloader/request_method_count/GET': 8,
'downloader/response_bytes': 12506,
'downloader/response_count': 4,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/404': 1,
'downloader/response_status_count/500': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 6, 30, 21, 46, 6, 537191),
'log_count/DEBUG': 10,
'log_count/ERROR': 9,
'log_count/INFO': 7,
'response_received_count': 3,
'scheduler/dequeued': 8,
'scheduler/dequeued/memory': 8,
'scheduler/enqueued': 8,
'scheduler/enqueued/memory': 8,
'start_time': datetime.datetime(2015, 6, 30, 21, 45, 56, 322235)}
2015-06-30 23:46:06 [scrapy] INFO: Spider closed (finished)
请注意scrapy如何在其统计信息中记录异常:
'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 2,
'downloader/exception_type_count/twisted.internet.error.TimeoutError': 2,
答案 1 :(得分:0)
我更喜欢这样的自定义重试中间件:
from scrapy.contrib.downloadermiddleware.retry import RetryMiddleware
from fake_useragent import FakeUserAgentError
class FakeUserAgentErrorRetryMiddleware(RetryMiddleware):
def process_exception(self, request, exception, spider):
if type(exception) == FakeUserAgentError: return self._retry(request, exception, spider)