I am checking for (internet) connection errors in my spider.py with the following:
import scrapy
from scrapy.exceptions import CloseSpider
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError

def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, callback=self.parse, errback=self.handle_error)

def handle_error(self, failure):
    if failure.check(DNSLookupError):  # or failure.check(UnknownHostError):
        request = failure.request
        self.logger.error('DNSLookupError on: %s', request.url)
        print("\nDNS Error! Please check your internet connection!\n")
    elif failure.check(HttpError):
        response = failure.value.response
        self.logger.error('HttpError on: %s', response.url)
        print('\nSpider closed because of Connection issues!\n')
        raise CloseSpider('Because of Connection issues!')
...
However, when the spider runs and the connection is down, I still get a Traceback (most recent call last): message. I would like to get rid of this by handling the error and shutting the spider down properly.

The output I get is:
2018-10-11 12:52:15 [NewAds] ERROR: DNSLookupError on: https://x.com
DNS Error! Please check your internet connection!
2018-10-11 12:52:15 [scrapy.core.scraper] ERROR: Error downloading <GET https://x.com>
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/lib/python3.6/site-packages/twisted/python/failure.py", line 408, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
  File "/usr/lib/python3.6/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/lib/python3.6/site-packages/twisted/internet/endpoints.py", line 954, in startConnectionAttempts
    "no results for hostname lookup: {}".format(self._hostStr)
twisted.internet.error.DNSLookupError: DNS lookup failed: no results for hostname lookup: x.com.
From this you can notice the following:

My errback catches and handles the DNSLookupError, but Scrapy then still logs its own ERROR: Error downloading, with the full traceback ending in twisted.internet.error.DNSLookupError. (The scraper logs the download failure itself, independently of the request's errback.)

How can I handle the [scrapy.core.scraper] ERROR: Error downloading and make sure the spider is closed down properly?

(Or: How can I check for an internet connection on spider startup?)
Answer 0 (score: 0)
OK, I have been trying to play nice with Scrapy and to exit gracefully when there is no internet connection or on other errors. The result? I could not get it to work properly. Instead, I ended up just shutting down the entire interpreter and all its nagging deferred children with os._exit(0), like this:
import os
import socket
#from scrapy.exceptions import CloseSpider
...

def check_connection(self):
    try:
        socket.create_connection(("www.google.com", 443))
        return True
    except OSError:
        pass
    return False

def start_requests(self):
    if not self.check_connection():
        print('Connection Lost! Please check your internet connection!', flush=True)
        os._exit(0)  # Kill Everything
        #CloseSpider('Grace Me!')       # Close clean but expect deferred errors!
        #raise CloseSpider('No Grace')  # Raise Exception (w. Traceback)?!
...
That did it!
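As a side note, socket.create_connection returns a connected socket and accepts a timeout; a minimal variant of the check above (host, port, and timeout are just example values) that bounds the wait and closes the socket:

import socket

def check_connection(self, host="www.google.com", port=443, timeout=5):
    try:
        # create_connection raises OSError on failure; close the socket
        # on success so we don't leak a file descriptor per check.
        socket.create_connection((host, port), timeout=timeout).close()
        return True
    except OSError:
        return False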
NOTE:
I tried to shut down Scrapy with various internal methods, and to deal with the nagging [scrapy.core.scraper] ERROR: Error downloading issue. That error only seems(?) to show up when you use raise CloseSpider('Because of Connection issues!'), among many other attempts. On top of it comes a twisted.internet.error.DNSLookupError, which seems to pop up out of nowhere even though I had already handled it in my own code. Obviously, raise is the way to always raise an exception manually, so use CloseSpider() without it instead.
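For what it's worth, the traceback itself can at least be muted: the logger name appears verbatim in the output above, so raising its level hides the handled failure (a sketch; this only silences the message, it does not close the spider):

import logging

# Suppress the 'Error downloading' tracebacks from the scraper for
# failures the errback already handles; other loggers are unaffected.
logging.getLogger('scrapy.core.scraper').setLevel(logging.CRITICAL)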
The problem at hand also seems to be a recurring one in the Scrapy framework; in fact, the source code has a few FIXMEs in there. Even though I tried to apply things like:
def stop(self):
    self.deferred = defer.Deferred()
    for name, signal in vars(signals).items():
        if not name.startswith('_'):
            disconnect_all(signal)
    self.deferred.callback(None)
...and using these...
#self.stop()
#sys.exit()
#disconnect_all(signal, **kwargs)
#self.crawler.engine.close_spider(spider, 'cancelled')
#scrapy.crawler.CrawlerRunner.stop()
#crawler.signals.stop()
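For reference, the engine call in that list is usually invoked from inside the spider roughly like this (an internal, undocumented API; as noted above, even this did not produce a clean shutdown here):

def handle_error(self, failure):
    # self.crawler is available once the spider is bound to a crawler;
    # this asks the engine to schedule an orderly spider shutdown.
    self.crawler.engine.close_spider(self, 'connection_issues')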
PS. It would be great if the Scrapy developers could document how best to handle such a simple case as no internet connection.
Answer 1 (score: 0)
I believe I may have just found the answer. To exit start_requests gracefully, return []. This tells the engine that there are no requests to process.

To close the spider, call the close() method on it: self.close('reason')
import logging
import scrapy
import socket


class SpiderIndex(scrapy.Spider):
    name = 'test'

    def check_connection(self):
        try:
            socket.create_connection(("www.google.com", 443))
            return True
        except Exception:
            pass
        return False

    def start_requests(self):
        if not self.check_connection():
            print('Connection Lost! Please check your internet connection!', flush=True)
            self.close(self, 'Connection Lost!')
            return []

        # Continue as normal ...
        request = scrapy.Request(url='https://www.google.com', callback=self.parse)
        yield request

    def parse(self, response):
        self.log(f'===TEST SPIDER: PARSE REQUEST======{response.url}===========', logging.INFO)
Addendum: For some strange reason, self.close('reason') worked on one spider, while on another I had to change it to self.close(self, 'reason').
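A plausible explanation (paraphrasing Scrapy's source from around that time): Spider.close is a static method whose first argument is the spider itself, so self.close(self, 'reason') is the call that actually matches the signature:

# Roughly what scrapy.Spider defines:
@staticmethod
def close(spider, reason):
    closed = getattr(spider, 'closed', None)
    if callable(closed):
        return closed(reason)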
Answer 2 (score: 0)
twist.def had a similar issue: it catches an exception after trying to close the Twisted connection, which prevents the code from shutting down cleanly.

So, I popped the core...
os._exit(0)
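If killing the interpreter feels too blunt, the connection check can also run before Scrapy ever starts, so there is nothing to tear down. A sketch using CrawlerProcess (the spider import path is hypothetical; default settings assumed):

import socket

from scrapy.crawler import CrawlerProcess
#from myproject.spiders.test import SpiderIndex  # hypothetical import path

def have_connection(host="www.google.com", port=443, timeout=5):
    try:
        socket.create_connection((host, port), timeout=timeout).close()
        return True
    except OSError:
        return False

if __name__ == '__main__':
    if have_connection():
        process = CrawlerProcess()
        process.crawl(SpiderIndex)  # the spider class from the answer above
        process.start()             # blocks until the crawl finishes
    else:
        print('No internet connection! Not starting the crawl.')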