I'm doing some scraping (Scrapy 1.3.3) and using the spider_opened and spider_closed Scrapy signals to verify whether a scrape succeeded or failed.
The spider_closed extension:
import datetime
import logging

from scrapy import signals
from scrapy.exceptions import NotConfigured

from testna import pipelines
# FROMADDR and TOADDR are defined elsewhere in the project.

logger = logging.getLogger(__name__)


class SendEmail(object):
    def __init__(self):
        self.fromaddr = FROMADDR
        self.toaddr = TOADDR

    @classmethod
    def from_crawler(cls, crawler):
        # First check if the extension should be enabled and raise
        # NotConfigured otherwise.
        if not crawler.settings.getbool('MYEXT_ENABLED'):
            raise NotConfigured
        # Instantiate the extension object and hook it up to the signal.
        ext = cls()
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        # from_crawler must return the extension instance.
        return ext

    def spider_closed(self, spider, reason):
        logger.info("closed spider %s", spider.name)
        pipelines.active_scrape.scrape_end = datetime.datetime.now()
        if reason == "finished":
            pipelines.active_scrape.scrape_status = "Finished"
            pipelines.active_scrape.scrape_status_reason = reason
        elif reason == "cancelled":
            pipelines.active_scrape.scrape_status = "Failed"
            pipelines.active_scrape.scrape_status_reason = reason
        elif reason == "shutdown":
            pipelines.active_scrape.scrape_status = "Failed"
            pipelines.active_scrape.scrape_status_reason = reason
settings.py (only the uncommented lines):
BOT_NAME = 'testna'
SPIDER_MODULES = ['testna.spiders']
NEWSPIDER_MODULE = 'testna.spiders'
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 5

ITEM_PIPELINES = {
    'testna.pipelines.TestnaPipeline': 300,
}

MYEXT_ENABLED = True
EXTENSIONS = {
    'scrapy.extensions.telnet.TelnetConsole': None,
    'testna.extensions.SendEmail': 500,
}
The spider (actual data removed and replaced with xyz):
import scrapy
import re
from decimal import Decimal


class TestnaSpider(scrapy.Spider):
    name = "testna"

    def start_requests(self):
        urls = [
            'http://www.xyz.example',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        xs = response.css("#xyz")
        for x in xs:
            URL = 'http://www.xyz.example'
            ID = x.css("xyz")
            ID2 = x.css("xyz")
            for ID, naziv in zip(ID, ID2):
                # The callback's first argument is the response.
                yield scrapy.Request(url=URL + ID,
                                     callback=lambda response, naziv=naziv: self.parse_x(naziv, response))

    def parse_x(self, id_x, response):
        xs = response.css("xyz")
        for x in xs:
            desc1 = ''.join(x.css(".xyz::text").extract())
            desc2 = ''.join(x.css("xyz::text").extract())
            yield {"description1": desc1, "description2": desc2, "ID": id_x}
        NEXT_PAGE_SELECTOR = ".Pagination-item--next a::attr(href)"
        next_page = response.css(NEXT_PAGE_SELECTOR).extract_first()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=lambda response, id_x=id_x: self.parse_x(id_x, response)
            )
When a network problem occurs and Scrapy cannot fetch the data (no route to host, DNS lookup failure; I have run into several of these), the failed requests are retried and the spider then closes, with the spider_closed signal reporting reason "finished" and no indication of the problem.
Log:
2017-08-01 14:47:31 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.xyz.example> (failed 1 times): An error occurred while connecting: 113: No route to host.
2017-08-01 14:47:31 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.xyz.example> (failed 2 times): An error occurred while connecting: 113: No route to host.
2017-08-01 14:47:37 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.xyz.example> (failed 3 times): An error occurred while connecting: 113: No route to host.
2017-08-01 14:47:37 [scrapy.core.scraper] ERROR: Error downloading <GET http://www.xyz.example>: An error occurred while connecting: 113: No route to host.
2017-08-01 14:47:56 [scrapy.extensions.logstats] INFO: Crawled 4 pages (at 4 pages/min), scraped 49 items (at 49 items/min)
2017-08-01 14:48:56 [scrapy.extensions.logstats] INFO: Crawled 4 pages (at 0 pages/min), scraped 49 items (at 0 items/min)
2017-08-01 14:49:56 [scrapy.extensions.logstats] INFO: Crawled 4 pages (at 0 pages/min), scraped 49 items (at 0 items/min)
2017-08-01 14:50:21 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.xyz.example> (failed 1 times): User timeout caused connection failure: Getting http://www.xyz.example took longer than 180.0 seconds..
2017-08-01 14:50:22 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.xyz.example> (failed 2 times): An error occurred while connecting: 113: No route to host.
2017-08-01 14:50:29 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.xyz.example> (failed 3 times): An error occurred while connecting: 113: No route to host.
2017-08-01 14:50:29 [scrapy.core.scraper] ERROR: Error downloading <GET http://www.xyz.example>: An error occurred while connecting: 113: No route to host.
2017-08-01 14:50:29 [scrapy.core.engine] INFO: Closing spider (finished)
2017-08-01 14:50:29 [testna.extensions] INFO: closed spider testna
How can I verify that the whole scrape went OK when there are NW problems like the ones above, which don't close the spider with any indication of failure?
I have tried all of the available signals, but none of them ties back to the underlying NW problems.
Answer 0 (score: 0)
Let's start by agreeing that "113" is technically not a response. After all, it doesn't come from the server/site the initial request was sent to, so relying on the native signals won't work; as we just agreed, there is no real response to signal about.
An error occurred while connecting: 113: No route to host.
It can have many causes; the point is that when Twisted tries to resolve the host, nothing can physically "reach" or "touch" it. I'll stick to the question, but if you have a few minutes, read this. It's good stuff to know, and you can skip straight to the parts that mention DHCP and ARP layering.
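For context, that 113 is the operating system's errno, not an HTTP status. A minimal check (assuming a Linux host, where errno 113 maps to EHOSTUNREACH) shows it is the same code Twisted surfaces in the logs:

import errno
import os

# On Linux, errno 113 is EHOSTUNREACH, the "No route to host"
# that Twisted reports in the retry log lines above.
print(errno.errorcode[113])  # EHOSTUNREACH
print(os.strerror(113))      # No route to host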
How do you debug something like this? I'd suggest going ahead and creating a custom logger extension, though you'd mostly get the same information (assuming you also enable the depth middleware and crank up the log verbosity, which doesn't buy you much). I'd start from the inside out.
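If you go that route, here is one sketch of the idea. The logger name scrapy.downloadermiddlewares.retry and its "Gave up retrying" message are visible in your own log above; RetryWatchHandler is a hypothetical name:

import logging

class RetryWatchHandler(logging.Handler):
    # Counts requests the retry middleware gave up on.
    def __init__(self):
        logging.Handler.__init__(self, level=logging.DEBUG)
        self.gave_up = 0

    def emit(self, record):
        if record.getMessage().startswith('Gave up retrying'):
            self.gave_up += 1

# Attach it to the retry middleware's logger, e.g. in your extension's __init__:
watcher = RetryWatchHandler()
logging.getLogger('scrapy.downloadermiddlewares.retry').addHandler(watcher)
# After the crawl, watcher.gave_up > 0 means some requests were abandoned.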
So you want to capture this event as it happens in a real-world run: log it and dump it verbosely to a text file? Straight to stdout? Both? There's an extension for that!
Whether you log to a file or to output, StackTraceDump may be what you're looking for. In theory, you could also raise a signal of your own whenever the "113" error appears and have it trigger a function of your own design.
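One concrete way to get that per-error hook, sketched on the assumption that you're happy to add an errback to your requests (errback is a standard scrapy.Request argument, and the exception classes are Twisted's; what you do on failure is up to you):

import scrapy
from twisted.internet.error import ConnectError, DNSLookupError, TCPTimedOutError, TimeoutError

class TestnaSpider(scrapy.Spider):
    name = "testna"

    def start_requests(self):
        yield scrapy.Request(
            'http://www.xyz.example',
            callback=self.parse,
            errback=self.on_network_error,  # fires once retries are exhausted
        )

    def parse(self, response):
        pass  # parsing as in the question

    def on_network_error(self, failure):
        # "No route to host" arrives as a ConnectError subclass;
        # DNS trouble arrives as DNSLookupError.
        if failure.check(ConnectError, DNSLookupError, TimeoutError, TCPTimedOutError):
            self.logger.error("network failure for %s: %s",
                              failure.request.url, failure.value)
            # ...flag the scrape as suspect / push your notification here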
You asked:
How can I verify that the whole scrape went OK when there are NW problems like the ones above, which don't close the spider with any indication of failure?
Well, besides an explicit stack trace before the MAX TIME wait kills the request, I can think of a few neat things to do with a "custom signal" driven from the spider instance's stdout.
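A lighter-weight version of the same idea, sketched as an assumed extension of the SendEmail class from the question: Scrapy's retry middleware and downloader record standard counters (retry/max_reached, downloader/exception_count) in the crawler stats, and spider_closed can use them to distinguish a clean "finished" from one that silently gave up on requests:

from scrapy import signals
from scrapy.exceptions import NotConfigured

class SendEmail(object):
    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.getbool('MYEXT_ENABLED'):
            raise NotConfigured
        ext = cls()
        ext.stats = crawler.stats  # keep a handle on the stats collector
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider, reason):
        gave_up = self.stats.get_value('retry/max_reached', 0)
        errors = self.stats.get_value('downloader/exception_count', 0)
        if reason == 'finished' and (gave_up or errors):
            # The crawl "finished", but some requests were abandoned en route.
            spider.logger.warning(
                'finished, but %d request(s) were abandoned and %d download error(s) occurred',
                gave_up, errors)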
You could also lean on the telnet console (note that your settings above currently disable it by mapping TelnetConsole to None): pause the spider, issue a request that checks the HTTP status, and on a 200 unpause and/or push some kind of notification.
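Roughly like this in an interactive session (the console listens on localhost:6023 by default; engine, est() and stats are the objects Scrapy's telnet console exposes):

$ telnet localhost 6023
>>> engine.pause()       # freeze the crawl
>>> est()                # dump an engine status report
>>> stats.get_stats()    # inspect counters such as retry/max_reached
>>> engine.unpause()     # resume the crawl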
Wait! What if the error comes from a local network problem and we can't send the email at all? Well, then you could use your LAN to push some kind of notification through an in-house email server or a plain socket.
I know this is a lot of customization and extra development, but back to your question of verifying that the scrape went smoothly: watch for the 113. StackTraceDump isn't a signal, but you can build a trigger function for when it shows up. If it doesn't show up, I'd call that a successful scrape!