Scrapy error signals - no information given on timeouts (or other network problems) while scraping

Date: 2017-08-01 14:22:58

Tags: scrapy scrapy-spider

I'm doing some scraping (Scrapy 1.3.3) and using the Scrapy signals spider_opened and spider_closed to verify whether the scrape succeeded or failed.

Part of the extensions.py code handling spider_closed:
import datetime
import logging

from scrapy import signals
from scrapy.exceptions import NotConfigured

from testna import pipelines  # holds the active_scrape record updated below

# FROMADDR and TOADDR are defined elsewhere in the project
logger = logging.getLogger(__name__)


class SendEmail(object):

    def __init__(self):
        self.fromaddr = FROMADDR
        self.toaddr = TOADDR

    @classmethod
    def from_crawler(cls, crawler):
        # first check if the extension should be enabled and raise
        # NotConfigured otherwise
        if not crawler.settings.getbool('MYEXT_ENABLED'):
            raise NotConfigured

        # instantiate the extension object, connect it to the
        # spider_closed signal and hand it back to Scrapy
        ext = cls()
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider, reason):
        logger.info("closed spider %s", spider.name)
        pipelines.active_scrape.scrape_end = datetime.datetime.now()
        if reason == "finished":
            pipelines.active_scrape.scrape_status = "Finished"
            pipelines.active_scrape.scrape_status_reason = reason
        elif reason == "cancelled":
            pipelines.active_scrape.scrape_status = "Failed"
            pipelines.active_scrape.scrape_status_reason = reason
        elif reason == "shutdown":
            pipelines.active_scrape.scrape_status = "Failed"
            pipelines.active_scrape.scrape_status_reason = reason

Settings.py - only the uncommented lines:

BOT_NAME = 'testna'

SPIDER_MODULES = ['testna.spiders']
NEWSPIDER_MODULE = 'testna.spiders'

ROBOTSTXT_OBEY = True

DOWNLOAD_DELAY = 5

ITEM_PIPELINES = {
    'testna.pipelines.TestnaPipeline': 300,
}

MYEXT_ENABLED = True
EXTENSIONS = {
    'scrapy.extensions.telnet.TelnetConsole': None,
    'testna.extensions.SendEmail':500
}

Spider - actual data removed and replaced with xyz:

import scrapy
import re
from decimal import Decimal

class TestnaSpider(scrapy.Spider):
    name = "testna"

    def start_requests(self):
        urls = [
        'http://www.xyz.example',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        xs = response.css("#xyz")
        for x in xs:
            URL = 'http://www.xyz.example'
            ID = x.css("xyz").extract()   # placeholder selectors; .extract() so the
            ID2 = x.css("xyz").extract()  # ids are strings and concatenate onto URL

            for ID, naziv in zip(ID, ID2):
                # bind naziv via a default argument so each request keeps
                # its own value when the callback fires
                yield scrapy.Request(url=URL + ID, callback=lambda response, naziv=naziv: self.parse_x(naziv, response))

    def parse_x(self, id_x, response):

        xs = response.css("xyz") 
        for x in xs:

            desc1 = ''.join(x.css(".xyz::text").extract()) 
            desc2 = ''.join(x.css("xyz::text").extract())

            yield {"description1": desc1, "description2": descr2, "ID": id_x}

        NEXT_PAGE_SELECTOR = ".Pagination-item--next a::attr(href)"
        next_page = response.css(NEXT_PAGE_SELECTOR).extract_first()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=lambda response, id_x=id_x: self.parse_x(id_x, response)
            )

When an underlying network problem occurs and Scrapy can't fetch data (no route to host, DNS lookup failure - I've run into several of these), it retries and eventually closes the spider, and the Scrapy signal reports the spider as finished, without any description of the problem.

Log:

2017-08-01 14:47:31 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.xyz.example> (failed 1 times): An error occurred while connecting: 113: No route to host.
2017-08-01 14:47:31 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.xyz.example> (failed 2 times): An error occurred while connecting: 113: No route to host.
2017-08-01 14:47:37 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.xyz.example> (failed 3 times): An error occurred while connecting: 113: No route to host.
2017-08-01 14:47:37 [scrapy.core.scraper] ERROR: Error downloading <GET http://www.xyz.example>: An error occurred while connecting: 113: No route to host.
2017-08-01 14:47:56 [scrapy.extensions.logstats] INFO: Crawled 4 pages (at 4 pages/min), scraped 49 items (at 49 items/min)
2017-08-01 14:48:56 [scrapy.extensions.logstats] INFO: Crawled 4 pages (at 0 pages/min), scraped 49 items (at 0 items/min)
2017-08-01 14:49:56 [scrapy.extensions.logstats] INFO: Crawled 4 pages (at 0 pages/min), scraped 49 items (at 0 items/min)
2017-08-01 14:50:21 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.xyz.example> (failed 1 times): User timeout caused connection failure: Getting http://www.xyz.example took longer than 180.0 seconds..
2017-08-01 14:50:22 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://www.xyz.example> (failed 2 times): An error occurred while connecting: 113: No route to host.
2017-08-01 14:50:29 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://www.xyz.example> (failed 3 times): An error occurred while connecting: 113: No route to host.
2017-08-01 14:50:29 [scrapy.core.scraper] ERROR: Error downloading <GET http://www.xyz.example>: An error occurred while connecting: 113: No route to host.
2017-08-01 14:50:29 [scrapy.core.engine] INFO: Closing spider (finished)
2017-08-01 14:50:29 [testna.extensions] INFO: closed spider testna

How can I verify that the whole scrape went OK, when network problems like the ones above don't close the spider with any indication of failure?

I've tried all the available signals, but none of them hook into these underlying network problems.

1 Answer:

Answer 0 (score: 0):

Let's go ahead and agree that "113" is, strictly speaking, not a response. After all, it doesn't come from the server/site the initial request was sent to, so native signaling won't do the job here, since we've just agreed it isn't really a response.

An error occurred while connecting: 113: No route to host.

This can be caused by any number of things; the point is that once Twisted tries to resolve the host, the question is whether it can physically "reach" or "touch" it. I'll stick to the question at hand, but if you have a few minutes, read this. Good stuff to know, though you can skip straight to any mention of the DHCP and ARP layers.

How do you debug this thing? My suggestion: go ahead and create a custom logger extension! But you'd mostly get the same information back (given you also enable the depth middleware and tune the log level, and even then, not much more). I'd start from the inside out.
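
To make that concrete, here's a minimal sketch of the custom-logger idea, assuming it gets installed from an extension's from_crawler; the ErrorCollectorHandler name and the choice of the "scrapy" logger are mine, not part of Scrapy or the answer:

import logging

class ErrorCollectorHandler(logging.Handler):
    """Collects ERROR-level log records so spider_closed can inspect them."""

    def __init__(self):
        super(ErrorCollectorHandler, self).__init__(level=logging.ERROR)
        self.errors = []

    def emit(self, record):
        # keep the rendered message, e.g. "Error downloading <GET ...>"
        self.errors.append(record.getMessage())

# installed once, e.g. inside SendEmail.from_crawler:
#   ext.error_handler = ErrorCollectorHandler()
#   logging.getLogger("scrapy").addHandler(ext.error_handler)

In spider_closed you could then treat reason == "finished" plus a non-empty ext.error_handler.errors as a failed scrape.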

So you want to capture this event as it happens in a day-to-day scenario: log it and dump it verbosely to a text file? Send it straight to an endpoint? Both? There's an extension for that!

Whether you log to a file or to stdout, StackTraceDump is probably what you're looking for. In theory, you could create a signal for when the "113" error shows up, which could then trigger a function of your own design.
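
(StackTraceDump ships with Scrapy as scrapy.extensions.debug.StackTraceDump; it dumps process information on SIGQUIT/SIGUSR2 on POSIX systems and has to be added to EXTENSIONS.) As for the "create a signal" idea, Scrapy accepts any object as a custom signal. A sketch - download_failed and FailureSignalMiddleware are my own names, not from the answer, and the middleware would still need an entry in DOWNLOADER_MIDDLEWARES:

from scrapy import signals

# any unique object can serve as a custom Scrapy signal
download_failed = object()

class FailureSignalMiddleware(object):
    """Downloader middleware that broadcasts download exceptions."""

    @classmethod
    def from_crawler(cls, crawler):
        mw = cls()
        mw.crawler = crawler
        return mw

    def process_exception(self, request, exception, spider):
        # tell anyone listening (e.g. the SendEmail extension), then
        # return None so the stock retry middleware proceeds as usual
        self.crawler.signals.send_catch_log(
            signal=download_failed,
            request=request, exception=exception, spider=spider)
        return None

Any component can then subscribe with crawler.signals.connect(handler, signal=download_failed) and, for example, mark the scrape as failed even when the close reason is "finished".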

You asked:

    How can I verify that the whole scrape went OK, when network problems like the ones above don't close the spider with any indication of failure?

Well, beyond an explicit stack trace and a MAX TIME to wait before killing the crawl, I can think of a few cool things to do with "custom signals" built on the stdout of the spider instance.
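
For the MAX TIME part there's no custom code needed: the stock CloseSpider extension can kill a crawl after a fixed time, and it conveniently gives spider_closed a distinct reason:

# settings.py
CLOSESPIDER_TIMEOUT = 3600  # seconds; spider closes with reason "closespider_timeout"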

Since you're already using telnet (note: your settings above actually disable the TelnetConsole, so it would need re-enabling), how about pausing the spider, then issuing a request that checks the HTTP status - hit a 200! - and unpausing and/or pushing some kind of notification.
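
A rough illustration of that telnet session, assuming the TelnetConsole is back on and listening on its default port 6023:

$ telnet localhost 6023
>>> engine.pause()      # freeze the crawl
>>> est()               # print an engine status report
>>> engine.unpause()    # resume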

Wait! What if a local network problem means we can't even send the email about the error? Well, you could still use your LAN to push some kind of notification through an email server or a socket.
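
A sketch of the socket variant; the host and port below are placeholders for whatever listener you run on your LAN:

import socket

def notify_lan(message, host="192.168.1.50", port=9999):
    # push a one-line status over the local network, which still works
    # when the route to the outside world is broken
    s = socket.create_connection((host, port), timeout=5)
    try:
        s.sendall(message.encode("utf-8"))
    finally:
        s.close()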

I know this is a lot of customization and extra development, but as for your question of verifying whether the scrape went well: watch for 113... StackTraceDump, while not a signal, is something you can build a trigger function around if it shows up. If it doesn't show up, I'd call that a successful scrape!
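
The answer stops short of code, but one concrete way to implement that "did 113 ever show up?" check is via the stats Scrapy collects out of the box; a sketch to be called from spider_closed, using the crawler passed to from_crawler:

def scrape_went_ok(crawler):
    # all three counters are maintained by stock Scrapy components
    # (downloader stats, retry middleware, log counter)
    stats = crawler.stats
    return (stats.get_value("downloader/exception_count", 0) == 0
            and stats.get_value("retry/max_reached", 0) == 0
            and stats.get_value("log_count/ERROR", 0) == 0)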