Scrapy spider not terminating with the CloseSpider extension

Time: 2017-06-15 11:19:38

Tags: python python-3.x scrapy scrapy-spider

I have set up a Scrapy spider that parses an XML feed, processing around 20,000 records.

For development purposes I want to limit the number of items processed. From reading the Scrapy documentation, I found that I need to use the CloseSpider extension.

I have followed the guide on how to enable it - in my spider configuration I have the following:

CLOSESPIDER_ITEMCOUNT = 1
EXTENSIONS = {
    'scrapy.extensions.closespider.CloseSpider': 500,
}

However, my spider never terminates. I know the CONCURRENT_REQUESTS setting affects when the spider actually stops (since it carries on processing each concurrent request), but that is left at its default of 16, yet my spider keeps on processing all of the items.

I have tried using the CLOSESPIDER_TIMEOUT setting instead, but it likewise has no effect.
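
For reference, the same limits can also be declared per spider rather than project-wide, via the custom_settings class attribute. A minimal sketch with placeholder spider name, URL and values (not my actual project):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://www.example.com/feed.xml"]  # placeholder URL

    # Development-only limits: close after 1 item or 60 seconds, whichever happens first
    custom_settings = {
        'CLOSESPIDER_ITEMCOUNT': 1,
        'CLOSESPIDER_TIMEOUT': 60,
    }

    def parse(self, response):
        yield {'url': response.url}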

Here is some debug info from when I run the spider:

2017-06-15 12:14:11 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: myscraper)
2017-06-15 12:14:11 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'myscraper', 'CLOSESPIDER_ITEMCOUNT': 1, 'FEED_URI': 'file:///tmp/myscraper/export.jsonl', 'NEWSPIDER_MODULE': 'myscraper.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['myscraper.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
2017-06-15 12:14:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.closespider.CloseSpider']
2017-06-15 12:14:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-06-15 12:14:11 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-06-15 12:14:11 [scrapy.middleware] INFO: Enabled item pipelines:
['myscraper.pipelines.MyScraperPipeline']
2017-06-15 12:14:11 [scrapy.core.engine] INFO: Spider opened

As can be seen, the CloseSpider extension and the CLOSESPIDER_ITEMCOUNT setting are being applied.

Why is this not working?

2 answers:

Answer 0 (score: 2):

I came up with a solution with the help of parik's answer, along with my own research. It does have some behaviour I cannot explain, which I cover below (comments appreciated).

In my spider's myspider_spider.py file I have the following (edited for brevity):

import scrapy
from scrapy.spiders import XMLFeedSpider
from scrapy.exceptions import CloseSpider
from myspiders.items import MySpiderItem

class MySpiderSpider(XMLFeedSpider):
    name = "myspiders"
    allowed_domains = ["www.mysource.com"]
    start_urls = [
        "https://www.mysource.com/source.xml"
        ]
    iterator = 'iternodes'
    itertag = 'item'
    item_count = 0

    @classmethod
    def from_crawler(cls, crawler):
        # Hand the project settings to the spider so the item limit can be read
        settings = crawler.settings
        return cls(settings)

    def __init__(self, settings):
        self.settings = settings

    def parse_node(self, response, node):
        # Stop the crawl once CLOSESPIDER_ITEMCOUNT items have been scraped
        if(self.settings['CLOSESPIDER_ITEMCOUNT'] and int(self.settings['CLOSESPIDER_ITEMCOUNT']) == self.item_count):
            raise CloseSpider('CLOSESPIDER_ITEMCOUNT limit reached - ' + str(self.settings['CLOSESPIDER_ITEMCOUNT']))
        else:
            self.item_count += 1
        id = node.xpath('id/text()').extract()
        title = node.xpath('title/text()').extract()
        item = MySpiderItem()
        item['id'] = id
        item['title'] = title

        return item

This does the trick - if I set CLOSESPIDER_ITEMCOUNT to 10, it terminates after 10 items have been processed (so in that respect it seems to ignore CONCURRENT_REQUESTS, which was unexpected).
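
As a usage note, the limit can also be overridden per run rather than hard-coded (the command-line equivalent being scrapy crawl myspiders -s CLOSESPIDER_ITEMCOUNT=10). A minimal sketch, assuming the "myspiders" spider above and a standard project layout, so an illustration rather than my exact setup:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Load the project settings and override the item limit for this run only
settings = get_project_settings()
settings.set('CLOSESPIDER_ITEMCOUNT', 10)

process = CrawlerProcess(settings)
process.crawl('myspiders')  # spider name as defined in the class above
process.start()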

I commented this out in settings.py:

#EXTENSIONS = {
#   'scrapy.extensions.closespider.CloseSpider': 500,
#}

So it is only using the CloseSpider exception. However, the log shows the following:

2017-06-16 10:04:15 [scrapy.core.engine] INFO: Closing spider (closespider_itemcount)
2017-06-16 10:04:15 [scrapy.extensions.feedexport] INFO: Stored jsonlines feed (10 items) in: file:///tmp/myspiders/export.jsonl
2017-06-16 10:04:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 600,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 8599860,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'closespider_itemcount',
 'finish_time': datetime.datetime(2017, 6, 16, 9, 4, 15, 615501),
 'item_scraped_count': 10,
 'log_count/DEBUG': 8,
 'log_count/INFO': 8,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 6, 16, 9, 3, 47, 966791)}
2017-06-16 10:04:15 [scrapy.core.engine] INFO: Spider closed (closespider_itemcount)

The key things to highlight are the first INFO line and the finish_reason - the message shown under INFO is not the message I set when raising the CloseSpider exception. That implies the CloseSpider extension is what stopped the spider, but I know it isn't? Very confusing.

Answer 1 (score: 1):

You can also use the CloseSpider exception to limit the number of items.

Note that the CloseSpider exception is only supported in spider callbacks, as you can see in the documentation:

This exception can be raised from a spider callback to request the spider to be closed/stopped. Supported arguments: reason (str) - the reason for closing.

You can also see some examples.
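
For illustration, a minimal self-contained sketch along those lines (the spider name, URL and selectors are placeholders):

import scrapy
from scrapy.exceptions import CloseSpider

class LimitedSpider(scrapy.Spider):
    name = "limited"
    start_urls = ["https://www.example.com/items"]  # placeholder URL
    max_items = 10
    item_count = 0

    def parse(self, response):
        for row in response.css("div.item"):  # placeholder selector
            if self.item_count >= self.max_items:
                # CloseSpider is honoured here because parse() is a spider callback
                raise CloseSpider('item limit reached')
            self.item_count += 1
            yield {'title': row.css('span.title::text').extract_first()}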