I have set up a Scrapy spider that parses an XML feed, processing around 20,000 records.
For development purposes I would like to limit the number of items processed. From reading the Scrapy documentation I identified that I need to use the CloseSpider extension.
I have followed the guide on how to enable it - in my spider configuration I have the following:
CLOSESPIDER_ITEMCOUNT = 1
EXTENSIONS = {
'scrapy.extensions.closespider.CloseSpider': 500,
}
However, my spider never terminates. I am aware that the CONCURRENT_REQUESTS setting affects when the spider actually finishes (since it will carry on processing each concurrent request), but that is left at the default of 16, and yet my spider goes on to process every item.
I have also tried the CLOSESPIDER_TIMEOUT setting, which likewise has no effect.
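For reference, the development overrides I have been experimenting with look roughly like this (the timeout value is just illustrative):

# settings.py - development-only limits
CLOSESPIDER_ITEMCOUNT = 1    # stop after the first scraped item
CLOSESPIDER_TIMEOUT = 60     # illustrative value; also tried, with no effect
# CONCURRENT_REQUESTS is left at the default of 16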
Here is some debug info from when I run the spider:
2017-06-15 12:14:11 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: myscraper)
2017-06-15 12:14:11 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'myscraper', 'CLOSESPIDER_ITEMCOUNT': 1, 'FEED_URI': 'file:///tmp/myscraper/export.jsonl', 'NEWSPIDER_MODULE': 'myscraper.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['myscraper.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
2017-06-15 12:14:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.closespider.CloseSpider']
2017-06-15 12:14:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-06-15 12:14:11 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-06-15 12:14:11 [scrapy.middleware] INFO: Enabled item pipelines:
['myscraper.pipelines.MyScraperPipeline']
2017-06-15 12:14:11 [scrapy.core.engine] INFO: Spider opened
As can be seen, the CloseSpider extension and the CLOSESPIDER_ITEMCOUNT setting are being applied.
So why is this not working?
Answer 0 (score: 2)
I arrived at a solution with the help of parik's answer, together with my own research. It does have some unexplained behaviour which I will cover below (comments appreciated).
In my spider file, myspider_spider.py, I have the following (edited for brevity):
import scrapy
from scrapy.spiders import XMLFeedSpider
from scrapy.exceptions import CloseSpider
from myspiders.items import MySpiderItem

class MySpiderSpider(XMLFeedSpider):
    name = "myspiders"
    allowed_domains = {"www.mysource.com"}
    start_urls = [
        "https://www.mysource.com/source.xml"
    ]
    iterator = 'iternodes'
    itertag = 'item'
    item_count = 0

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        return cls(settings)

    def __init__(self, settings):
        self.settings = settings

    def parse_node(self, response, node):
        if self.settings['CLOSESPIDER_ITEMCOUNT'] and int(self.settings['CLOSESPIDER_ITEMCOUNT']) == self.item_count:
            raise CloseSpider('CLOSESPIDER_ITEMCOUNT limit reached - ' + str(self.settings['CLOSESPIDER_ITEMCOUNT']))
        else:
            self.item_count += 1
            id = node.xpath('id/text()').extract()
            title = node.xpath('title/text()').extract()
            item = MySpiderItem()
            item['id'] = id
            item['title'] = title
            return item
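As a side note, I believe the base Spider.from_crawler already exposes crawler.settings as self.settings on the spider instance, so the from_crawler/__init__ boilerplate above may not be strictly necessary. A minimal sketch of that variant - an assumption on my part, not something I have verified in this project:

from scrapy.spiders import XMLFeedSpider
from scrapy.exceptions import CloseSpider

class MySpiderSpider(XMLFeedSpider):
    name = "myspiders"
    itertag = 'item'
    item_count = 0

    def parse_node(self, response, node):
        # self.settings is populated for us by the default from_crawler
        limit = self.settings.getint('CLOSESPIDER_ITEMCOUNT')  # 0 when unset
        if limit and self.item_count >= limit:
            raise CloseSpider('CLOSESPIDER_ITEMCOUNT limit reached - %d' % limit)
        self.item_count += 1
        # ... build and return the item as before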
This works - if I set CLOSESPIDER_ITEMCOUNT to 10, it terminates after 10 items have been processed (so in that respect it seems to ignore CONCURRENT_REQUESTS, which was unexpected).
In my settings.py I have the extension commented out:
#EXTENSIONS = {
# 'scrapy.extensions.closespider.CloseSpider': 500,
#}
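As an aside, the cap could presumably also be kept out of the project-wide settings.py by using the spider's custom_settings class attribute - a sketch I have not tried in this project:

class MySpiderSpider(XMLFeedSpider):
    name = "myspiders"
    # per-spider overrides, applied on top of settings.py
    custom_settings = {
        'CLOSESPIDER_ITEMCOUNT': 10,
    }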
So, with the extension entry commented out, it is just the CloseSpider exception doing the work. However, the log shows the following:
2017-06-16 10:04:15 [scrapy.core.engine] INFO: Closing spider (closespider_itemcount)
2017-06-16 10:04:15 [scrapy.extensions.feedexport] INFO: Stored jsonlines feed (10 items) in: file:///tmp/myspiders/export.jsonl
2017-06-16 10:04:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 600,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 8599860,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'closespider_itemcount',
'finish_time': datetime.datetime(2017, 6, 16, 9, 4, 15, 615501),
'item_scraped_count': 10,
'log_count/DEBUG': 8,
'log_count/INFO': 8,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 6, 16, 9, 3, 47, 966791)}
2017-06-16 10:04:15 [scrapy.core.engine] INFO: Spider closed (closespider_itemcount)
The key things to highlight are the first INFO line and the finish_reason - the message shown under INFO is not the message I set when raising the CloseSpider exception. This implies that it is the CloseSpider extension stopping the spider, even though I thought it wasn't being used? Very confusing.
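A possible explanation I have since considered, though not confirmed: the CloseSpider extension ships enabled in Scrapy's default EXTENSIONS_BASE, so commenting out my explicit EXTENSIONS entry does not actually disable it; with CLOSESPIDER_ITEMCOUNT still set, the built-in extension could be what closes the spider and supplies the closespider_itemcount reason. A quick way to inspect the defaults:

from scrapy.settings import default_settings
# the close-spider extension appears here regardless of what is in settings.py
print(default_settings.EXTENSIONS_BASE)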
Answer 1 (score: 1)
You can also use the CloseSpider exception to limit the number of items.
Note that the CloseSpider exception is only supported inside spider callbacks.
You can see this in the documentation:
This exception can be raised from a spider callback to request the spider to be closed/stopped. Supported arguments:
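A minimal sketch of the idea (the spider name and the limit of 10 are just illustrative):

import scrapy
from scrapy.exceptions import CloseSpider

class LimitedSpider(scrapy.Spider):
    name = "limited"
    start_urls = ["http://example.com"]
    item_count = 0

    def parse(self, response):
        # CloseSpider is only honoured when raised inside a callback like this
        self.item_count += 1
        if self.item_count > 10:
            raise CloseSpider('item limit reached')
        yield {'url': response.url}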