I want to set JOBDIR from the spider's __init__, or dynamically when the spider is invoked. I want to create a different JOBDIR for each spider, the same way FEED_URI is built in the example below:
import scrapy

class QtsSpider(scrapy.Spider):
    name = 'qts'
    custom_settings = {
        'FEED_URI': 'data_files/' + '%(site_name)s.csv',
        'FEED_FORMAT': "csv",
        # 'JOBDIR': 'resume/' + '%(site_name2)s'
    }
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com']

    def __init__(self, **kw):
        super(QtsSpider, self).__init__(**kw)
        self.site_name = kw.get('site_name')

    def parse(self, response):
        pass  # rest of the parsing code
We are invoking the spider this way:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def main_function():
    all_spiders = ['spider1', 'spider2', 'spider3']  # 3 different spiders
    process = CrawlerProcess(get_project_settings())
    for spider_name in all_spiders:
        process.crawl('qts', site_name=spider_name)
    process.start()

main_function()
How can I achieve the same dynamic creation of JOBDIR for different spiders, the way it works for FEED_URI? Any help would be much appreciated.
Answer 0 (score: 0)
The same way you are setting site_name, you can pass another parameter:

process.crawl('qts', site_name=spider_name, jobdir='dirname that you want to keep')

It will be available as a spider attribute, so you can write:
def __init__(self, **kw):
    super(QtsSpider, self).__init__(**kw)  # the base __init__ stores crawl kwargs as attributes
    jobdir = getattr(self, 'jobdir', None)
    if jobdir:
        self.custom_settings['JOBDIR'] = jobdir
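Putting that together with the caller from the question, a minimal sketch (the resume/ directory name is hypothetical):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def main_function():
    all_spiders = ['spider1', 'spider2', 'spider3']
    process = CrawlerProcess(get_project_settings())
    for spider_name in all_spiders:
        # a distinct resume directory per crawl, exposed as spider.jobdir
        process.crawl('qts', site_name=spider_name,
                      jobdir='resume/' + spider_name)
    process.start()

main_function()

One caveat worth verifying on your Scrapy version: custom_settings is merged from the spider class when the crawler is built, before __init__ runs, so check that a JOBDIR assigned this late actually takes effect before relying on it for resumable crawls.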
Answer 1 (score: 0)
I found myself needing the same functionality, mostly because I didn't want to repetitively add a custom JOBDIR to each spider's custom_settings attribute. So I created a simple extension that subclasses the original SpiderState extension Scrapy uses to save a crawl's state.
import os

from scrapy import signals
from scrapy.exceptions import NotConfigured
from scrapy.extensions.spiderstate import SpiderState


class SpiderStateManager(SpiderState):
    """
    SpiderState Purpose: Store and load spider state during a scraping job
    Added Purpose: Create a unique subdirectory within JOBDIR for each spider based on spider.name property
    Reasoning: Reduces repetitive code
    Usage: Instead of needing to add subdirectory paths in each spider.custom_settings dict
    Simply specify the base JOBDIR in settings.py and the subdirectories are automatically managed
    """

    def __init__(self, jobdir=None):
        self.jobdir = jobdir
        super(SpiderStateManager, self).__init__(jobdir=self.jobdir)

    @classmethod
    def from_crawler(cls, crawler):
        base_jobdir = crawler.settings['JOBDIR']
        if not base_jobdir:
            raise NotConfigured
        # One subdirectory per spider, named after the spider class's name attribute
        spider_jobdir = os.path.join(base_jobdir, crawler.spidercls.name)
        if not os.path.exists(spider_jobdir):
            os.makedirs(spider_jobdir)
        obj = cls(spider_jobdir)
        # Hook the same signals the stock SpiderState extension uses to load/persist state
        crawler.signals.connect(obj.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(obj.spider_opened, signal=signals.spider_opened)
        return obj
To enable it, remember to add the proper settings to your settings.py, like so:
EXTENSIONS = {
    # We want to disable the original SpiderState extension and use our own
    "scrapy.extensions.spiderstate.SpiderState": None,
    "spins.extensions.SpiderStateManager": 0,
}
JOBDIR = "C:/Users/CaffeinatedMike/PycharmProjects/ScrapyDapyDoo/jobs"
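Note that "spins.extensions.SpiderStateManager" is the import path within this answer's own project; point it at whichever module you place the class in. As a rough sketch of the outcome (spider names hypothetical), a base JOBDIR of jobs/ ends up holding one state subdirectory per spider:

jobs/
    spider_a/    # this spider's resume state (e.g. requests.seen, spider.state)
    spider_b/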