How to dynamically create the JOBDIR setting in a Scrapy Spider?

Date: 2018-09-07 09:17:00

Tags: web-scraping scrapy scrapy-settings

I want to create the JOBDIR setting from the spider's __init__, or have it created dynamically when the spider is invoked. I want a different JOBDIR for each spider, like FEED_URI in the example below:

    import scrapy

    class QtsSpider(scrapy.Spider):
        name = 'qts'
        custom_settings = {
            'FEED_URI': 'data_files/' + '%(site_name)s.csv',
            'FEED_FORMAT': "csv",
            # 'JOBDIR': 'resume/' + '%(site_name2)s'
        }
        allowed_domains = ['quotes.toscrape.com']
        start_urls = ['http://quotes.toscrape.com']

        def __init__(self, **kw):
            super(QtsSpider, self).__init__(**kw)
            self.site_name = kw.get('site_name')

        def parse(self, response):
            # rest of our parsing code
            pass

We are invoking the script this way:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def main_function():
    all_spiders = ['spider1','spider2','spider3'] # 3 different spiders
    process = CrawlerProcess(get_project_settings())
    for spider_name in all_spiders:
        process.crawl('qts', site_name=spider_name)

    process.start()

main_function()

How can I achieve dynamic creation of JOBDIR for different spiders, the way FEED_URI is handled above? Any help would be greatly appreciated.

2 answers:

Answer 0 (score: 0)

The same way you are setting site_name, you can pass another argument:

process.crawl('qts', site_name=spider_name, jobdir='dirname that you want to keep')

This will be available as a spider attribute, so you can write:

def __init__(self, *args, **kwargs):
    super(QtsSpider, self).__init__(*args, **kwargs)
    # keyword arguments passed to process.crawl() become spider attributes
    jobdir = getattr(self, 'jobdir', None)

    if jobdir:
        self.custom_settings['JOBDIR'] = jobdir
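
One caveat: Scrapy copies custom_settings into the crawler's settings when the Crawler object is built, which happens before the spider's __init__ runs, so the assignment above may arrive too late to actually affect JOBDIR for that crawl. A workaround sketch is to give each crawl its own settings copy instead; the myproject import path and the resume/ directory name are illustrative:

from scrapy.crawler import Crawler, CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders.qts import QtsSpider  # hypothetical import path


def main_function():
    all_spiders = ['spider1', 'spider2', 'spider3']
    process = CrawlerProcess(get_project_settings())
    for spider_name in all_spiders:
        # Each crawl gets its own settings object, with JOBDIR set
        # before the Crawler reads it
        settings = get_project_settings().copy()
        settings.set('JOBDIR', 'resume/' + spider_name)
        process.crawl(Crawler(QtsSpider, settings), site_name=spider_name)
    process.start()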

Answer 1 (score: 0)

I found myself needing the same functionality, mostly because I did not want to repeatedly add a custom JOBDIR to each spider's custom_settings attribute. So I created a simple extension that subclasses the original SpiderState extension Scrapy uses to save the state of a crawl.

from scrapy import signals
from scrapy.exceptions import NotConfigured
from scrapy.extensions.spiderstate import SpiderState
import os


class SpiderStateManager(SpiderState):
    """
    SpiderState Purpose: Store and load spider state during a scraping job
    Added Purpose: Create a unique subdirectory within JOBDIR for each spider based on spider.name property
    Reasoning: Reduces repetitive code
    Usage: Instead of needing to add subdirectory paths in each spider.custom_settings dict
        Simply specify the base JOBDIR in settings.py and the subdirectories are automatically managed
    """

    def __init__(self, jobdir=None):
        self.jobdir = jobdir
        super(SpiderStateManager, self).__init__(jobdir=self.jobdir)

    @classmethod
    def from_crawler(cls, crawler):
        base_jobdir = crawler.settings['JOBDIR']
        if not base_jobdir:
            raise NotConfigured
        # Extensions are instantiated before the spider, so the spider
        # class's name attribute is used to build the subdirectory
        spider_jobdir = os.path.join(base_jobdir, crawler.spidercls.name)
        if not os.path.exists(spider_jobdir):
            os.makedirs(spider_jobdir)

        obj = cls(spider_jobdir)
        # Reconnect the same signals the stock SpiderState extension uses
        crawler.signals.connect(obj.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(obj.spider_opened, signal=signals.spider_opened)
        return obj

To enable it, remember to add the proper settings to your settings.py, like so:

EXTENSIONS = {
    # We want to disable the original SpiderState extension and use our own
    "scrapy.extensions.spiderstate.SpiderState": None,
    "spins.extensions.SpiderStateManager": 0
}
JOBDIR = "C:/Users/CaffeinatedMike/PycharmProjects/ScrapyDapyDoo/jobs"