Changing the CSV input file of spiders when launching them with CrawlerProcess (Scrapy)

Date: 2018-12-12 17:24:16

Tags: python scrapy

I am launching several spiders in parallel with CrawlerProcess.

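from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def main():

    # ----- This part launches all the given spiders ----- #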

    process = CrawlerProcess(get_project_settings())

    process.crawl(FirstSpider)
    process.crawl(SecondSpider)
    process.crawl(ThirdSpider)
    process.crawl(EtcSpider)

    process.start()  # the script will block here until the crawling is finished

All the spiders work from a CSV input file that contains the information to look up on the website. Here is an example:

import csv
import os.path as osp

import scrapy


class FirstSpider(scrapy.Spider):
    name = "first_bot"

    def start_requests(self):
        base_url = "https://example.fr/catalogsearch/result/?q="
        script_dir = osp.dirname(osp.realpath(__file__))
        file_path = osp.join(script_dir, 'files', 'to_collect_firstbot.csv')
        # The with block closes the file once all rows have been read
        with open(file_path, 'r', encoding="utf-8", errors="ignore") as input_file:
            reader = csv.reader(input_file)
            for row in reader:
                if row:  # skip empty lines
                    url = row[0]
                    absolute_url = base_url + url
                    print(absolute_url)
                    yield scrapy.Request(
                        absolute_url,
                        meta={
                            "handle_httpstatus_list": [302, 301, 502],
                        },
                        callback=self.parse
                    )

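It works, but I may have to change the input file name, which is hard-coded in each spider.

Is it possible to keep a default "custom" file in each spider script and then, from the core.py file (which launches all the spiders), override the CSV input file when needed (in which case the file and its name would be the same for all spiders)?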

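2 Answers:

Answer 0: (score: 1)

You can pass arguments to the spider through the crawl call, which I think is what you need to make this work.

Change your code to: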

import csv
import os.path as osp

import scrapy


class FirstSpider(scrapy.Spider):
    name = "first_bot"

    file_name = 'to_collect_firstbot.csv'  # <- default; can be overridden when launching the spider

    def start_requests(self):
        base_url = "https://example.fr/catalogsearch/result/?q="
        script_dir = osp.dirname(osp.realpath(__file__))
        file_path = osp.join(script_dir, 'files', self.file_name)  # use the attribute, not a hard-coded name
        # The with block closes the file once all rows have been read
        with open(file_path, 'r', encoding="utf-8", errors="ignore") as input_file:
            reader = csv.reader(input_file)
            for row in reader:
                if row:  # skip empty lines
                    url = row[0]
                    absolute_url = base_url + url
                    print(absolute_url)
                    yield scrapy.Request(
                        absolute_url,
                        meta={
                            "handle_httpstatus_list": [302, 301, 502],
                        },
                        callback=self.parse
                    )

Now, when launching the spiders, just pass the file names as arguments in the process.crawl calls:

def main():

    # ----- This part launches all the given spiders ----- #

    process = CrawlerProcess(get_project_settings())

    process.crawl(FirstSpider, file_name='custom_file1.csv')
    process.crawl(SecondSpider, file_name='custom_file2.csv')
    process.crawl(ThirdSpider)
    process.crawl(EtcSpider, file_name='custom_file_whatever.csv')

    process.start()  # the script will block here until the crawling is finished

Note that the third call does not set the file_name argument, which means that spider will use the default value defined in its own code, for example:

file_name = 'to_collect_firstbot.csv'
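
For the case raised in the question, where one file name should apply to every spider, the calls above could be wrapped in a small helper in core.py. This is only a sketch: `crawl_all` and its parameters are hypothetical names, and it assumes every spider reads `self.file_name` as in the example above. When `file_name` is left as None, no argument is passed and each spider falls back to its own default.

```python
def crawl_all(process, spider_classes, file_name=None):
    """Schedule every spider on `process`, optionally with a shared CSV name.

    `process` is expected to be a scrapy.crawler.CrawlerProcess; any object
    with a compatible crawl(spider_cls, **kwargs) method works as well.
    (crawl_all is a hypothetical helper, not part of Scrapy.)
    """
    # Only pass file_name when one was given, so that each spider keeps
    # its own class-level default otherwise.
    kwargs = {} if file_name is None else {"file_name": file_name}
    for spider_cls in spider_classes:
        process.crawl(spider_cls, **kwargs)
```

In main() this would be called as crawl_all(process, [FirstSpider, SecondSpider, ThirdSpider, EtcSpider], file_name='custom_file.csv') before process.start().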

Answer 1: (score: 0)

crawl accepts arguments, and you can use them inside your spider's from_crawler.
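
To make the mechanism concrete: in Scrapy, the keyword arguments given to process.crawl are forwarded to the spider's from_crawler class method, which builds the spider with them, and the base Spider.__init__ stores them as instance attributes. The sketch below imitates that flow with a plain-Python stand-in so that it runs without Scrapy. `MiniSpider` is a hypothetical class, not Scrapy's implementation, and the `crawler` argument is only a placeholder.

```python
class MiniSpider:
    """Plain-Python stand-in imitating how a Scrapy spider receives arguments."""

    file_name = "to_collect_firstbot.csv"  # class-level default

    def __init__(self, **kwargs):
        # Like scrapy.Spider.__init__, copy keyword arguments onto the
        # instance; an instance attribute shadows the class-level default.
        self.__dict__.update(kwargs)

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # The arguments arrive here first, so this is the place to inspect
        # or adjust them before the spider object is created.
        spider = cls(*args, **kwargs)
        spider.crawler = crawler
        return spider


default_spider = MiniSpider.from_crawler(crawler=None)
custom_spider = MiniSpider.from_crawler(crawler=None, file_name="custom_file1.csv")

print(default_spider.file_name)  # to_collect_firstbot.csv
print(custom_spider.file_name)   # custom_file1.csv
```

In a real spider, the same override would subclass scrapy.Spider and call super().from_crawler(crawler, *args, **kwargs) instead of building the object by hand.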