Question

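I'm a beginner with both Python and Scrapy. I just created a Scrapy project containing multiple spiders, and when I run "scrapy crawl .." it only runs the first spider.

How can I run all the spiders in the same process?

Thanks in advance.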

Answer 1

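Each spider in your files has a name, set with name="yourspidername". When you invoke it with scrapy crawl yourspidername, only that spider is crawled. To run another spider, you have to run the command again: scrapy crawl yourotherspidername.

Another way is to mention all the spiders in the same command, e.g. scrapy crawl yourspidername,yourotherspidername,etc.. (this method is not supported in newer versions of Scrapy)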

Answer 2

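Everybody, even the docs, suggests using the internal API to write a "run script" that controls the start and stop of several spiders. However, this comes with a lot of caveats unless you get it absolutely right (feed exports not working, the Twisted reactor either not stopping or stopping too soon, etc.).

In my opinion, we already have a known working and supported scrapy crawl x command, so a much simpler way is to use GNU Parallel to parallelize it.

After installing it, run the following (from a shell) to start one Scrapy spider per core, assuming you want to run all the spiders in your project: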

scrapy list | parallel --line-buffer scrapy crawl

If you only have one core, you can play around with GNU Parallel's --jobs argument. For example, the following runs 2 scrapy jobs per core:

scrapy list | parallel --jobs 200% --line-buffer scrapy crawl
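
If GNU Parallel isn't available, roughly the same idea can be sketched in Python with only the standard library: read the spider names from scrapy list and start one scrapy crawl subprocess per spider. This is a minimal sketch, not a drop-in replacement for Parallel. The helper names (build_command, parse_spider_names, list_spiders, run_all) and the max_workers default are my own assumptions, and it assumes the scrapy command is on your PATH and that you run it from the project directory.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(spider_name):
    """Return the argv list for crawling one spider via the supported CLI."""
    return ["scrapy", "crawl", spider_name]


def parse_spider_names(listing):
    """Turn the text output of `scrapy list` into a list of spider names."""
    return [line.strip() for line in listing.splitlines() if line.strip()]


def list_spiders():
    """Run `scrapy list` in the current project directory."""
    result = subprocess.run(
        ["scrapy", "list"], capture_output=True, text=True, check=True
    )
    return parse_spider_names(result.stdout)


def run_all(max_workers=2):
    """Crawl every spider, at most `max_workers` subprocesses at a time.

    Returns a dict mapping each spider name to its process exit code.
    """
    names = list_spiders()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each worker thread only waits on its own `scrapy crawl` process.
        codes = pool.map(
            lambda name: subprocess.run(build_command(name)).returncode, names
        )
        return dict(zip(names, codes))
```

Calling run_all() from the project directory should then behave much like the --jobs variant above, with max_workers playing the role of the job limit. Every spider still gets its own process, so the caveats about the internal API don't apply.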

Answer 3

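By default, Scrapy runs a single spider per process when you run scrapy crawl. However, Scrapy supports running multiple spiders per process using the internal API.

For more information, see: https://docs.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process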

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start() # the script will block here until all crawling jobs are finished
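
If your spiders live in a regular Scrapy project rather than in the script itself, a variation is to let the project's spider loader list them, so you don't have to import every class by hand. This is a minimal sketch, assuming the script is run from the project directory (where scrapy.cfg lives) and a reasonably recent Scrapy version. crawl_all and main are hypothetical names I picked for illustration.

```python
def crawl_all(process, spider_names):
    """Schedule every named spider on one process, then run them together."""
    for name in spider_names:
        # crawl() also accepts the name of a spider inside the project.
        process.crawl(name)
    # Blocks here until all scheduled crawls have finished.
    process.start()


def main():
    # Imported here so the helper above can be exercised without Scrapy.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    # Load the project's settings so pipelines, feeds, etc. are applied.
    process = CrawlerProcess(get_project_settings())
    crawl_all(process, process.spider_loader.list())
```

Call main() once per run. The Twisted reactor cannot be restarted within the same Python process, so process.start() should only happen once.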

How to run multiple spiders in the same process in Scrapy