How do I run multiple Scrapy spiders, each crawling a different URL?

Asked: 2018-07-04 22:19:51

Tags: python scrapy

I have a spiders.py in my Scrapy project that contains the following spiders...

import scrapy

class OneSpider(scrapy.Spider):
    name = "s1"

    def start_requests(self):
        urls = ["http://url1.com"]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Scrape stuff and put it in a dict (placeholder fields shown here)
        dictOfScrapedStuff = {"url": response.url}
        yield dictOfScrapedStuff

class TwoSpider(scrapy.Spider):
    name = "s2"

    def start_requests(self):
        urls = ["http://url2.com"]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Scrape stuff and put it in a dict (placeholder fields shown here)
        dictOfScrapedStuff = {"url": response.url}
        yield dictOfScrapedStuff

How do I run spiders s1 and s2, and write their scraped results to s1.json and s2.json respectively?

1 Answer:

Answer (score: 1)

The scrapy crawl command runs a single spider per process, so the simplest approach is to run two separate processes:

scrapy crawl s1 -o s1.json
scrapy crawl s2 -o s2.json
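
One caveat worth noting: -o appends to an existing output file, so delete any previous s1.json / s2.json before re-running, otherwise the appended JSON output will not be a valid file.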

If you want to do this from the same terminal window, you can:

  • run the first spider -> suspend it with ctrl+z and resume it in the background with bg -> run the second spider
  • use nohup, for example:

    nohup scrapy crawl s1 -o s1.json --logfile s1.log &
    
  • use the screen command:

    $ screen
    $ scrapy crawl s1 -o s1.json
    # press ctrl+a ctrl+d to detach the screen session
    $ screen
    $ scrapy crawl s2 -o s2.json
    # press ctrl+a ctrl+d to detach the screen session
    $ screen -r  # reattach to one of your sessions to see how a spider is doing
    

Personally I prefer the nohup or screen options, as they are clean and don't clutter your terminal with logging and other output.
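
If you run these crawls regularly, the two nohup invocations can be wrapped in a small shell script. The sketch below is only illustrative (the script name run_both.sh and the log file names are examples, not part of the original answer):

    #!/usr/bin/env bash
    # run_both.sh -- illustrative helper: launch both spiders in the
    # background and wait for both crawls to finish.
    nohup scrapy crawl s1 -o s1.json --logfile s1.log &
    nohup scrapy crawl s2 -o s2.json --logfile s2.log &
    wait  # block until both background jobs have exited

Because both spiders run as independent background processes, they crawl in parallel, and each one still writes its own JSON feed and log file.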