Question

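I am working on a Scrapy project that uses multiple spiders. I use separate spiders because each one scrapes completely different content.

I run my four spiders with this code: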

from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess


setting = get_project_settings()
process = CrawlerProcess(setting)

# spider_loader replaces the deprecated `spiders` attribute
for spider_name in process.spider_loader.list():
    print("Running spider %s" % spider_name)
    process.crawl(spider_name, query="dvh")

process.start()

The code works fine, but the start_urls of spiders 2, 3 and 4 are dynamic: I get them from MongoDB.

start_urls = client.db.collection.distinct("urls_scraped_by_previous_spider")

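The main problem is that process.crawl(spider_name, query="dvh") loads all the spiders (and their start_urls) while the database does not yet contain any URLs. As a result, the script ends after the first spider has run.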
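
The timing issue can be reproduced without Scrapy. The sketch below is only an illustration: `fake_db` is a hypothetical stand-in for the MongoDB collection, and `EagerSpider` and `LazySpider` are plain classes, not Scrapy spiders. A class-level `start_urls` assignment is evaluated once, when the class body runs, so it captures whatever the database holds at that moment. A query made inside a method sees the data present when the method is called, which only helps if that call happens after the earlier spider has written its results.

```python
# Hypothetical stand-in for the MongoDB collection; it starts out empty.
fake_db = []


def fetch_urls():
    """Return a snapshot of the URLs currently stored in fake_db."""
    return list(fake_db)


class EagerSpider:
    # Evaluated once, when the class body runs (i.e. at import time).
    start_urls = fetch_urls()


class LazySpider:
    def start_requests(self):
        # Evaluated only when this method is called.
        return fetch_urls()


# Simulate the first spider finishing and writing its results.
fake_db.append("http://example.com/page-1")

eager_urls = EagerSpider.start_urls
lazy_urls = LazySpider().start_requests()
print(eager_urls)  # the snapshot taken before the write
print(lazy_urls)   # includes the URL written afterwards
```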

I have also tried:

def start_requests(self):
    self.start_urls += client.db.collection.distinct("urls_scraped_by_previous_spider")
    for url in self.start_urls:
        yield SplashRequest(url, self.parse,
            endpoint='render.json',
            args={'html': 1,'har':1,'wait': 2.5}
        )
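
One possible direction, sketched here under assumptions rather than tested against this project: because every process.crawl(...) call is scheduled before process.start(), all spiders begin at roughly the same time, so even a start_requests that queries the database may run before the first spider has stored anything. Scrapy's CrawlerRunner returns a Deferred from crawl(), which makes it possible to wait for one spider to finish before starting the next. The helper name `run_spiders_sequentially` is hypothetical, and the sketch assumes the project works with Twisted's default reactor; if the project sets TWISTED_REACTOR, the matching reactor may need to be installed first.

```python
def run_spiders_sequentially(spider_names, **spider_kwargs):
    """Run the named project spiders one after another (hypothetical helper)."""
    # Imported here because importing twisted.internet.reactor installs
    # the default reactor as a side effect.
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from scrapy.utils.project import get_project_settings
    from twisted.internet import defer, reactor

    settings = get_project_settings()
    configure_logging(settings)
    runner = CrawlerRunner(settings)

    @defer.inlineCallbacks
    def crawl():
        try:
            for name in spider_names:
                print("Running spider %s" % name)
                # Wait until this spider has finished before starting the
                # next one, so its results are already in the database.
                yield runner.crawl(name, **spider_kwargs)
        finally:
            # Stop the reactor once every spider is done or one has failed.
            reactor.stop()

    # Start the chain only once the reactor is running.
    reactor.callWhenRunning(crawl)
    reactor.run()  # blocks until reactor.stop() is called
```

It would be called with the spider names in dependency order, for example run_spiders_sequentially(["spider1", "spider2"], query="dvh"), where the names are placeholders. This only helps if each later spider reads its URLs inside start_requests, as in the snippet above, rather than in a class-level start_urls assignment.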

Running multiple spiders using scraped URLs from the database as start_urls

0 answers: