I want to scrape a website that has two sections, and my script is not as fast as I would like.
Is it possible to launch two spiders, one to scrape the first section and a second one for the second section?
I tried writing two different spider classes and running them separately:
scrapy crawl firstSpider
scrapy crawl secondSpider
but that doesn't seem like a smart approach.
I read the documentation of scrapyd, but I'm not sure whether it would help in my case.
Answer 0 (score: 7)
Answer 1 (score: 4)
Alternatively, you can run them like this. You need to save this code in the same directory as scrapy.cfg (my Scrapy version is 1.3.3):
from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

setting = get_project_settings()
process = CrawlerProcess(setting)

for spider_name in process.spiders.list():
    print("Running spider %s" % (spider_name))
    # "query" is a custom argument forwarded to your spider
    process.crawl(spider_name, query="dvh")
process.start()
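The query="dvh" keyword above is forwarded by process.crawl() to the spider's constructor; scrapy.Spider's default __init__ stores extra keyword arguments as instance attributes. A minimal plain-Python illustration of that mechanism (FakeSpider is a hypothetical stand-in, no Scrapy required):

```python
class FakeSpider:
    # Hypothetical stand-in mimicking how scrapy.Spider.__init__
    # turns extra keyword arguments into instance attributes.
    name = "firstSpider"

    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

spider = FakeSpider(query="dvh")
print(spider.query)  # -> dvh
```

In a real spider you can then read self.query inside start_requests() or parse() to vary what gets crawled.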
Answer 2 (score: 2)
A better solution (if you have many spiders) is to discover the spiders dynamically and run them all.
from scrapy import spiderloader
from scrapy.crawler import CrawlerRunner
from scrapy.utils import project
from twisted.internet import reactor
from twisted.internet.defer import inlineCallbacks

settings = project.get_project_settings()
runner = CrawlerRunner(settings)

@inlineCallbacks
def crawl():
    spider_loader = spiderloader.SpiderLoader.from_settings(settings)
    spiders = spider_loader.list()
    classes = [spider_loader.load(name) for name in spiders]
    # Run the spiders one after another, then stop the reactor
    for my_spider in classes:
        yield runner.crawl(my_spider)
    reactor.stop()

crawl()
reactor.run()
(Second solution:)
Because spiders.list() is deprecated in Scrapy 1.4, Yuda's solution should be converted to something like:
from scrapy import spiderloader
from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

settings = get_project_settings()
process = CrawlerProcess(settings)
spider_loader = spiderloader.SpiderLoader.from_settings(settings)

for spider_name in spider_loader.list():
    print("Running spider %s" % (spider_name))
    process.crawl(spider_name)
process.start()
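To make the dynamic-discovery idea concrete: SpiderLoader essentially maintains a mapping from spider names to spider classes, which list() and load() expose. A minimal plain-Python stand-in of that behavior (hypothetical spider classes, no Scrapy required):

```python
class MiniSpiderLoader:
    # Sketch of SpiderLoader's core idea: a name -> class registry.
    def __init__(self, spider_classes):
        self._spiders = {cls.name: cls for cls in spider_classes}

    def list(self):
        return sorted(self._spiders)

    def load(self, name):
        return self._spiders[name]

class FirstSpider:   # hypothetical
    name = "firstSpider"

class SecondSpider:  # hypothetical
    name = "secondSpider"

loader = MiniSpiderLoader([FirstSpider, SecondSpider])
print(loader.list())  # -> ['firstSpider', 'secondSpider']
```

The real SpiderLoader builds this registry by walking the modules named in the project's SPIDER_MODULES setting, which is why the script must sit next to scrapy.cfg.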