Is it possible to run another spider from within a Scrapy spider?

Time: 2015-08-24 06:49:42

Tags: python scrapy multiprocessing

Right now I have 2 spiders, and what I would like to do is:

  1. Spider 1 goes to url1, and if url2 shows up there, call spider 2 with url2. The content of url1 is also saved through a pipeline.
  2. Spider 2 goes to url2 and does something with it.
  3. Because both spiders are fairly complex, I want to keep them separate.

    Here is what I tried, using scrapy crawl:

    def parse(self, response):
        p = multiprocessing.Process(
            target=self.testfunc())
        p.join()
        p.start()
    
    def testfunc(self):
        settings = get_project_settings()
        crawler = CrawlerRunner(settings)
        crawler.crawl(<spidername>, <arguments>)
    

    It loads the settings but does not crawl:

    2015-08-24 14:13:32 [scrapy] INFO: Enabled extensions: CloseSpider, LogStats, CoreStats, SpiderState
    2015-08-24 14:13:32 [scrapy] INFO: Enabled downloader middlewares: DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, HttpAuthMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2015-08-24 14:13:32 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2015-08-24 14:13:32 [scrapy] INFO: Spider opened
    2015-08-24 14:13:32 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    

    The documentation has an example about launching a crawl from a script, but what I am trying to do is launch another spider while using the scrapy crawl command.

    EDIT: full code

    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.project import get_project_settings
    from twisted.internet import reactor
    from multiprocessing import Process
    import scrapy
    import os
    
    
    def info(title):
        print(title)
        print('module name:', __name__)
        if hasattr(os, 'getppid'):  # only available on Unix
            print('parent process:', os.getppid())
        print('process id:', os.getpid())
    
    
    class TestSpider1(scrapy.Spider):
        name = "test1"
        start_urls = ['http://www.google.com']
    
        def parse(self, response):
            info('parse')
            a = MyClass()
            a.start_work()
    
    
    class MyClass(object):
    
        def start_work(self):
            info('start_work')
            p = Process(target=self.do_work)
            p.start()
            p.join()
    
        def do_work(self):
            info('do_work')
            settings = get_project_settings()
            runner = CrawlerRunner(settings)
            runner.crawl(TestSpider2)
            d = runner.join()
            d.addBoth(lambda _: reactor.stop())
            reactor.run()
            return
    
    class TestSpider2(scrapy.Spider):
    
        name = "test2"
        start_urls = ['http://www.google.com']
    
        def parse(self, response):
            info('testspider2')
            return
    

    What I am hoping for is something like this:

    1. scrapy crawl test1 (for example, when response.status_code is 200)
    2. inside test1, call scrapy crawl test2 (a sketch of one way to do this follows below)
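
    For illustration, one literal way to get that flow (an assumption on my part, not code from the original post) is to shell out to the scrapy CLI from the first spider's callback, so the second crawl runs in its own process with a fresh Twisted reactor instead of fighting the one already running inside test1. The test2 spider and its start_url argument are assumed to exist in the same project:

    import subprocess
    import scrapy


    class TestSpider1(scrapy.Spider):
        name = "test1"
        start_urls = ['http://www.google.com']

        def parse(self, response):
            # Scrapy responses expose .status rather than .status_code
            if response.status == 200:
                # "scrapy crawl test2 -a start_url=..." starts the second spider
                # in a child process; "-a" passes a spider argument that test2
                # can read as self.start_url in its start_requests().
                subprocess.Popen(
                    ['scrapy', 'crawl', 'test2',
                     '-a', 'start_url=' + response.url])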

2 Answers:

Answer 0 (score: 1)

I won't go into depth since this question is really old, but I'll go ahead and drop in this snippet from the official Scrapy docs.... You were very close! lol

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start() # the script will block here until all crawling jobs are finished

https://doc.scrapy.org/en/latest/topics/practices.html

Then, by using callbacks, you can pass items between your spiders and build the kind of logic you are talking about.
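
As a rough sketch of that idea (an illustration, not part of the quoted docs), the two crawls can instead be chained with CrawlerRunner, and whatever the first spider collected can be handed to the second one as a spider argument. The found_urls list and the link selector used here are assumptions:

import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import reactor, defer

found_urls = []  # filled by Spider1, consumed by Spider2


class Spider1(scrapy.Spider):
    name = "spider1"
    start_urls = ['http://www.google.com']

    def parse(self, response):
        # collect the links the second spider should visit
        found_urls.extend(
            response.urljoin(href)
            for href in response.css('a::attr(href)').extract())


class Spider2(scrapy.Spider):
    name = "spider2"

    def parse(self, response):
        self.logger.info('spider2 visited %s', response.url)


@defer.inlineCallbacks
def crawl():
    yield runner.crawl(Spider1)
    # Spider1 has finished at this point, so its results can be passed on;
    # keyword arguments to crawl() become attributes on the spider instance.
    yield runner.crawl(Spider2, start_urls=found_urls)
    reactor.stop()


configure_logging()
runner = CrawlerRunner()
crawl()
reactor.run()

The hand-off here is just a module-level list; an item pipeline or the crawler stats would work equally well. The important part is that the second crawl is not scheduled until the first one's deferred has fired.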

Answer 1 (score: 0)

We should not run a spider from inside another spider. As I understand it, you want to run one spider after the other one finishes, right? If so, use the source code below:

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from datascraper.spiders.file1_spd import Spider1ClassName
from datascraper.spiders.file2_spd import Spider2ClassName
from scrapy.utils.project import get_project_settings


@defer.inlineCallbacks
def crawl():
    yield runner.crawl(Spider1ClassName)
    yield runner.crawl(Spider2ClassName)
    reactor.stop()


configure_logging()
config = get_project_settings()
runner = CrawlerRunner(settings=config)
crawl()
reactor.run() # the script will block here until the last crawl call is finished

You can refer to this here: https://doc.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process