I have two spiders, and what I want to do is:
1. Spider 1 goes to url1, and if url2 appears, call spider 2 with url2. It also saves the content of url1 through a pipeline.
2. Spider 2 goes to url2 and does something there.
Because of the complexity of both spiders, I would like to keep them separate.
Here is what I tried while running under scrapy crawl:
def parse(self, response):
    # spawn a separate process so the second crawl runs outside the current reactor
    p = multiprocessing.Process(target=self.testfunc)
    p.start()
    p.join()

def testfunc(self):
    settings = get_project_settings()
    crawler = CrawlerRunner(settings)
    crawler.crawl(<spidername>, <arguments>)
It loads the settings but doesn't crawl:
2015-08-24 14:13:32 [scrapy] INFO: Enabled extensions: CloseSpider, LogStats, CoreStats, SpiderState
2015-08-24 14:13:32 [scrapy] INFO: Enabled downloader middlewares: DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, HttpAuthMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-08-24 14:13:32 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-08-24 14:13:32 [scrapy] INFO: Spider opened
2015-08-24 14:13:32 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
The documentation has an example of launching a crawl from a script, but what I am trying to do is launch another spider while running under the scrapy crawl command.
Edit: full code
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor
from multiprocessing import Process
import scrapy
import os


def info(title):
    print(title)
    print('module name:', __name__)
    if hasattr(os, 'getppid'):  # only available on Unix
        print('parent process:', os.getppid())
    print('process id:', os.getpid())


class TestSpider1(scrapy.Spider):
    name = "test1"
    start_urls = ['http://www.google.com']

    def parse(self, response):
        info('parse')
        a = MyClass()
        a.start_work()


class MyClass(object):

    def start_work(self):
        info('start_work')
        p = Process(target=self.do_work)
        p.start()
        p.join()

    def do_work(self):
        info('do_work')
        settings = get_project_settings()
        runner = CrawlerRunner(settings)
        runner.crawl(TestSpider2)
        d = runner.join()
        d.addBoth(lambda _: reactor.stop())
        reactor.run()
        return


class TestSpider2(scrapy.Spider):
    name = "test2"
    start_urls = ['http://www.google.com']

    def parse(self, response):
        info('testspider2')
        return
What I was hoping for is something like:
scrapy crawl test2
Answer 0 (score: 1)
I won't go into much depth since this question is really old, but I'll go ahead and drop in this snippet from the official Scrapy docs... you were very close!
import scrapy
from scrapy.crawler import CrawlerProcess


class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...


class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...


process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()  # the script will block here until all crawling jobs are finished
https://doc.scrapy.org/en/latest/topics/practices.html
Then, using callbacks, you can pass items between your spiders and implement the logic you are talking about; a rough sketch of that idea follows below.
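To make the "pass data between spiders" part concrete, here is a minimal sketch (my illustration, not part of the quoted docs): process.crawl() forwards extra keyword arguments to the spider's __init__, so whatever spider 1 produced (or any other source) can be handed to spider 2 that way. MySpider2, the target_urls argument, and the example.com URL are all placeholder assumptions.

import scrapy
from scrapy.crawler import CrawlerProcess


class MySpider2(scrapy.Spider):
    # Hypothetical second spider: receives the URLs to visit as a spider argument.
    name = "spider2"

    def __init__(self, target_urls=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # target_urls is an assumed argument name, not a Scrapy built-in.
        self.start_urls = target_urls or []

    def parse(self, response):
        yield {"visited": response.url}


process = CrawlerProcess()
# Keyword arguments after the spider class end up in the spider's __init__.
process.crawl(MySpider2, target_urls=["http://example.com/url2"])  # placeholder URL
process.start()  # blocks here until the crawl finishes

Keep in mind that when several process.crawl() calls are queued before process.start(), the spiders run concurrently in the same reactor; if the second spider depends on the first one's output, the sequential pattern in the next answer is the more natural fit.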
Answer 1 (score: 0)
We should not run a spider from inside another spider. As I understand it, you want to run one spider after the other spider finishes, right? If so, use the source code below:
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from datascraper.spiders.file1_spd import Spider1ClassName
from datascraper.spiders.file2_spd import Spider2ClassName
from scrapy.utils.project import get_project_settings


@defer.inlineCallbacks
def crawl():
    yield runner.crawl(Spider1ClassName)
    yield runner.crawl(Spider2ClassName)
    reactor.stop()


configure_logging()
config = get_project_settings()
runner = CrawlerRunner(settings=config)
crawl()
reactor.run()  # the script will block here until the last crawl call is finished
You can refer to: https://doc.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process
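If the second spider needs the URLs that the first one discovered (the url2 case from the question), the same sequential pattern can forward them as spider arguments, since runner.crawl() passes extra keyword arguments to the spider's constructor. The sketch below is my illustration of that idea, not the answer author's code: the sink and found_urls arguments are assumptions, and the spiders are assumed to cooperate (spider 1 appends interesting links to sink, spider 2 turns found_urls into its start_urls).

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

# Same placeholder spider classes as in the answer above.
from datascraper.spiders.file1_spd import Spider1ClassName
from datascraper.spiders.file2_spd import Spider2ClassName


@defer.inlineCallbacks
def crawl():
    found = []  # spider 1 is assumed to append the url2 links it finds here
    # Keyword arguments after the spider class are forwarded to its __init__.
    yield runner.crawl(Spider1ClassName, sink=found)
    # Spider 2 is assumed to use found_urls as its start_urls.
    yield runner.crawl(Spider2ClassName, found_urls=found)
    reactor.stop()


configure_logging()
runner = CrawlerRunner(settings=get_project_settings())
crawl()
reactor.run()  # blocks until both crawls have finished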