This is the Python script I use to call Scrapy, taken from the answer to Scrapy crawl from script always blocks script execution after scraping:
from twisted.internet import reactor
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy.xlib.pydispatch import dispatcher
# MySpider is imported from the project's spiders module

def stop_reactor():
    reactor.stop()

# stop the Twisted reactor once the spider has finished
dispatcher.connect(stop_reactor, signal=signals.spider_closed)
spider = MySpider(start_url='abc')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
log.msg('Running reactor...')
reactor.run()  # the script will block here until the spider is closed
log.msg('Reactor stopped.')
This is my pipelines.py code:
from scrapy import log, signals
from scrapy.contrib.exporter import JsonItemExporter
from scrapy.xlib.pydispatch import dispatcher

class scrapermar11Pipeline(object):

    def __init__(self):
        self.files = {}
        dispatcher.connect(self.spider_opened, signals.spider_opened)
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_opened(self, spider):
        file = open('links_pipelines.json', 'wb')
        self.files[spider] = file
        self.exporter = JsonItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        log.msg('It reached here')
        return item
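For this pipeline to run at all, it has to be enabled in the project's settings.py. A minimal sketch of that entry, assuming the project is called scrapermar11 (a guess based on the pipeline class name; adjust the dotted path to your own project):

# settings.py -- hypothetical project name 'scrapermar11'
# Scrapy versions of this era take a list of dotted paths:
ITEM_PIPELINES = ['scrapermar11.pipelines.scrapermar11Pipeline']

# Newer Scrapy versions expect a dict mapping each pipeline to an order value:
# ITEM_PIPELINES = {'scrapermar11.pipelines.scrapermar11Pipeline': 300}

scrapy crawl reads settings.py automatically; a standalone script only sees this setting if it loads the project settings, which is likely why the file is only written with the first approach below.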
This code was taken from here: Scrapy :: Issues with JSON export
When I run the crawler like this:
scrapy crawl MySpider -a start_url='abc'
a links file is created with the expected output. But when I execute the Python script, no file is created even though the crawler does run, since the dumped Scrapy stats are similar to those of the previous run. I think there is a mistake in the Python script, because the file is created with the first approach. How do I get the script to produce the output file?
Answer 0 (score: 1)
This code worked for me:
from twisted.internet import reactor
from scrapy import signals, log
from scrapy.xlib.pydispatch import dispatcher
from scrapy.conf import settings
from scrapy.http import Request
from multiprocessing.queues import Queue
from scrapy.crawler import CrawlerProcess
from multiprocessing import Process

# import your spider here

def handleSpiderIdle(spider):
    reactor.stop()

mySettings = {'LOG_ENABLED': True,
              'ITEM_PIPELINES': '<name of your project>.pipelines.scrapermar11Pipeline'}
settings.overrides.update(mySettings)

crawlerProcess = CrawlerProcess(settings)
crawlerProcess.install()
crawlerProcess.configure()

spider = <nameofyourspider>(domain="")  # create a spider ourselves
crawlerProcess.crawl(spider)            # add it to spiders pool

dispatcher.connect(handleSpiderIdle, signals.spider_idle)  # use this if you need to handle the idle event (restart spider?)

log.start()  # depends on LOG_ENABLED
print "Starting crawler."
crawlerProcess.start()
print "Crawler stopped."
Answer 1 (score: 0)
The solution that worked for me was to ditch running from a Python script via the internal API and instead switch to the command line plus GNU Parallel for parallelization.
To run all known spiders, one per core:
scrapy list | parallel --line-buffer scrapy crawl
scrapy list outputs one spider per line, which lets us pass each of them as an argument appended to the command handed to GNU Parallel (scrapy crawl). --line-buffer means that output from the individual processes is printed to stdout interleaved, but line by line rather than with quarter/half lines mixed together (for other options look at --group and --ungroup).
NB: obviously this works best on a machine with multiple CPU cores, as by default GNU Parallel runs one job per core. Note that unlike many modern dev machines, the cheap AWS EC2 and DigitalOcean tiers only come with one virtual CPU core. So if you want to run jobs concurrently on one core, you will have to play with GNU Parallel's --jobs argument. For example, to run 2 scrapy crawlers per core:
scrapy list | parallel --jobs 200% --line-buffer scrapy crawl
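If you would rather stay in Python instead of shelling out, newer Scrapy versions can also schedule several spiders inside a single CrawlerProcess (a sketch, assuming Scrapy 1.x+; note this runs everything in one process rather than one job per core, so it is not a drop-in replacement for GNU Parallel):

from scrapy.crawler import CrawlerProcess
from scrapy.spiderloader import SpiderLoader
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
loader = SpiderLoader.from_settings(settings)   # same spider registry that `scrapy list` uses
process = CrawlerProcess(settings)

for name in loader.list():                      # schedule every known spider by name
    process.crawl(name)

process.start()                                 # runs all scheduled crawls and blocks until done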