Scrapy: Combining multiple spiders into a single JSON file

Time: 2018-08-08 04:02:23

Tags: python json scrapy

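I am trying to run a Scrapy project that uses multiple spiders and feeds their results into a single JSON file. I am using CrawlerRunner as shown in the Scrapy documentation, in a file named base.py: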

import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from my_project.spiders.spider1 import Spider1Spider
from my_project.spiders.spider2 import Spider2Spider

settings = get_project_settings()
runner = CrawlerRunner(settings)

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(Spider1Spider)
    yield runner.crawl(Spider2Spider)
    reactor.stop()

crawl()
reactor.run()

In settings.py, I have

FEED_FORMAT = 'json'
FEED_URI = 'filename.json'

But when I run base.py, the JSON file contains the brackets ][ between the results of Spider1 and Spider2, which makes it an invalid JSON file.

Using the Scrapy example, spider1.py

import scrapy

class Spider1Spider(scrapy.Spider):
    name = 'spider1'
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first()
            }

spider2.py

import scrapy

class Spider2Spider(scrapy.Spider):
    name = 'spider2'
    start_urls = ['http://quotes.toscrape.com/page/2/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.css('small.author::text').extract_first()
            }

This gives the result:

[
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d"},
{"text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d"}
][
{"author": "Marilyn Monroe"},
{"author": "J.K. Rowling"}
]

The expected result would be

[
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d"},
{"text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d"},
{"author": "Marilyn Monroe"},
{"author": "J.K. Rowling"}
]

What am I missing? Do I need an item pipeline to get the results of multiple spiders into a single valid JSON file?
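
The ][ most likely appears because each crawl writes its own complete JSON array and the local feed file is appended to rather than overwritten. As a post-processing workaround, the concatenated arrays can be merged after the crawl finishes. The following is only a sketch using the standard library; the function name merge_concatenated_arrays and the sample string are hypothetical and not part of the project above.

```python
import json


def merge_concatenated_arrays(text):
    """Merge back-to-back JSON arrays such as '[...][...]' into one list.

    Hypothetical helper: assumes every top-level value in `text` is a JSON array.
    """
    decoder = json.JSONDecoder()
    merged = []
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so step over it here
        if text[pos].isspace():
            pos += 1
            continue
        # Decode one complete JSON value and get the index where it ended
        chunk, pos = decoder.raw_decode(text, pos)
        merged.extend(chunk)
    return merged


# Made-up sample in the same shape as the broken feed output
sample = '[\n{"text": "a"},\n{"text": "b"}\n][\n{"author": "c"}\n]'
print(json.dumps(merge_concatenated_arrays(sample)))
```

To repair a real feed file, the same function could be applied to the contents of filename.json and the merged list written back with json.dump.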

Edit

Here is a pipeline I have been experimenting with, but it currently produces only half of the results: only the author items

import json

class TestPipeline(object):

    def open_spider(self, spider):
        self.file = open('file_name.json', 'wb')
        self.file.write("[")

    def close_spider(self, spider):
        self.file.write("]")
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(
            dict(item),
            sort_keys=True,
            indent=4,
            separators=(',', ': ')
        ) + ",\n"

        self.file.write(line)
        return item

What am I doing wrong?
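
A likely cause, offered as an assumption rather than a confirmed diagnosis: each spider gets its own pipeline instance, so open_spider reopens file_name.json in write mode and truncates what the first spider wrote, leaving only the author items. The trailing comma written after every item would also leave the file invalid. A minimal sketch that avoids both problems keeps the items in a class-level list shared by all spiders in the process and rewrites the file each time a spider closes. The names CombinedJsonPipeline, output_path and items are hypothetical, and the approach holds every item in memory, so it only suits small crawls.

```python
import json


class CombinedJsonPipeline:
    """Hypothetical pipeline: gather items from every spider into one JSON array."""

    output_path = 'file_name.json'  # same file name as in the question
    items = []  # class-level, so it is shared by the pipeline instance of each spider

    def open_spider(self, spider):
        # Deliberately do not open the file here: opening in write mode
        # once per spider would erase the previous spider's output.
        pass

    def process_item(self, item, spider):
        # Store a plain dict copy so the item can be serialized later
        type(self).items.append(dict(item))
        return item

    def close_spider(self, spider):
        # Rewrite the whole file with everything collected so far,
        # so the file is a single valid JSON array after each spider ends.
        with open(self.output_path, 'w', encoding='utf-8') as f:
            json.dump(type(self).items, f, indent=4, sort_keys=True)
```

The class would be enabled through ITEM_PIPELINES in settings.py; the module path, for example my_project.pipelines.CombinedJsonPipeline, is an assumption about the project layout.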

0 answers:

No answers