Question

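I'm trying to run a Scrapy project that uses multiple spiders and feeds their results into a single JSON file. I'm using CrawlerRunner as shown in the Scrapy docs, in a file named base.py: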

import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from my_project.spiders.spider1 import Spider1Spider
from my_project.spiders.spider2 import Spider2Spider

settings = get_project_settings()
runner = CrawlerRunner(settings)

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(Spider1Spider)
    yield runner.crawl(Spider2Spider)
    reactor.stop()

crawl()
reactor.run()

In settings.py I have:

FEED_FORMAT = 'json'
FEED_URI = 'filename.json'

But when I run base.py, the JSON file contains the brackets ][ between the results of Spider1 and Spider2, which makes it an invalid JSON file.

Using the Scrapy example, spider1.py:

import scrapy

class Spider1Spider(scrapy.Spider):
    name = 'spider1'
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first()
            }

and spider2.py:

import scrapy

class Spider2Spider(scrapy.Spider):
    name = 'spider2'
    start_urls = ['http://quotes.toscrape.com/page/2/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.css('small.author::text').extract_first()
            }

gives this result:

[
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d"},
{"text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d"}
][
{"author": "Marilyn Monroe"},
{"author": "J.K. Rowling"}
]

The expected result would be:

[
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d"},
{"text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d"},
{"author": "Marilyn Monroe"},
{"author": "J.K. Rowling"}
]
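
To make the difference concrete: the actual output is two complete arrays written back to back, which a JSON parser rejects, while the expected form is one array. A small stdlib-only check, where the two strings are shortened stand-ins for the files above (not real Scrapy output) and `is_valid_json` is a helper name made up for this illustration:

```python
import json

# Shortened stand-ins for the two files shown above (assumed sample data).
actual = '[{"text": "a"}][{"author": "b"}]'
expected = '[{"text": "a"}, {"author": "b"}]'


def is_valid_json(text):
    """Return True if the whole string parses as a single JSON document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        # Two arrays in a row fail here because of the extra data
        # after the first closing bracket.
        return False


print(is_valid_json(actual))
print(is_valid_json(expected))
```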

What am I missing? Do I need an item pipeline to get the results of multiple spiders into one valid JSON file?

Edit

Here is a pipeline I've been experimenting with, but it currently produces only half of the results: only the author items.

import json

class TestPipeline(object):

    def open_spider(self, spider):
        self.file = open('file_name.json', 'wb')
        self.file.write("[")

    def close_spider(self, spider):
        self.file.write("]")
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(
            dict(item),
            sort_keys=True,
            indent=4,
            separators=(',', ': ')
        ) + ",\n"

        self.file.write(line)
        return item

What am I doing wrong?
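
A possible direction, offered as a sketch rather than a verified fix: open_spider runs once per spider, and opening the file with mode 'wb' truncates it, so when the second spider starts, the first spider's output is wiped. That would explain why only the author items survive. Two further issues in the pipeline above: on Python 3, writing a str to a file opened in binary mode raises a TypeError, and the trailing comma after the last item leaves the array invalid. The version below assumes both spiders run in the same process (as they do with CrawlerRunner in base.py) and that all items fit in memory. The class name `CombinedJsonPipeline` and the `output_path` attribute are names invented for this example.

```python
import json


class CombinedJsonPipeline:
    """Collect items from every spider in this process into one JSON array."""

    # Assumed output location; shared by all pipeline instances.
    output_path = 'file_name.json'
    # Class-level list: each spider gets its own pipeline instance,
    # but they all append to this same buffer.
    items = []

    def process_item(self, item, spider):
        # dict(item) works for plain dicts and for scrapy Item objects.
        self.items.append(dict(item))
        return item

    def close_spider(self, spider):
        # Rewrite the whole file every time a spider finishes. The file is
        # then a single valid JSON array no matter which spider closes last.
        with open(self.output_path, 'w', encoding='utf-8') as f:
            json.dump(self.items, f, indent=4, sort_keys=True)
```

To try it, the pipeline would be enabled through ITEM_PIPELINES in settings.py, for example {'my_project.pipelines.CombinedJsonPipeline': 300}, where the module path is a guess at this project's layout. FEED_FORMAT and FEED_URI would be removed so the feed exporter does not write a second file. The trade-off is memory: every item stays in the list until the process exits.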

Scrapy: combining multiple spiders into a single JSON file
