Question

I am working on a Scrapy script that should produce output like the following:

{
  "state": "FL",
  "date": "2017-11-03T14:52:26.007Z",
  "games": [
    {
      "name":"Game1"
    },
    {
      "name":"Game2"
    }
  ]
}

However, when I run scrapy crawl items -o data.json -t json, it produces the output below instead, with the state repeated:

[
{"state": "CA", "games": [], "crawlDate": "2014-10-04"},
{"state": "CA", "games": [], "crawlDate": "2014-10-04"},
]

The code is as follows:

import scrapy

items.py

class Item(scrapy.Item):
    state = scrapy.Field()
    games = scrapy.Field()

In the spider file, the Item class is used like this:

item = Item()
item['state'] = state
item['Date'] = '2014-10-04'
item['games'] = games

I know this is not the complete code, but it should give you an idea of what I am trying to do.

Answer 1

Ref. https://stackoverflow.com/a/43698923/8964297

You could try writing your own pipeline:

Put this into your pipelines.py file:

import json


class JsonWriterPipeline:
    def open_spider(self, spider):
        # Your scraped items will be saved in the file 'scraped_items.json'.
        # You can change the filename to whatever you want.
        self.file = open('scraped_items.json', 'w')
        self.file.write("[\n")
        self.first_item = True

    def close_spider(self, spider):
        self.file.write("\n]")
        self.file.close()

    def process_item(self, item, spider):
        # Write the separator before every item except the first, so that
        # no trailing comma is left before "]" and the file stays valid JSON.
        if not self.first_item:
            self.file.write(",\n")
        self.first_item = False
        line = json.dumps(
            dict(item),
            indent=4,
            sort_keys=True,
            separators=(',', ': ')
        )
        self.file.write(line)
        return item

Then modify your settings.py to include the following:

ITEM_PIPELINES = {
    'YourSpiderName.pipelines.JsonWriterPipeline': 300,
}

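Change YourSpiderName to the name of your Scrapy project, that is, the package that contains pipelines.py.

Note that the file is written directly by the pipeline, so you do not have to specify the file and format with the -o and -t command-line arguments.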
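
One thing to watch with a hand-written JSON array is the separator between items: a comma after the last item makes the file invalid JSON. The sketch below is a way to check this without starting a crawl, by calling the pipeline hooks by hand and parsing the result. JsonArrayPipeline, its path argument, and run_by_hand are hypothetical names used only for this check; Scrapy itself would not pass a path to the constructor.

```python
import json
import os
import tempfile


class JsonArrayPipeline:
    """Hypothetical variant of the pipeline that takes the output path as an argument."""

    def __init__(self, path):
        self.path = path

    def open_spider(self, spider):
        self.file = open(self.path, "w")
        self.file.write("[\n")
        self.first_item = True

    def close_spider(self, spider):
        self.file.write("\n]\n")
        self.file.close()

    def process_item(self, item, spider):
        # The separator goes before every item except the first,
        # so no comma is left dangling before the closing bracket.
        if not self.first_item:
            self.file.write(",\n")
        self.first_item = False
        self.file.write(json.dumps(dict(item), indent=4, sort_keys=True))
        return item


def run_by_hand(items):
    """Drive the pipeline hooks manually and return the parsed output file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "scraped_items.json")
        pipeline = JsonArrayPipeline(path)
        pipeline.open_spider(spider=None)
        for item in items:
            pipeline.process_item(item, spider=None)
        pipeline.close_spider(spider=None)
        with open(path) as f:
            # json.load raises an error if the file is not valid JSON.
            return json.load(f)


sample = [
    {"state": "FL", "games": [{"name": "Game1"}]},
    {"state": "CA", "games": [{"name": "Game2"}]},
]
print(run_by_hand(sample))
```

If json.load raises here, the array written by the pipeline is malformed, most likely because of a stray comma.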

Hope this gets you closer to what you need.
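
If the goal is the single object shown in the question rather than an array of items, one possible approach is to collect the data in memory and write the file once, when the spider closes. This is only a sketch under assumptions that the question does not confirm: every item carries the same state, and each item's games field is a list of {"name": ...} dicts. SingleObjectPipeline and build_document are hypothetical names.

```python
import json
from datetime import datetime, timezone


class SingleObjectPipeline:
    """Collects games during the crawl and writes one JSON object at the end."""

    def open_spider(self, spider):
        self.state = None
        self.games = []

    def process_item(self, item, spider):
        data = dict(item)
        # Assumption: all items share one state, so the last one seen is kept.
        self.state = data.get("state", self.state)
        # Assumption: "games" is a list of {"name": ...} dicts.
        self.games.extend(data.get("games", []))
        return item

    def build_document(self):
        # UTC timestamp with milliseconds and a trailing "Z",
        # matching the date format shown in the question.
        now = datetime.now(timezone.utc)
        stamp = now.isoformat(timespec="milliseconds").replace("+00:00", "Z")
        return {"state": self.state, "date": stamp, "games": self.games}

    def close_spider(self, spider):
        with open("scraped_items.json", "w") as f:
            json.dump(self.build_document(), f, indent=2)
```

It would be registered in ITEM_PIPELINES the same way as the pipeline above. Because everything is held in memory until the spider closes, this suits small crawls better than large ones.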

How do I generate custom JSON output from Scrapy?
