Reading a sequence of JSON objects separated by whitespace and commas with Python

Time: 2017-06-24 10:49:55

Tags: python json

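I had no experience parsing JSON files until last week, when I was given this task: use a Python script to read a 23 MB JSON file and store some specific data to CSV. I have spent the last few days searching for ways to parse it and have looked at various Python implementations, but none of them worked in my case. Here is a sample of the JSON objects in the file: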

 {
        "created": "2017-01-19T04:39:41.012",
        "expired": "2017-01-21T04:39:41.012",
        "id": "0000e0be-d2c6-4a89-ad37-8f71d0dd9e9a",
        "mixed": false,
        "pool_id": "189591",
        "reward": 0.5,
        "status": "EXPIRED",
        "task_suite_id": "f1aa98d6-ff25-4dde-81f5-2587ccbe36af",
        "tasks": [
            {
                "id": "ffbc4048-cc5a-4578-b0d9-0705a588b55d",
                "input_values": {
                    "address-ru": "\u0420\u043e\u0441\u0441\u0438\u044f, \u0421\u0432\u0435\u0440\u0434\u043b\u043e\u0432\u0441\u043a\u0430\u044f \u043e\u0431\u043b\u0430\u0441\u0442\u044c, \u041f\u0435\u0440\u0432\u043e\u0443\u0440\u0430\u043b\u044c\u0441\u043a, 1-\u044f \u041f\u0438\u043b\u044c\u043d\u0430\u044f \u0443\u043b\u0438\u0446\u0430",
                    "company-id": "1542916387",
                    "coordinates": "56.91969408920,60.03087172680",
                    "country": "RU",
                    "language": "RU",
                    "name-ru": "\u0421\u0443\u043f\u0435\u0440\u043c\u0430\u0440\u043a\u0435\u0442",
                    "org-weight": "30",
                    "rubric": [
                        {
                            "name-ru": "\u0421\u0443\u043f\u0435\u0440\u043c\u0430\u0440\u043a\u0435\u0442",
                            "rubric-id": 184108079
                        }
                    ]
                }
            }
        ],
        "user_id": "165684b434e6390fb8da262978601397"
    },
    {
        "created": "2017-02-24T16:08:10.280",
        "expired": "2017-02-26T16:08:10.280",
        "id": "0001b81e-dbcc-4de3-985d-4397b97dbffa",
        "mixed": false,
        "pool_id": "189591",
        "reward": 0.5,
        "status": "EXPIRED",
        "task_suite_id": "5dcbbd70-e570-4026-8246-a30bb462f35d",
        "tasks": [
            {
                "id": "90437e00-d15c-4679-b7be-6d3660efdbce",
                "input_values": {
                    "address-ru": "\u041c\u043e\u0441\u043a\u043e\u0432\u0441\u043a\u0430\u044f \u043e\u0431\u043b., \u041a\u043e\u0440\u043e\u043b\u0435\u0432, \u043c\u0438\u043a\u0440\u043e\u0440\u0430\u0439\u043e\u043d \u0412\u0430\u043b\u0435\u043d\u0442\u0438\u043d\u043e\u0432\u043a\u0430, \u0443\u043b. \u0413\u043e\u0440\u044c\u043a\u043e\u0433\u043e, 12, \u043a\u043e\u0440\u043f.\u0412",
                    "company-id": "662316782",
                    "coordinates": "55.915326,37.869891",
                    "country": "RU",
                    "language": "RU",
                    "meta": [
                        {
                            "permlink-id": 1119957838
                        }
                    ],
                    "name-ru": "\u041d\u0435\u0430\u0442\u044d\u043b",
                    "org-weight": "30",
                    "rubric": [
                        {
                            "name-ru": "\u0420\u0435\u043c\u043e\u043d\u0442 \u0438\u0437\u043c\u0435\u0440\u0438\u0442\u0435\u043b\u044c\u043d\u044b\u0445 \u043f\u0440\u0438\u0431\u043e\u0440\u043e\u0432",
                            "rubric-id": 184106846
                        },
                        {
                            "name-ru": "\u0412\u043e\u0434\u043e\u0441\u0447\u0435\u0442\u0447\u0438\u043a\u0438, \u0433\u0430\u0437\u043e\u0441\u0447\u0435\u0442\u0447\u0438\u043a\u0438, \u0442\u0435\u043f\u043b\u043e\u0441\u0447\u0435\u0442\u0447\u0438\u043a\u0438",
                            "rubric-id": 184106834
                        },
                        {
                            "name-ru": "\u041e\u0442\u043e\u043f\u0438\u0442\u0435\u043b\u044c\u043d\u043e\u0435 \u043e\u0431\u043e\u0440\u0443\u0434\u043e\u0432\u0430\u043d\u0438\u0435 \u0438 \u0441\u0438\u0441\u0442\u0435\u043c\u044b",
                            "rubric-id": 184107475
                        }
                    ]
                }
            }
        ],
        "user_id": "0ba1f0e613c9b1db5fcbddd342e44a15"
    },

...and so on for hundreds of thousands of lines.

If I manually delete the whitespace and commas between the JSON objects, this code (which I found on Stack Overflow) seems to work:


import json

json_objects = []

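def stream_read_json(file):
    """Yield JSON objects that are concatenated back to back in a file."""
    start_pos = 0
    while True:
        try:
            # Succeeds only when exactly one JSON value remains in the file.
            obj = json.load(file)
            yield obj
            return
        except json.JSONDecodeError as e:
            # e.pos is the offset at which the parser stopped: re-read the
            # text up to that offset and parse it as a single object.
            file.seek(start_pos)
            json_str = file.read(e.pos)
            obj = json.loads(json_str)
            start_pos += e.pos
            yield obj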


with open('task1.json', 'r') as source:
    objCount = 0 
    for data in stream_read_json(source):
        json_objects.append(data)
        objCount += 1
        print('Added ' + str(objCount) + 'th json object.')

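But I can't find anywhere how to get rid of this whitespace and these commas while reading the JSON file. Even more frustrating, I can't find any tutorial or manual on writing a JSON parser in Python for different situations, so that I could do this myself without bothering Stack Overflow.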

Any hints and ideas would be much appreciated. Thanks in advance.
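
One possible direction, offered only as a minimal sketch: the standard library's `json.JSONDecoder.raw_decode` parses a single value starting at a given index and returns that value together with the index where it stopped, so the whitespace and commas between objects can be skipped by hand. The helper name `iter_json_objects` and the inline sample string are hypothetical, and the sketch assumes that the whole file fits in memory (23 MB should) and that nothing other than whitespace and commas sits between the objects.

```python
import json


def iter_json_objects(text):
    """Yield each top-level JSON value found in text, one at a time."""
    decoder = json.JSONDecoder()
    # Assumption: only whitespace and commas appear between the objects.
    separators = " \t\r\n,"
    pos = 0
    end = len(text)
    while True:
        # Skip any separators in front of the next object.
        while pos < end and text[pos] in separators:
            pos += 1
        if pos >= end:
            return
        # raw_decode parses one value starting at pos and returns
        # (value, index just past that value).
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Hypothetical two-object sample in the same shape as the file excerpt.
sample = ' {"id": "a", "reward": 0.5},\n    {"id": "b", "reward": 0.5},\n'
for obj in iter_json_objects(sample):
    print(obj["id"])
```

For the real file, the same generator could be fed the full text, for example `iter_json_objects(source.read())` inside the existing `with open('task1.json', 'r') as source:` block.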

0 answers:

No answers