Question

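I have a JSON file that I want to flatten. The function works fine when the file contains only one message, but when there are multiple messages I get the following error: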

    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 39 column 1 (char 952)

Sample JSON file:

{
    "number": "Abc",
    "date": "01.10.2016",
    "name": "R 3932",
    "locations": [
        {
            "depTimeDiffMin": "0",
            "name": "Spital am Pyhrn Bahnhof",
            "arrTime": "",
            "depTime": "06:32",
            "platform": "2",
            "stationIdx": "0",
            "arrTimeDiffMin": "",
            "track": "R 3932"
        },
        {
            "depTimeDiffMin": "0",
            "name": "Windischgarsten Bahnhof",
            "arrTime": "06:37",
            "depTime": "06:40",
            "platform": "2",
            "stationIdx": "1",
            "arrTimeDiffMin": "1",
            "track": ""
        },
        {
            "depTimeDiffMin": "",
            "name": "Linz/Donau Hbf",
            "arrTime": "08:24",
            "depTime": "",
            "platform": "1A-B",
            "stationIdx": "22",
            "arrTimeDiffMin": "1",
            "track": ""
        }
    ]
}

{
    "number": "Xyz",
    "date": "01.10.2016",
    "name": "R 3932",
    "locations": [
        {
            "depTimeDiffMin": "0",
            "name": "Spital am Pyhrn Bahnhof",
            "arrTime": "",
            "depTime": "06:32",
            "platform": "2",
            "stationIdx": "0",
            "arrTimeDiffMin": "",
            "track": "R 3932"
        },
        {
            "depTimeDiffMin": "0",
            "name": "Windischgarsten Bahnhof",
            "arrTime": "06:37",
            "depTime": "06:40",
            "platform": "2",
            "stationIdx": "1",
            "arrTimeDiffMin": "1",
            "track": ""
        },
        {
            "depTimeDiffMin": "",
            "name": "Linz/Donau Hbf",
            "arrTime": "08:24",
            "depTime": "",
            "platform": "1A-B",
            "stationIdx": "22",
            "arrTimeDiffMin": "1",
            "track": ""
        }
    ]
}

My code:

import json
import pandas as pd
import numpy as np
from pandas.io.json import json_normalize


desired_width=500
pd.set_option('display.width', desired_width)
np.set_printoptions(linewidth=desired_width)
pd.set_option('display.max_columns', 100)

with open('C:/Users/username/Desktop/samplejson.json') as f:
    data = json.load(f)


def flatten_json(y):
    out = {}
    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x
    flatten(y)
    return out

for data in data:
    flat = flatten_json(data)
    new_flat = json_normalize(flat)

dfs = pd.DataFrame(new_flat)
print(dfs.head(2))

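I am trying to parse the whole JSON file and load all of the data into a DataFrame so that I can start using it for analysis. If the file contains only one message, the code works fine and the output table is very wide, with a lot of columns.

If the JSON file contains multiple messages, I get the error shown above. I have looked at many solutions on Stack Overflow, but none of them seem to work.

Is there a simpler way to read and flatten the JSON file? I tried pandas' json_normalize, but it only flattens one level.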

Answer 1

If the file contains only one message, it is valid JSON. With more messages (placed the way you have them), it is no longer valid JSON ([JSON]: Introducing JSON). Example:

>>> json.loads("{}")
{}
>>> json.loads("{} {}")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\Install\x64\Python\Python\03.06.08\Lib\json\__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "c:\Install\x64\Python\Python\03.06.08\Lib\json\decoder.py", line 342, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 4 (char 3)
>>> json.loads("[{}, {}]")
[{}, {}]

For more details, check [Python 3]: json - JSON encoder and decoder.

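The simplest way to have valid JSON containing multiple messages:

All of them should be enclosed in square brackets ("[", "]")
Every 2 consecutive messages should be separated by a comma (",")

just like in the "locations" sub-messages.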
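If you cannot change how the file is produced, another option is to read the concatenated objects one at a time instead of fixing the file by hand. This is a minimal sketch of that alternative, not what the answer above describes: it uses json.JSONDecoder.raw_decode from the standard library, and the helper name iter_json_objects and the short sample string are made up for illustration.

```python
import json


def iter_json_objects(text):
    """Yield each top-level JSON value from a string of concatenated JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    length = len(text)
    while pos < length:
        # raw_decode does not skip leading whitespace, so step over it manually
        while pos < length and text[pos].isspace():
            pos += 1
        if pos >= length:
            break
        # raw_decode returns the parsed value and the index where parsing stopped
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Hypothetical sample: two objects back to back, as in the question's file
sample = '{"number": "Abc"}\n\n{"number": "Xyz"}\n'
messages = list(iter_json_objects(sample))

# Dumping the list gives a valid JSON array (brackets plus commas)
print(json.dumps(messages))
```

With a real file you would pass f.read() instead of sample. The resulting list is the same thing json.load would return if the file had been written as a bracketed, comma-separated array.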

Answer 2

You can do it like this. Assume j is the complete JSON object (a list of messages).

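import pandas as pd

def parse(j):
    for item in j:
        # Top-level fields (everything except 'locations') as a one-row frame
        data = pd.DataFrame([{k: v for k, v in item.items() if k != 'locations'}])
        # One row per entry in the nested 'locations' list
        locs = pd.DataFrame(item.get('locations'))
        # Put them side by side and forward-fill the top-level fields down the rows
        # (.ffill() replaces the deprecated fillna(method='ffill'))
        yield pd.concat([data, locs], axis=1).ffill()

pd.concat(parse(j), axis=0, ignore_index=True)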

         date    name number arrTime   ...                       name platform stationIdx   track
0  01.10.2016  R 3932    Abc           ...    Spital am Pyhrn Bahnhof        2          0  R 3932
1  01.10.2016  R 3932    Abc   06:37   ...    Windischgarsten Bahnhof        2          1        
2  01.10.2016  R 3932    Abc   08:24   ...             Linz/Donau Hbf     1A-B         22        
3  01.10.2016  R 3932    Xyz           ...    Spital am Pyhrn Bahnhof        2          0  R 3932
4  01.10.2016  R 3932    Xyz   06:37   ...    Windischgarsten Bahnhof        2          1        
5  01.10.2016  R 3932    Xyz   08:24   ...             Linz/Donau Hbf     1A-B         22

Your JSON is invalid because you are missing the , that separates the two objects.
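Note that the result above ends up with two columns called name (the train name and the station name). If that is a problem, one way around it is to build the rows yourself and prefix the nested keys. This is only a sketch using plain Python; the function name location_rows, the "loc_" prefix, and the shortened sample data are assumptions for illustration.

```python
def location_rows(messages, prefix="loc_"):
    """Return one flat dict per location, with the parent message fields repeated."""
    rows = []
    for message in messages:
        # Top-level fields, excluding the nested list
        parent = {k: v for k, v in message.items() if k != "locations"}
        for loc in message.get("locations", []):
            row = dict(parent)
            # Prefix nested keys so "name" of the station does not clash with
            # "name" of the train
            row.update({prefix + k: v for k, v in loc.items()})
            rows.append(row)
    return rows


# Shortened, hypothetical version of the data from the question
messages = [
    {"number": "Abc", "name": "R 3932",
     "locations": [{"name": "Spital am Pyhrn Bahnhof", "stationIdx": "0"},
                   {"name": "Linz/Donau Hbf", "stationIdx": "22"}]},
    {"number": "Xyz", "name": "R 3932",
     "locations": [{"name": "Spital am Pyhrn Bahnhof", "stationIdx": "0"}]},
]

rows = location_rows(messages)
for row in rows:
    print(row)
```

The list of dicts can then be handed to pd.DataFrame(rows) to get one row per stop with unique column names.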

Unable to flatten a JSON file with multiple values
