Question

I have a series of files, each containing a single JSON object, that I want to read into pandas. This is straightforward:

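import json
import pandas
from pandas import json_normalize  # pandas >= 1.0; older versions: from pandas.io.json import json_normalize

data_unfiltered = [json.load(open(jf)) for jf in json_input_files]
# next call used to be df = pandas.DataFrame(data_unfiltered)
# instead, json_normalize flattens nested dicts
df = json_normalize(data_unfiltered)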

Now a new wrinkle has come up. Some of the input files no longer contain a single JSON object but instead a (Python-style) list of JSON objects: [ { JSON }, { JSON }... ].

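json.load is great because it takes an entire file and turns it straight into parsed JSON; I don't have to parse the file myself at all. How do I now turn a list of files, some of which contain a single JSON object and some of which contain a list of JSON objects, into a list of parsed JSON objects?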

Bonus question: I used to be able to add the filename to each JSON object with

df['filename'] = pandas.Series(json_input_files).values

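But I can't do that anymore, because one input file may now correspond to many JSON objects in the output list. How can I add the filename to each JSON object before putting it into a pandas DataFrame?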

Edit: work in progress, as requested in the comments:

data_unfiltered = []
for jf_file in json_input_files:
    jf = open(jf_file)
    if isinstance(jf, list):  # this is obviously wrong
        for j in jf:
            d = json.load(j)   # this is what I need to fix
            d['details'] = jf_file
            data_unfiltered.append(d)
    else:  # not a list, assume dict
        d = json.load(jf)
        d['details'] = jf_file
        data_unfiltered.append(d)

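But json.load() does exactly what I need (a file object in, parsed JSON out), and there is no equivalent for arrays. I think I would have to manually split the file into a list of blobs and then call json.loads() on each blob? That seems really clunky.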
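
For what it's worth, here is a minimal sketch of one way to handle both file shapes, assuming every file holds either a single JSON object or a JSON array of objects. It relies on the fact that json.load parses whatever top-level JSON value the file contains, so a file holding an array comes back as a Python list and no manual splitting is needed. The helper name load_json_records, its name_key parameter, and the "filename" key are hypothetical choices made for illustration only.

```python
import json


def load_json_records(paths, name_key="filename"):
    """Flatten files that hold one JSON object or an array of objects.

    Assumption: every top-level value is a dict or a list of dicts.
    """
    records = []
    for path in paths:
        with open(path, encoding="utf-8") as fh:
            # json.load returns a dict for an object and a list for an array.
            parsed = json.load(fh)
        # Wrap a lone object in a list so one loop handles both shapes.
        items = parsed if isinstance(parsed, list) else [parsed]
        for item in items:
            # Tag each record with its source file before it reaches pandas.
            item[name_key] = str(path)
            records.append(item)
    return records
```

The returned list could then be passed to json_normalize in the same way as data_unfiltered in the first snippet, and each row would carry its source file in the "filename" column.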

Reading JSON / JSON arrays from a set of files

0 answers: