Question

I have never worked with JSON files before. I have this news classification dataset and want to load it into a Pandas DataFrame. It looks like this:

{"content": "Unions representing workers at Turner   Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.","annotation":{"notes":"","label":["Business"]},"extras":null,"metadata":{"first_done_at":1521027375000,"last_updated_at":1521027375000,"sec_taken":0,"last_updated_by":"nlYZXxNBQefF2u9VX52CdONFp0C3","status":"done","evaluation":"NONE"}}
{"content": "SPACE.com - TORONTO, Canada -- A second\\team of rocketeers competing for the  #36;10 million Ansari X Prize, a contest for\\privately funded suborbital space flight, has officially announced the first\\launch date for its manned rocket.","annotation":{"notes":"","label":["SciTech"]},"extras":null,"metadata":{"first_done_at":1521027375000,"last_updated_at":1521027375000,"sec_taken":0,"last_updated_by":"nlYZXxNBQefF2u9VX52CdONFp0C3","status":"done","evaluation":"NONE"}}

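There are more entries, but I have posted only two of them. Each entry is enclosed in {} and has 4 keys: "content", "annotation", "extras", and "metadata". I want to use these keys as the columns of the DataFrame.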

I tried the json library and the pandas.read_json function, but both gave me errors.

with open('News-Classification-DataSet.json') as data_file:
  df=json.load(data_file)

This gives an error: JSONDecodeError: Extra data: line 2 column 1 (char 378)

Answer 1

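I believe you have to read this file line by line, because the file as a whole is not valid JSON: each line is a separate JSON object.

So read it like this: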

import json

data = []
with open('News-Classification-DataSet.json') as f:
    for line in f:
        data.append(json.loads(line))

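Now you should be able to work with it. But what do you want the DataFrame output to look like?

If you want to go straight to a DataFrame, you can do as suggested: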

import pandas as pd

df = pd.read_json("News-Classification-DataSet.json", lines=True)

However, you have nested columns, and I don't know how you want to handle them.
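One possible way to handle the nested columns, if you simply want every nested key as its own column, is pd.json_normalize on the list of parsed records. This is only a sketch: sample_lines is a hypothetical stand-in with two shortened records, and with the real file you would build data with the line-by-line loop instead.

```python
import json

import pandas as pd

# Hypothetical, shortened stand-ins for two lines of the real file
# (assumption: one JSON object per line, as shown in the question).
sample_lines = [
    '{"content": "Unions representing workers at Turner Newall ...", '
    '"annotation": {"notes": "", "label": ["Business"]}, "extras": null, '
    '"metadata": {"sec_taken": 0, "status": "done"}}',
    '{"content": "SPACE.com - TORONTO, Canada ...", '
    '"annotation": {"notes": "", "label": ["SciTech"]}, "extras": null, '
    '"metadata": {"sec_taken": 0, "status": "done"}}',
]

# Parse each line on its own, exactly like the loop over the file.
data = [json.loads(line) for line in sample_lines]

# json_normalize flattens nested dicts into dot-separated column names,
# e.g. "annotation.label" and "metadata.status".
df = pd.json_normalize(data)
print(df.columns.tolist())
```

Lists such as the label value are not expanded, so "annotation.label" still holds a list in each row.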

Answer 2

To load line-delimited JSON into a DataFrame:

import pandas as pd

df = pd.read_json("News-Classification-DataSet.json", lines=True)

To expand the dicts in the columns into separate columns:

pd.concat(
    [
        df["annotation"].apply(pd.Series),
        df[["content", "extras"]],
        df["metadata"].apply(pd.Series),
    ],
    axis=1,
)
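As a follow-up, here is a self-contained sketch of the same two steps on in-memory data, plus one extra step that is my own assumption about what you might want: turning the one-element label list into a plain string. The raw string holds two hypothetical, shortened records, and flat is just an illustrative name for the result.

```python
import io

import pandas as pd

# Hypothetical, shortened stand-in for the file contents:
# one JSON object per line.
raw = "\n".join(
    [
        '{"content": "Unions representing workers at Turner Newall ...", '
        '"annotation": {"notes": "", "label": ["Business"]}, "extras": null, '
        '"metadata": {"sec_taken": 0, "status": "done"}}',
        '{"content": "SPACE.com - TORONTO, Canada ...", '
        '"annotation": {"notes": "", "label": ["SciTech"]}, "extras": null, '
        '"metadata": {"sec_taken": 0, "status": "done"}}',
    ]
)

# lines=True tells pandas to parse each line as a separate JSON object.
df = pd.read_json(io.StringIO(raw), lines=True)

# Expand the dict-valued columns and keep the plain ones next to them.
flat = pd.concat(
    [
        df["annotation"].apply(pd.Series),
        df[["content", "extras"]],
        df["metadata"].apply(pd.Series),
    ],
    axis=1,
)

# Each "label" cell is a list; keep its first element (None if empty).
flat["label"] = flat["label"].map(lambda labels: labels[0] if labels else None)
print(flat[["content", "label", "status"]])
```

If an entry can carry several labels, keeping the list (or using DataFrame.explode) is probably a better fit than taking the first element.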

How do I store this JSON file in a Pandas DataFrame?
