I have never worked with JSON files before. I have this news classification dataset and I want to load it into a pandas DataFrame. It looks like this:
{"content": "Unions representing workers at Turner Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.","annotation":{"notes":"","label":["Business"]},"extras":null,"metadata":{"first_done_at":1521027375000,"last_updated_at":1521027375000,"sec_taken":0,"last_updated_by":"nlYZXxNBQefF2u9VX52CdONFp0C3","status":"done","evaluation":"NONE"}}
{"content": "SPACE.com - TORONTO, Canada -- A second\\team of rocketeers competing for the #36;10 million Ansari X Prize, a contest for\\privately funded suborbital space flight, has officially announced the first\\launch date for its manned rocket.","annotation":{"notes":"","label":["SciTech"]},"extras":null,"metadata":{"first_done_at":1521027375000,"last_updated_at":1521027375000,"sec_taken":0,"last_updated_by":"nlYZXxNBQefF2u9VX52CdONFp0C3","status":"done","evaluation":"NONE"}}
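There are more entries, but I have posted only two of them. Each entry is enclosed in {} and has 4 keys: "content", "annotation", "extras", "metadata". I want to use these keys as the columns of the DataFrame.
I tried the json library and the pandas.read_json function, but both gave me errors.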
import json
with open('News-Classification-DataSet.json') as data_file:
    df = json.load(data_file)
This gives an error: JSONDecodeError: Extra data: line 2 column 1 (char 378)
Answer 0 (score: 2)
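I believe you have to read this file line by line, because the file as a whole is not valid JSON: each line is a separate JSON object, so json.load stops with "Extra data" as soon as it reaches the second line.
So read it like this: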
import json
data = []
with open('News-Classification-DataSet.json') as f:
    for line in f:
        # each line is one complete JSON object
        data.append(json.loads(line))
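Now you should be able to work with it, but what do you want the DataFrame output to look like?
If you want to go straight to a DataFrame, you can do as suggested: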
import pandas as pd
df = pd.read_json("News-Classification-DataSet.json", lines=True)
But you have nested columns, and I don't know how you want to handle them.
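One possible way to deal with the nested columns, sketched here as an assumption about what the asker wants, is to pass the parsed records to pd.json_normalize, which flattens nested dicts into dotted column names such as annotation.label. The two records below are shortened, made-up stand-ins for real lines of the file, and the names raw_lines, records and flat are only illustrative.

```python
import json
import pandas as pd

# Shortened, made-up records in the same shape as the dataset lines (assumption).
raw_lines = [
    '{"content": "first story", "annotation": {"notes": "", "label": ["Business"]}, '
    '"extras": null, "metadata": {"sec_taken": 0, "status": "done"}}',
    '{"content": "second story", "annotation": {"notes": "", "label": ["SciTech"]}, '
    '"extras": null, "metadata": {"sec_taken": 0, "status": "done"}}',
]

# Parse each line on its own, exactly as in the loop above.
records = [json.loads(line) for line in raw_lines]

# Flatten nested dicts; nested keys become "parent.child" columns.
flat = pd.json_normalize(records)
print(flat.columns.tolist())
```

With the real file, records would be the data list built by the loop above. Note that label stays a list in each row, since json_normalize only flattens dicts.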
Answer 1 (score: 2)
To load line-delimited JSON into a DataFrame:
import pandas as pd
df = pd.read_json("News-Classification-DataSet.json", lines=True)
To expand the dicts in those columns into separate columns:
pd.concat(
    [
        df["annotation"].apply(pd.Series),
        df[["content", "extras"]],
        df["metadata"].apply(pd.Series),
    ],
    axis=1,
)
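As a self-contained illustration of the two steps above, here is a minimal sketch that runs on an in-memory sample instead of the real file. The two records are shortened, made-up stand-ins for the dataset lines, and the final step that pulls the single label out of its list is an assumption about the desired output, not something the question asks for.

```python
import io
import pandas as pd

# Shortened, made-up sample in line-delimited JSON form (assumption).
sample = io.StringIO(
    '{"content": "first story", "annotation": {"notes": "", "label": ["Business"]}, '
    '"extras": null, "metadata": {"sec_taken": 0, "status": "done"}}\n'
    '{"content": "second story", "annotation": {"notes": "", "label": ["SciTech"]}, '
    '"extras": null, "metadata": {"sec_taken": 0, "status": "done"}}\n'
)

# One JSON object per line, so lines=True is required.
df = pd.read_json(sample, lines=True)

# Expand the dict-valued columns and keep the plain ones alongside them.
expanded = pd.concat(
    [
        df["annotation"].apply(pd.Series),
        df[["content", "extras"]],
        df["metadata"].apply(pd.Series),
    ],
    axis=1,
)

# Each label is a one-element list here; take its first element.
expanded["label"] = expanded["label"].str[0]
print(expanded)
```

Remember to assign the result of pd.concat to a variable, as done here with expanded, since it returns a new DataFrame rather than modifying df.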