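I have yearly usage data for different applications in JSON format. Each application has 10 different JSON files. I am trying to combine them into a single CSV. Let me first show you the data structure: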
[{"date": "2017-10-23", "downloads": 15358985, "end": "2017-10-23", "data": {"2.7.3.4196-beta": 7, "1.0.1": 268, "1.0.2": 715, "2.9.0.4250-beta": 1, "2.7.3.4215-beta": 2, "2.7.2.4151-beta": 1, "2.2.3.1-signed": 9292}}, {"date": "2017-10-22", "downloads": 12778233, "end": "2017-10-22", "data": {"2.7.3.4196-beta": 5, "2.4.1": 842, "2.99.0.1872beta": 12, "2.99.0.1857beta": 4, "2.3.1.1-signed": 3, "2.6.10": 11538, "2.6.4.1-signed": 8, "2.7.3.4198-beta": 4}}]
When I parse them into a pandas DataFrame, I get something like this:
date downloads end data
2017-10-23 15358985 2017-10-23 {"2.7.3.4196-beta": 7, "1.0.1": 268, "1.0.2": 715, "2.9.0.4250-beta": 1, "2.7.3.4215-beta": 2, "2.7.2.4151-beta": 1, "2.2.3.1-signed": 9292}
2017-10-22 12778233 2017-10-22 {"2.7.3.4196-beta": 5, "2.4.1": 842, "2.99.0.1872beta": 12, "2.99.0.1857beta": 4, "2.3.1.1-signed": 3, "2.6.10": 11538, "2.6.4.1-signed": 8, "2.7.3.4198-beta": 4}
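Note that not every version is downloaded every day. How can I create one column per application version? If a version of the application was not downloaded on a particular date, the cell can be left empty or filled with NaN.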
Answer 0 (score: 2)
I think you need the DataFrame constructor followed by reindex to add the missing rows:
j = [{"date": "2017-10-25", "downloads": 15358985, "end": "2017-10-23", "data": {"2.7.3.4196-beta": 7, "1.0.1": 268, "1.0.2": 715, "2.9.0.4250-beta": 1, "2.7.3.4215-beta": 2, "2.7.2.4151-beta": 1, "2.2.3.1-signed": 9292}}, {"date": "2017-10-22", "downloads": 12778233, "end": "2017-10-22", "data": {"2.7.3.4196-beta": 5, "2.4.1": 842, "2.99.0.1872beta": 12, "2.99.0.1857beta": 4, "2.3.1.1-signed": 3, "2.6.10": 11538, "2.6.4.1-signed": 8, "2.7.3.4198-beta": 4}}]
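import pandas as pd
# Use the date column as the index and convert it to datetime
df = pd.DataFrame(j).set_index('date')
df.index = pd.to_datetime(df.index)
# Reindex onto a continuous daily range so that missing days become NaN rows
df = df.reindex(pd.date_range(start=df.index.min(), end=df.index.max()))
print(df)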
data downloads \
2017-10-22 {'2.6.4.1-signed': 8, '2.99.0.1857beta': 4, '2... 12778233.0
2017-10-23 NaN NaN
2017-10-24 NaN NaN
2017-10-25 {'2.7.2.4151-beta': 1, '1.0.1': 268, '2.9.0.42... 15358985.0
end
2017-10-22 2017-10-22
2017-10-23 NaN
2017-10-24 NaN
2017-10-25 2017-10-23
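The data column in the output above still holds one dict per row. As a minimal sketch that is not part of the original answer, the dicts can be expanded into one column per version and joined back onto the other columns. The names versions and wide are made up for this illustration, and j is a shortened sample in the same shape as the question's data. The expansion is done before reindex, because after reindexing the data column contains NaN entries alongside the dicts.

```python
import pandas as pd

# Shortened sample in the same shape as the question's data
j = [
    {"date": "2017-10-25", "downloads": 15358985, "end": "2017-10-23",
     "data": {"1.0.1": 268, "1.0.2": 715, "2.7.3.4196-beta": 7}},
    {"date": "2017-10-22", "downloads": 12778233, "end": "2017-10-22",
     "data": {"2.4.1": 842, "2.7.3.4196-beta": 5}},
]

df = pd.DataFrame(j).set_index('date')
df.index = pd.to_datetime(df.index)

# Build one column per version from the list of dicts; a version that is
# absent on a given day becomes NaN in that row
versions = pd.DataFrame(df['data'].tolist(), index=df.index)

# Replace the dict column with the expanded version columns
wide = df.drop(columns='data').join(versions)

# Only now add the missing calendar days as all-NaN rows
wide = wide.reindex(pd.date_range(start=wide.index.min(), end=wide.index.max()))
print(wide)
```

Unlike json_normalize, this keeps the version strings as plain column names without a data. prefix.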
A solution with json_normalize, but if the JSON objects have different keys you get a lot of NaN values:
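from pandas import json_normalize  # pandas >= 1.0; older versions: from pandas.io.json import json_normalize
# Flatten the nested "data" dict into one "data.<version>" column per version
df = json_normalize(j).set_index('date')
df.index = pd.to_datetime(df.index)
# Reindex onto a continuous daily range so that missing days become NaN rows
df = df.reindex(pd.date_range(start=df.index.min(), end=df.index.max()))
print(df)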
data.1.0.1 data.1.0.2 data.2.2.3.1-signed data.2.3.1.1-signed \
2017-10-22 NaN NaN NaN 3.0
2017-10-23 NaN NaN NaN NaN
2017-10-24 NaN NaN NaN NaN
2017-10-25 268.0 715.0 9292.0 NaN
data.2.4.1 data.2.6.10 data.2.6.4.1-signed \
2017-10-22 842.0 11538.0 8.0
2017-10-23 NaN NaN NaN
2017-10-24 NaN NaN NaN
2017-10-25 NaN NaN NaN
data.2.7.2.4151-beta data.2.7.3.4196-beta data.2.7.3.4198-beta \
2017-10-22 NaN 5.0 4.0
2017-10-23 NaN NaN NaN
2017-10-24 NaN NaN NaN
2017-10-25 1.0 7.0 NaN
data.2.7.3.4215-beta data.2.9.0.4250-beta data.2.99.0.1857beta \
2017-10-22 NaN NaN 4.0
2017-10-23 NaN NaN NaN
2017-10-24 NaN NaN NaN
2017-10-25 2.0 1.0 NaN
data.2.99.0.1872beta downloads end
2017-10-22 12.0 12778233.0 2017-10-22
2017-10-23 NaN NaN NaN
2017-10-24 NaN NaN NaN
2017-10-25 NaN 15358985.0 2017-10-23
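The question also asks how to merge several JSON files into a single CSV, which the answer does not cover. The following is only a rough sketch under some assumptions: every file holds a list of records shaped like the sample above, pandas 1.0 or newer is installed (for pd.json_normalize), and the helper name combine_json_files and the source_file column are hypothetical names introduced for this example.

```python
import json
from pathlib import Path

import pandas as pd


def combine_json_files(json_paths, csv_path):
    """Flatten several JSON files into one CSV (hypothetical helper)."""
    frames = []
    for path in json_paths:
        with open(path, encoding="utf-8") as fh:
            # Expected: a list of {"date", "downloads", "end", "data"} records
            records = json.load(fh)
        frame = pd.json_normalize(records)
        # Assumption: tag each row with the name of the file it came from
        frame["source_file"] = Path(path).name
        frames.append(frame)

    # concat aligns on column names; versions missing from a file become NaN
    combined = pd.concat(frames, ignore_index=True, sort=False)
    # Drop the "data." prefix that json_normalize adds to the nested keys
    combined.columns = [
        c[len("data."):] if c.startswith("data.") else c
        for c in combined.columns
    ]
    combined["date"] = pd.to_datetime(combined["date"])
    combined = combined.sort_values(["source_file", "date"])
    # NaN values are written to the CSV as empty fields
    combined.to_csv(csv_path, index=False)
    return combined
```

It could be called as combine_json_files(["app1_part1.json", "app1_part2.json"], "combined.csv"), where the file names are placeholders. With many distinct versions the resulting table is wide and mostly empty, which matches the question's request to leave missing versions blank.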