Parsing unstructured JSON into CSV

Time: 2017-10-26 10:24:11

Tags: python json pandas csv

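I have yearly app data for different applications in JSON format. Each application has 10 different JSON files. I am trying to merge them into a single CSV. Let me first show you the data structure: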

[{"date": "2017-10-23", "downloads": 15358985, "end": "2017-10-23", "data": {"2.7.3.4196-beta": 7, "1.0.1": 268, "1.0.2": 715, "2.9.0.4250-beta": 1, "2.7.3.4215-beta": 2, "2.7.2.4151-beta": 1, "2.2.3.1-signed": 9292}}, {"date": "2017-10-22", "downloads": 12778233, "end": "2017-10-22", "data": {"2.7.3.4196-beta": 5,  "2.4.1": 842, "2.99.0.1872beta": 12, "2.99.0.1857beta": 4, "2.3.1.1-signed": 3, "2.6.10": 11538,  "2.6.4.1-signed": 8, "2.7.3.4198-beta": 4}}]

When I parse them into a pandas DataFrame, I get something like this:

date         downloads  end         data
2017-10-23   15358985   2017-10-23  {"2.7.3.4196-beta": 7, "1.0.1": 268, "1.0.2": 715, "2.9.0.4250-beta": 1, "2.7.3.4215-beta": 2, "2.7.2.4151-beta": 1, "2.2.3.1-signed": 9292}
2017-10-22   12778233   2017-10-22  {"2.7.3.4196-beta": 5,  "2.4.1": 842, "2.99.0.1872beta": 12, "2.99.0.1857beta": 4, "2.3.1.1-signed": 3, "2.6.10": 11538,  "2.6.4.1-signed": 8, "2.7.3.4198-beta": 4}

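Note that not every version is downloaded every day. How can I create one column for each version of the app? If a version was not downloaded on a given date, we can leave the cell empty or fill it with NaN.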

1 answer:

Answer 0: (score: 2)

I think you need the DataFrame constructor together with reindex to add the missing rows:

import pandas as pd

j = [{"date": "2017-10-25", "downloads": 15358985, "end": "2017-10-23", "data": {"2.7.3.4196-beta": 7, "1.0.1": 268, "1.0.2": 715, "2.9.0.4250-beta": 1, "2.7.3.4215-beta": 2, "2.7.2.4151-beta": 1, "2.2.3.1-signed": 9292}}, {"date": "2017-10-22", "downloads": 12778233, "end": "2017-10-22", "data": {"2.7.3.4196-beta": 5,  "2.4.1": 842, "2.99.0.1872beta": 12, "2.99.0.1857beta": 4, "2.3.1.1-signed": 3, "2.6.10": 11538,  "2.6.4.1-signed": 8, "2.7.3.4198-beta": 4}}]
# Use the date as the index and convert it to datetimes
df = pd.DataFrame(j).set_index('date')
df.index = pd.to_datetime(df.index)

# Reindex over the full daily range so that missing days become NaN rows
df = df.reindex(pd.date_range(start=df.index.min(), end=df.index.max()))
print(df)
                                                         data   downloads  \
2017-10-22  {'2.6.4.1-signed': 8, '2.99.0.1857beta': 4, '2...  12778233.0   
2017-10-23                                                NaN         NaN   
2017-10-24                                                NaN         NaN   
2017-10-25  {'2.7.2.4151-beta': 1, '1.0.1': 268, '2.9.0.42...  15358985.0   

                   end  
2017-10-22  2017-10-22  
2017-10-23         NaN  
2017-10-24         NaN  
2017-10-25  2017-10-23  
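
The frame above still keeps each day's per-version counts as a dict in the data column. Here is a minimal sketch of one way to spread that dict into one column per version without json_normalize. It assumes every record carries a data dict. The sample records are shortened from the question, and the names versions and wide are only illustrative:

```python
import pandas as pd

# Shortened sample records in the same shape as the question's data
records = [
    {"date": "2017-10-23", "downloads": 15358985, "end": "2017-10-23",
     "data": {"1.0.1": 268, "1.0.2": 715}},
    {"date": "2017-10-22", "downloads": 12778233, "end": "2017-10-22",
     "data": {"2.4.1": 842, "1.0.2": 3}},
]

df = pd.DataFrame(records).set_index("date")

# Build a frame from the list of dicts: the columns are the union of all
# version keys, and a version missing on a given day becomes NaN
versions = pd.DataFrame(df["data"].tolist(), index=df.index)

# Swap the dict column for the expanded per-version columns
wide = df.drop(columns="data").join(versions)
print(wide)
```

If the gaps between dates should also appear as rows, the same reindex call can be applied to wide once its index has been converted with pd.to_datetime.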

A solution with json_normalize, but if the JSON records have different formats you get a lot of NaN values:

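# pandas >= 1.0; on older versions use: from pandas.io.json import json_normalize
df = pd.json_normalize(j).set_index('date')
df.index = pd.to_datetime(df.index)

# Fill in the missing days as NaN rows
df = df.reindex(pd.date_range(start=df.index.min(), end=df.index.max()))
# Column order in the printed output depends on the pandas version
print(df)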
            data.1.0.1  data.1.0.2  data.2.2.3.1-signed  data.2.3.1.1-signed  \
2017-10-22         NaN         NaN                  NaN                  3.0   
2017-10-23         NaN         NaN                  NaN                  NaN   
2017-10-24         NaN         NaN                  NaN                  NaN   
2017-10-25       268.0       715.0               9292.0                  NaN   

            data.2.4.1  data.2.6.10  data.2.6.4.1-signed  \
2017-10-22       842.0      11538.0                  8.0   
2017-10-23         NaN          NaN                  NaN   
2017-10-24         NaN          NaN                  NaN   
2017-10-25         NaN          NaN                  NaN   

            data.2.7.2.4151-beta  data.2.7.3.4196-beta  data.2.7.3.4198-beta  \
2017-10-22                   NaN                   5.0                   4.0   
2017-10-23                   NaN                   NaN                   NaN   
2017-10-24                   NaN                   NaN                   NaN   
2017-10-25                   1.0                   7.0                   NaN   

            data.2.7.3.4215-beta  data.2.9.0.4250-beta  data.2.99.0.1857beta  \
2017-10-22                   NaN                   NaN                   4.0   
2017-10-23                   NaN                   NaN                   NaN   
2017-10-24                   NaN                   NaN                   NaN   
2017-10-25                   2.0                   1.0                   NaN   

            data.2.99.0.1872beta   downloads         end  
2017-10-22                  12.0  12778233.0  2017-10-22  
2017-10-23                   NaN         NaN         NaN  
2017-10-24                   NaN         NaN         NaN  
2017-10-25                   NaN  15358985.0  2017-10-23
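
The question also asks about merging several JSON files into one CSV, which the answer does not show. Here is a rough sketch of how that could look, assuming every file holds a list of records shaped like the sample. The helper names load_app_file and json_files_to_csv, the source_file column and the glob pattern are hypothetical, not part of the original answer:

```python
import glob
import json

import pandas as pd


def load_app_file(path):
    """Flatten one JSON file into a frame with one column per version."""
    with open(path, encoding="utf-8") as fh:
        records = json.load(fh)
    frame = pd.json_normalize(records)  # requires pandas >= 1.0
    frame["source_file"] = path  # hypothetical column to trace each row
    return frame


def json_files_to_csv(pattern, out_path):
    """Concatenate all files matching pattern and write a single CSV."""
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files match {pattern!r}")
    frames = [load_app_file(p) for p in paths]
    # concat aligns on column names; versions absent from a file become NaN
    combined = pd.concat(frames, ignore_index=True, sort=False)
    combined.to_csv(out_path, index=False)  # NaN is written as an empty cell
    return combined
```

Calling json_files_to_csv("app_data/*.json", "combined.csv") would then produce one row per day and file, with empty cells for the versions that were not downloaded that day. The directory and file names here are placeholders.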