Converting a nested JSON file with repeated keys into a DataFrame in Python

Time: 2019-02-13 02:37:21

Tags: python json pandas dataframe

Consider the following snippet of a JSON file that I want to flatten in Python.

{
  "locations" : [ {
    "timestampMs" : "1549913792265",
    "latitudeE7" : 323518421,
    "longitudeE7" : -546166813,
    "accuracy" : 13,
    "altitude" : 1,
    "verticalAccuracy" : 2,
    "activity" : [ {
      "timestampMs" : "1549913286057",
      "activity" : [ {
        "type" : "STILL",
        "confidence" : 100
      } ]
    }, {
      "timestampMs" : "1549913730454",
      "activity" : [ {
        "type" : "DRIVING",
        "confidence" : 100
      } ]
    } ]
  }, {
    "timestampMs" : "1549912693813",
    "latitudeE7" : 323518421,
    "longitudeE7" : -546166813,
    "accuracy" : 13,
    "altitude" : 1,
    "verticalAccuracy" : 2,
    "activity" : [ {
      "timestampMs" : "1549911547308",
      "activity" : [ {
        "type" : "ACTIVE",
        "confidence" : 100
      } ]
    }, {
      "timestampMs" : "1549912330473",
      "activity" : [ {
        "type" : "BIKING",
        "confidence" : 100
      } ]
    } ]
  } ]
}

The goal is to turn it into a flat DataFrame like this:

location_id timestampMs ... verticalAccuracy activity_timestampMs activity_activity_type ...
1           1549913792265   13               1549913286057        "STILL"
1           1549913792265   13               1549913730454        "DRIVING"
etc.

Given that the key "activity" is repeated at different nesting levels, how would one do this?

1 Answer:

Answer 0 (score: 1)

Here is a solution using json_normalize (documentation), assuming the JSON snippet you posted is in a Python dictionary named d.
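
The answer takes for granted that the JSON has already been parsed into d. As a minimal sketch of that step, the standard json module can do the parsing; the inline string (hypothetically named raw here) stands in for the contents of the file, whose name is not given in the question.

```python
import json

# Stand-in for the file contents: the two locations from the question.
raw = """
{
  "locations": [ {
    "timestampMs": "1549913792265",
    "latitudeE7": 323518421, "longitudeE7": -546166813,
    "accuracy": 13, "altitude": 1, "verticalAccuracy": 2,
    "activity": [ {
      "timestampMs": "1549913286057",
      "activity": [ { "type": "STILL", "confidence": 100 } ]
    }, {
      "timestampMs": "1549913730454",
      "activity": [ { "type": "DRIVING", "confidence": 100 } ]
    } ]
  }, {
    "timestampMs": "1549912693813",
    "latitudeE7": 323518421, "longitudeE7": -546166813,
    "accuracy": 13, "altitude": 1, "verticalAccuracy": 2,
    "activity": [ {
      "timestampMs": "1549911547308",
      "activity": [ { "type": "ACTIVE", "confidence": 100 } ]
    }, {
      "timestampMs": "1549912330473",
      "activity": [ { "type": "BIKING", "confidence": 100 } ]
    } ]
  } ]
}
"""

# Parse the text into nested dicts and lists.
# For a file on disk, json.load(open_file_object) does the same job.
d = json.loads(raw)

# The outer 'activity' is a list of groups; each group holds an inner
# 'activity' list with the actual type/confidence records.
first_group = d["locations"][0]["activity"][0]
print(first_group["timestampMs"], first_group["activity"][0]["type"])
```

Note that the same key name, activity, appears at two levels: once for the list of timestamped groups and once for the records inside each group. That is why the record path used later repeats it.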

import pandas as pd
from pandas import json_normalize  # pandas >= 1.0; on older versions: from pandas.io.json import json_normalize

# Build a list of paths to JSON fields that will end up as metadata
# in the final DataFrame
meta = list(d['locations'][0].keys())

# meta is now this:
# ['timestampMs',
# 'latitudeE7',
# 'longitudeE7',
# 'accuracy',
# 'altitude',
# 'verticalAccuracy',
# 'activity']

# Almost correct. We need to remove 'activity' and append
# the list ['activity', 'timestampMs'] to meta.
meta.remove('activity')
meta.append(['activity', 'timestampMs'])

# meta is now this:
# ['timestampMs',
# 'latitudeE7',
# 'longitudeE7',
# 'accuracy',
# 'altitude',
# 'verticalAccuracy',
# ['activity', 'timestampMs']]

# Use json_normalize on the list of dicts
# that lives at d['locations'], passing in
# the appropriate record path and metadata
# paths, and specifying the double 'activity_'
# record prefix.
json_normalize(d['locations'], 
               record_path=['activity', 'activity'], 
               meta=meta,
               record_prefix='activity_activity_')

   activity_activity_confidence activity_activity_type    timestampMs  latitudeE7  longitudeE7  accuracy  altitude  verticalAccuracy activity.timestampMs
0                           100                  STILL  1549913792265   323518421   -546166813        13         1                 2        1549913286057
1                           100                DRIVING  1549913792265   323518421   -546166813        13         1                 2        1549913730454
2                           100                 ACTIVE  1549912693813   323518421   -546166813        13         1                 2        1549911547308
3                           100                 BIKING  1549912693813   323518421   -546166813        13         1                 2        1549912330473
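
The result above names the nested metadata column activity.timestampMs and has no location_id, while the question asked for activity_timestampMs and a location_id column. The following is only a sketch of one way to get closer to that layout, assuming pandas 1.0 or later (where pd.json_normalize is available) and using a trimmed stand-in for d with fewer fields. The sep parameter controls how nested meta paths are joined into column names; the location_id values are simply 1-based positions added before normalizing.

```python
import pandas as pd

# Trimmed stand-in for the question's data (assumption: same shape, fewer fields).
d = {"locations": [
    {"timestampMs": "1549913792265", "accuracy": 13,
     "activity": [
         {"timestampMs": "1549913286057",
          "activity": [{"type": "STILL", "confidence": 100}]},
         {"timestampMs": "1549913730454",
          "activity": [{"type": "DRIVING", "confidence": 100}]}]},
    {"timestampMs": "1549912693813", "accuracy": 13,
     "activity": [
         {"timestampMs": "1549911547308",
          "activity": [{"type": "ACTIVE", "confidence": 100}]}]},
]}

# Copy each location and tag it with a 1-based id, leaving d itself untouched.
locations = [dict(loc, location_id=i)
             for i, loc in enumerate(d["locations"], start=1)]

df = pd.json_normalize(
    locations,
    record_path=["activity", "activity"],
    meta=["location_id", "timestampMs", "accuracy",
          ["activity", "timestampMs"]],
    record_prefix="activity_activity_",
    sep="_",  # the nested meta path becomes 'activity_timestampMs'
)

print(df)
```

Each innermost record becomes one row, and the location-level fields, including location_id, are repeated on every row that belongs to the same location.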

Edit

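If the ['activity', 'activity'] record path is sometimes missing, the code above raises an error. The following workaround handles this particular case, but it is fragile and may be unacceptably slow depending on the size of the input data: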

# Create an example by deleting one of the 'activity' paths 
# from the original dict
del d['locations'][0]['activity']

pd.concat([json_normalize(x, 
                          record_path=['activity', 'activity'] 
                                      if 'activity' in x.keys() else None, 
                          meta=meta, 
                          record_prefix='activity_activity_') 
           for x in d['locations']], 
          axis=0, 
          ignore_index=True,
          sort=False)

   accuracy  altitude  latitudeE7  longitudeE7    timestampMs  verticalAccuracy  activity_activity_confidence activity_activity_type activity.timestampMs
0        13         1   323518421   -546166813  1549913792265                 2                           NaN                    NaN                  NaN
1        13         1   323518421   -546166813  1549912693813                 2                         100.0                 ACTIVE        1549911547308
2        13         1   323518421   -546166813  1549912693813                 2                         100.0                 BIKING        1549912330473
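
As a possible alternative to calling json_normalize once per location, the rows can be assembled with plain loops and handed to pandas in a single step. This is a sketch rather than a tested replacement: the helper name flatten_locations is hypothetical, the column names are chosen to mirror the output above, and the stand-in data is trimmed to a few fields.

```python
def flatten_locations(data):
    """Yield one flat dict per innermost activity record.

    A location with no usable 'activity' list yields a single row that
    carries only its location-level fields.
    """
    for loc in data.get("locations", []):
        # Everything except the nested list is location-level metadata.
        base = {k: v for k, v in loc.items() if k != "activity"}
        emitted = False
        for group in loc.get("activity") or []:
            for rec in group.get("activity") or []:
                row = dict(base)
                row["activity.timestampMs"] = group.get("timestampMs")
                row["activity_activity_type"] = rec.get("type")
                row["activity_activity_confidence"] = rec.get("confidence")
                yield row
                emitted = True
        if not emitted:
            yield dict(base)


# Trimmed stand-in: the first location has no 'activity' key at all.
d = {"locations": [
    {"timestampMs": "1549913792265", "accuracy": 13},
    {"timestampMs": "1549912693813", "accuracy": 13,
     "activity": [
         {"timestampMs": "1549911547308",
          "activity": [{"type": "ACTIVE", "confidence": 100}]},
         {"timestampMs": "1549912330473",
          "activity": [{"type": "BIKING", "confidence": 100}]}]},
]}

rows = list(flatten_locations(d))
for row in rows:
    print(row)
```

Passing rows to pd.DataFrame(rows) gives a single DataFrame in which keys absent from a row show up as NaN, so no per-location concatenation is needed.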