Assume the following snippet of a JSON file, which is to be flattened in Python.
{
"locations" : [ {
"timestampMs" : "1549913792265",
"latitudeE7" : 323518421,
"longitudeE7" : -546166813,
"accuracy" : 13,
"altitude" : 1,
"verticalAccuracy" : 2,
"activity" : [ {
"timestampMs" : "1549913286057",
"activity" : [ {
"type" : "STILL",
"confidence" : 100
} ]
}, {
"timestampMs" : "1549913730454",
"activity" : [ {
"type" : "DRIVING",
"confidence" : 100
} ]
} ]
}, {
"timestampMs" : "1549912693813",
"latitudeE7" : 323518421,
"longitudeE7" : -546166813,
"accuracy" : 13,
"altitude" : 1,
"verticalAccuracy" : 2,
"activity" : [ {
"timestampMs" : "1549911547308",
"activity" : [ {
"type" : "ACTIVE",
"confidence" : 100
} ]
}, {
"timestampMs" : "1549912330473",
"activity" : [ {
"type" : "BIKING",
"confidence" : 100
} ]
} ]
} ]
}
The goal is to turn it into a flat dataframe like this:
location_id timestampMs ... verticalAccuracy activity_timestampMs activity_activity_type ...
1 1549913792265 13 1549913286057 "STILL"
1 1549913792265 13 1549913730454 "DRIVING"
etc.
Given that the key "activity" is repeated at different nesting levels, how would one do this?
Answer 0 (score: 1)
Here is a solution using json_normalize (documentation), assuming the JSON snippet you posted is in a Python dictionary named d.
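import pandas as pd
# pandas >= 1.0 exposes json_normalize at the top level; on older
# versions use: from pandas.io.json import json_normalize
from pandas import json_normalize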
# Build a list of paths to JSON fields that will end up as metadata
# in the final DataFrame
meta = list(d['locations'][0].keys())
# meta is now this:
# ['timestampMs',
# 'latitudeE7',
# 'longitudeE7',
# 'accuracy',
# 'altitude',
# 'verticalAccuracy',
# 'activity']
# Almost correct. We need to remove 'activity' and append
# the list ['activity', 'timestampMs'] to meta.
meta.remove('activity')
meta.append(['activity', 'timestampMs'])
# meta is now this:
# ['timestampMs',
# 'latitudeE7',
# 'longitudeE7',
# 'accuracy',
# 'altitude',
# 'verticalAccuracy',
# ['activity', 'timestampMs']]
# Use json_normalize on the list of dicts
# that lives at d['locations'], passing in
# the appropriate record path and metadata
# paths, and specifying the double 'activity_'
# record prefix.
json_normalize(d['locations'],
record_path=['activity', 'activity'],
meta=meta,
record_prefix='activity_activity_')
activity_activity_confidence activity_activity_type timestampMs latitudeE7 longitudeE7 accuracy altitude verticalAccuracy activity.timestampMs
0 100 STILL 1549913792265 323518421 -546166813 13 1 2 1549913286057
1 100 DRIVING 1549913792265 323518421 -546166813 13 1 2 1549913730454
2 100 ACTIVE 1549912693813 323518421 -546166813 13 1 2 1549911547308
3 100 BIKING 1549912693813 323518421 -546166813 13 1 2 1549912330473
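The layout asked for in the question also has a location_id column, which the output above does not include. Below is a minimal sketch of one way to add it, assuming a 1-based position in the locations list is an acceptable identifier. The helper name add_location_ids is hypothetical and not part of pandas, and the sample data is a trimmed version of the snippet in the question.

```python
def add_location_ids(locations, start=1):
    """Return shallow copies of the location dicts with a 'location_id' key.

    The id is simply the position in the list (an assumption made for
    this sketch); the original dicts are left unmodified.
    """
    return [dict(loc, location_id=i)
            for i, loc in enumerate(locations, start=start)]


# Trimmed sample with the same shape as the JSON in the question
d = {
    "locations": [
        {
            "timestampMs": "1549913792265",
            "accuracy": 13,
            "activity": [
                {"timestampMs": "1549913286057",
                 "activity": [{"type": "STILL", "confidence": 100}]},
            ],
        },
        {
            "timestampMs": "1549912693813",
            "accuracy": 13,
            "activity": [
                {"timestampMs": "1549911547308",
                 "activity": [{"type": "ACTIVE", "confidence": 100}]},
            ],
        },
    ]
}

tagged = add_location_ids(d["locations"])
print([loc["location_id"] for loc in tagged])
```

Passing tagged to json_normalize in place of d['locations'], with 'location_id' appended to meta, should carry the identifier through to every flattened row.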
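If the ['activity', 'activity'] record path is sometimes missing, the code above will raise an error. The following workaround should work for this specific case, but it is fragile and may become unacceptably slow depending on the size of the input data: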
# Create an example by deleting one of the 'activity' paths
# from the original dict
del d['locations'][0]['activity']
pd.concat([json_normalize(x,
record_path=['activity', 'activity']
if 'activity' in x.keys() else None,
meta=meta,
record_prefix='activity_activity_')
for x in d['locations']],
axis=0,
ignore_index=True,
sort=False)
accuracy altitude latitudeE7 longitudeE7 timestampMs verticalAccuracy activity_activity_confidence activity_activity_type activity.timestampMs
0 13 1 323518421 -546166813 1549913792265 2 NaN NaN NaN
1 13 1 323518421 -546166813 1549912693813 2 100.0 ACTIVE 1549911547308
2 13 1 323518421 -546166813 1549912693813 2 100.0 BIKING 1549912330473
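If the pandas workaround proves too slow or too fragile, the same flattening can be sketched in plain Python. The function below is an illustration only: the name flatten_locations is hypothetical, and the column names are chosen to mirror the output above. It assumes each location is a dict and treats a missing or empty 'activity' list as "no activity", emitting one row with None in the activity columns.

```python
def flatten_locations(locations):
    """Flatten location dicts into one row per innermost activity entry."""
    rows = []
    for loc in locations:
        # Every top-level field except the nested 'activity' list
        base = {k: v for k, v in loc.items() if k != "activity"}
        emitted = False
        for outer in loc.get("activity") or []:
            for inner in outer.get("activity") or []:
                row = dict(base)
                row["activity.timestampMs"] = outer.get("timestampMs")
                row["activity_activity_type"] = inner.get("type")
                row["activity_activity_confidence"] = inner.get("confidence")
                rows.append(row)
                emitted = True
        if not emitted:
            # Keep the location even when it has no activity records
            row = dict(base)
            row["activity.timestampMs"] = None
            row["activity_activity_type"] = None
            row["activity_activity_confidence"] = None
            rows.append(row)
    return rows


# Same situation as above: the first location has lost its 'activity' key
sample = [
    {"timestampMs": "1549913792265", "accuracy": 13},
    {
        "timestampMs": "1549912693813",
        "accuracy": 13,
        "activity": [
            {"timestampMs": "1549911547308",
             "activity": [{"type": "ACTIVE", "confidence": 100}]},
            {"timestampMs": "1549912330473",
             "activity": [{"type": "BIKING", "confidence": 100}]},
        ],
    },
]

rows = flatten_locations(sample)
for row in rows:
    print(row)
```

The resulting list of dicts can be handed to pd.DataFrame(rows) to obtain a dataframe with one column per key.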