Question

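I am trying to do some descriptive analysis of my own Google Timeline location data. To get workable data, I first have to convert it from a JSON file to a DataFrame. That raised some questions I would like answered, because I feel I am going about the JSON-to-DataFrame conversion in an inefficient way.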

To describe what my JSON looks like: it is nested 3 levels deep and has about 4.5 million lines. A small example of the JSON:

"locations" : [ 
{
  "timestampMs" : "1489591483",
  "latitudeE7" : -21.61909,
  "longitudeE7" : 121.65283,
  "accuracy" : 23,
  "velocity" : 18,
  "heading" : 182,
  "altitude" : 55,
  "activity" : [ {
    "timestampMs" : "1489591507",
    "activity" : [ {
      "type" : "IN_VEHICLE",
      "confidence" : 49
    }, {
      "type" : "UNKNOWN",
      "confidence" : 17
    }, {
      "type" : "ON_BICYCLE",
      "confidence" : 15
    }, {
      "type" : "ON_FOOT",
      "confidence" : 9
    }, {
      "type" : "STILL",
      "confidence" : 9
    }, {
      "type" : "WALKING",
      "confidence" : 9
    } ]
  } ]
},
...
]

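To convert this to a DataFrame, I want to collapse the 3 levels into a single level (level 0). I have seen some implementations that combine json_normalize with .apply or .append, but you still need to know the keys of the values, and I would rather have something more generic (that is, without knowing the keys). It also requires iterating over the values manually. So what I would like to know is: "Is there a way to automatically flatten the JSON to level 0 without using apply or append?" If there is no such way, what is the preferred way to flatten the JSON and convert it to a DataFrame?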

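Edit: added an example of what the DataFrame should look like, and a better example of the JSON.

For a small example of what the DataFrame should look like, see the image below:

For a better example of what the JSON looks like, I have included a Pastebin URL below: tiny location history sample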

Answer 1

Use json_normalize, specifying record_path and meta.

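import pandas as pd

# d is the parsed JSON dict. One row per innermost activity entry; meta keeps the parent activity's timestampMs.
df = pd.json_normalize(d, ['locations', 'activity', 'activity'],
                       [['locations', 'activity', 'timestampMs']]).add_prefix('activity.')
# Location-level fields, repeated once per innermost activity entry (same row order as df).
v = pd.DataFrame([{k: x for k, x in loc.items() if k != 'activity'}
                  for loc in d['locations'] for act in loc['activity'] for _ in act['activity']])

# Column order may differ from the output below depending on the pandas version.
pd.concat([df, v], axis=1)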


   activity.confidence activity.type activity.locations.activity.timestampMs  \
0                   49    IN_VEHICLE                              1489591507   
1                   17       UNKNOWN                              1489591507   
2                   15    ON_BICYCLE                              1489591507   
3                    9       ON_FOOT                              1489591507   
4                    9         STILL                              1489591507   
5                    9       WALKING                              1489591507   

   accuracy  altitude  heading  latitudeE7  longitudeE7 timestampMs  velocity  
0        23        55      182   -21.61909    121.65283  1489591483        18  
1        23        55      182   -21.61909    121.65283  1489591483        18  
2        23        55      182   -21.61909    121.65283  1489591483        18  
3        23        55      182   -21.61909    121.65283  1489591483        18  
4        23        55      182   -21.61909    121.65283  1489591483        18  
5        23        55      182   -21.61909    121.65283  1489591483        18
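
To make the roles of record_path and meta more concrete, here is a minimal pure-Python sketch of the same idea: follow a path of keys down through the nested lists, emit one row per innermost record, and carry along values picked up on the way. The helper name extract_records and its parameters are hypothetical and exist only for illustration; this is not how pandas implements json_normalize, and unlike pandas it silently skips missing keys.

```python
def extract_records(data, record_path, meta_paths, sep="."):
    """Return one flat dict per innermost record reached via record_path.

    Hypothetical helper for illustration only; it is not pandas' implementation.
    Missing keys are skipped here rather than raising an error.
    """
    rows = []

    def walk(obj, depth, seen):
        seen = dict(seen)  # copy so sibling branches do not share state
        # Pick up meta values that live on the object at this depth.
        for path in meta_paths:
            if len(path) == depth + 1 and path[:depth] == record_path[:depth]:
                seen[sep.join(path)] = obj.get(path[-1])
        if depth == len(record_path):
            rows.append({**obj, **seen})  # innermost record plus inherited meta
            return
        for child in obj.get(record_path[depth], []):
            walk(child, depth + 1, seen)

    walk(data, 0, {})
    return rows


# Shortened version of the sample from the question.
sample = {
    "locations": [{
        "timestampMs": "1489591483",
        "accuracy": 23,
        "activity": [{
            "timestampMs": "1489591507",
            "activity": [
                {"type": "IN_VEHICLE", "confidence": 49},
                {"type": "UNKNOWN", "confidence": 17},
            ],
        }],
    }]
}

rows = extract_records(
    sample,
    record_path=["locations", "activity", "activity"],
    meta_paths=[["locations", "accuracy"], ["locations", "activity", "timestampMs"]],
)
for row in rows:
    print(row)
```

The resulting list of dicts can be passed to pd.DataFrame(rows). The keys still have to be named in meta_paths, which is the limitation the question points out.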

Answer 2

You need the flatten_json library: https://github.com/amirziai/flatten

Then use this function and code to automatically convert deeply nested JSON into a pandas DataFrame.

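import pandas as pd
from flatten_json import flatten
# The testjson sample is provided further below.

def jsonNormalize(data):
    # Flatten every top-level record into a single-level dict; each becomes one DataFrame row.
    dic_flattened = (flatten(dd) for dd in data)
    df = pd.DataFrame(dic_flattened)
    return df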


# flatten(testjson)
df1 = jsonNormalize(testjson)
df1

This produces the DataFrame df1, shown below:

The testjson sample is as follows:

testjson = [{"locations" : [ 
{
  "timestampMs" : "1489591483",
  "latitudeE7" : -21.61909,
  "longitudeE7" : 121.65283,
  "accuracy" : 23,
  "velocity" : 18,
  "heading" : 182,
  "altitude" : 55,
  "activity" : [ {
    "timestampMs" : "1489591507",
    "activity" : [ {
      "type" : "IN_VEHICLE",
      "confidence" : 49
    }, {
      "type" : "UNKNOWN",
      "confidence" : 17
    }, {
      "type" : "ON_BICYCLE",
      "confidence" : 15
    }, {
      "type" : "ON_FOOT",
      "confidence" : 9
    }, {
      "type" : "STILL",
      "confidence" : 9
    }, {
      "type" : "WALKING",
      "confidence" : 9
    } ]
  } ]
}
]},
           {"locations" : [ 
{
  "timestampMs" : "1489591483",
  "latitudeE7" : -21.61909,
  "longitudeE7" : 121.65283,
  "accuracy" : 23,
  "velocity" : 18,
  "heading" : 182,
  "altitude" : 55,
  "activity" : [ {
    "timestampMs" : "1489591507",
    "activity" : [ {
      "type" : "IN_VEHICLE",
      "confidence" : 49
    }, {
      "type" : "UNKNOWN",
      "confidence" : 17
    }, {
      "type" : "ON_BICYCLE",
      "confidence" : 15
    }, {
      "type" : "ON_FOOT",
      "confidence" : 9
    }, {
      "type" : "STILL",
      "confidence" : 9
    }, {
      "type" : "WALKING",
      "confidence" : 9
    } ]
  } ]
}
]}]
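
To show what this kind of flattening does to the nested structure, here is a rough pure-Python sketch. The function flatten_sketch is hypothetical and is not the flatten_json implementation; it assumes the naming scheme of joining keys with an underscore and using list positions as key parts, which is my understanding of that library's default behaviour.

```python
def flatten_sketch(obj, parent_key="", sep="_"):
    """Recursively collapse nested dicts and lists into one flat dict.

    Hypothetical stand-in for illustration; not the flatten_json source.
    """
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)  # list positions become part of the key
    else:
        return {parent_key: obj}  # scalar leaf: stop recursing

    flat = {}
    for key, value in items:
        new_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        flat.update(flatten_sketch(value, new_key, sep))
    return flat


# Shortened version of one element of testjson.
record = {
    "locations": [{
        "timestampMs": "1489591483",
        "activity": [{
            "timestampMs": "1489591507",
            "activity": [
                {"type": "IN_VEHICLE", "confidence": 49},
                {"type": "UNKNOWN", "confidence": 17},
            ],
        }],
    }]
}

flat = flatten_sketch(record)
for key, value in flat.items():
    print(key, "=", value)
```

Each top-level record becomes a single wide row, so the number of columns grows with the length of the nested lists. This differs from Answer 1, which yields one row per innermost activity entry.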

What is the fastest and most generic way to flatten a nested JSON to get a DataFrame?

2 answers: