How to convert any nested JSON into a pandas DataFrame

Time: 2019-02-27 14:31:07

Tags: python json pandas

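I'm currently working on a project that analyzes several data sources. The other sources are fine, but JSON and its sometimes deeply nested structure is giving me a lot of trouble. I tried converting the JSON into a Python dictionary, but without much luck: the approach starts to struggle as the structure gets more complex. Take this sample JSON file, for example: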

{
  "Employees": [
    {
      "userId": "rirani",
      "jobTitleName": "Developer",
      "firstName": "Romin",
      "lastName": "Irani",
      "preferredFullName": "Romin Irani",
      "employeeCode": "E1",
      "region": "CA",
      "phoneNumber": "408-1234567",
      "emailAddress": "romin.k.irani@gmail.com"
    },
    {
      "userId": "nirani",
      "jobTitleName": "Developer",
      "firstName": "Neil",
      "lastName": "Irani",
      "preferredFullName": "Neil Irani",
      "employeeCode": "E2",
      "region": "CA",
      "phoneNumber": "408-1111111",
      "emailAddress": "neilrirani@gmail.com"
    }
  ]
}
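After converting it to a dictionary, calling dict.keys() returns only "Employees". I then turned to a pandas DataFrame, and calling json_normalize(dict['Employees'], sep="_") gives me what I want. My problem is that this has to work for any JSON, and I can't look at the data beforehand, so normalizing this way won't always work. Is there a way to write a function that converts any JSON into a nice pandas DataFrame? I have been searching for an answer for about 2 weeks, but have had no luck with my specific problem. Thanks.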
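For reference, a minimal sketch of the json_normalize call mentioned above. The nested "contact" object is a hypothetical regrouping of two fields from the sample, added only to show what sep="_" does. pd.json_normalize assumes pandas 1.0 or later (older versions expose it as pandas.io.json.json_normalize).

```python
import pandas as pd

# Hypothetical variation of the sample: phoneNumber and emailAddress are
# grouped under an invented "contact" key to create one level of nesting.
data = {
    "Employees": [
        {
            "userId": "rirani",
            "region": "CA",
            "contact": {
                "phoneNumber": "408-1234567",
                "emailAddress": "romin.k.irani@gmail.com",
            },
        }
    ]
}

# Nested keys are joined to their parent key with the separator.
df = pd.json_normalize(data["Employees"], sep="_")
print(df.columns.tolist())
```

This only works because the key that holds the records ("Employees") is known in advance, which is exactly the limitation described in the question.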

3 answers:

Answer 0: (score: 0)

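I've had to do this in the past (flatten a large nested JSON). This blog was really helpful. Would something like this work for you?

Note that, as others have said, making this work for every JSON is a big task; I'm only offering a way to get started if you have more JSON objects in this format. (I'm assuming they are relatively close to what you posted as an example, hopefully with a similar structure.)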

jsonStr = '''{
"Employees" : [
{
"userId":"rirani",
"jobTitleName":"Developer",
"firstName":"Romin",
"lastName":"Irani",
"preferredFullName":"Romin Irani",
"employeeCode":"E1",
"region":"CA",
"phoneNumber":"408-1234567",
"emailAddress":"romin.k.irani@gmail.com"
},
{
"userId":"nirani",
"jobTitleName":"Developer",
"firstName":"Neil",
"lastName":"Irani",
"preferredFullName":"Neil Irani",
"employeeCode":"E2",
"region":"CA",
"phoneNumber":"408-1111111",
"emailAddress":"neilrirani@gmail.com"
}]
}'''

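It flattens the entire JSON into a single row, which can then be put into a DataFrame. In this case that is 1 row with 18 columns. It then iterates through those columns, using the number embedded in each column name to reconstruct multiple rows. If you have a differently nested JSON, I think it should work in theory, but you'll have to test it.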

import json
import pandas as pd
import re

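def flatten_json(y):
    """Flatten nested dicts/lists into one dict with '_'-joined keys."""
    out = {}

    def flatten(x, name=''):
        if isinstance(x, dict):
            # Descend into each key, extending the prefix with the key name.
            for a in x:
                flatten(x[a], name + a + '_')
        elif isinstance(x, list):
            # Descend into each element, extending the prefix with its index.
            for i, a in enumerate(x):
                flatten(a, name + str(i) + '_')
        else:
            # Leaf value: store it under the prefix minus the trailing '_'.
            out[name[:-1]] = x

    flatten(y)
    return out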

jsonObj = json.loads(jsonStr)
flat = flatten_json(jsonObj)



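# Rebuild rows: the first list index embedded in each flattened key
# (e.g. the 0 in 'Employees_0_userId') is used as the row number.
# This assumes every key contains such an index.
results = pd.DataFrame()
for item, value in flat.items():
    row_idx = re.findall(r'_(\d+)_', item)[0]
    # Drop only the first occurrence so deeper indices stay in the column name.
    column = item.replace('_' + row_idx + '_', '_', 1)
    results.loc[int(row_idx), column] = value

print(results)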

Output:

  Employees_userId           ...              Employees_emailAddress
0           rirani           ...             romin.k.irani@gmail.com
1           nirani           ...                neilrirani@gmail.com

[2 rows x 9 columns]
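When each record itself contains a list, the index-in-the-key approach above only separates rows on the first index. As a hedged alternative sketch, pandas' own json_normalize can expand such an inner list with record_path and carry parent fields along with meta. The "phones" key, its "number" field, and the placeholder number are invented for illustration and are not part of the question's data; pd.json_normalize assumes pandas 1.0 or later.

```python
import pandas as pd

# Hypothetical variation of the sample: every employee has a nested list.
data = {
    "Employees": [
        {"userId": "rirani", "phones": [{"number": "408-1234567"}]},
        {
            "userId": "nirani",
            "phones": [{"number": "408-1111111"}, {"number": "000-0000000"}],
        },
    ]
}

# One output row per element of "phones"; "userId" is repeated from the parent.
phones = pd.json_normalize(data["Employees"], record_path="phones", meta=["userId"])
print(phones)
```

Like the answer's code, this still needs the relevant key names up front, so it is a starting point rather than a solution for arbitrary JSON.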

Answer 1: (score: 0)

d={
"Employees" : [
{
"userId":"rirani",
"jobTitleName":"Developer",
"firstName":"Romin",
"lastName":"Irani",
"preferredFullName":"Romin Irani",
"employeeCode":"E1",
"region":"CA",
"phoneNumber":"408-1234567",
"emailAddress":"romin.k.irani@gmail.com"
},
{
"userId":"nirani",
"jobTitleName":"Developer",
"firstName":"Neil",
"lastName":"Irani",
"preferredFullName":"Neil Irani",
"employeeCode":"E2",
"region":"CA",
"phoneNumber":"408-1111111",
"emailAddress":"neilrirani@gmail.com"
}]
}
import pandas as pd
df=pd.DataFrame([x.values() for x in d["Employees"]],columns=d["Employees"][0].keys())
print(df)

Output:

   userId jobTitleName firstName           ...            region  phoneNumber             emailAddress
0  rirani    Developer     Romin           ...                CA  408-1234567  romin.k.irani@gmail.com
1  nirani    Developer      Neil           ...                CA  408-1111111     neilrirani@gmail.com

[2 rows x 9 columns]
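A possible simplification, sketched under the assumption that the records are flat dicts: pd.DataFrame accepts a list of dicts directly and aligns values by key, so it does not depend on every record having the same keys in the same order the way zipping values() against the first record's keys() does. The shortened records below are a trimmed copy of the sample, with one key removed on purpose.

```python
import pandas as pd

records = [
    {"userId": "rirani", "firstName": "Romin", "region": "CA"},
    {"userId": "nirani", "firstName": "Neil"},  # "region" deliberately missing
]

# Columns are the union of all keys; a missing value becomes NaN.
df = pd.DataFrame(records)
print(df)
```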

Answer 2: (score: 0)

For the specific JSON data given, my approach uses only the pandas package, as follows:

import pandas as pd

# json as python's dict object
jsn = {
  "Employees" : [
    {
    "userId":"rirani",
    "jobTitleName":"Developer",
    "firstName":"Romin",
    "lastName":"Irani",
    "preferredFullName":"Romin Irani",
    "employeeCode":"E1",
    "region":"CA",
    "phoneNumber":"408-1234567",
    "emailAddress":"romin.k.irani@gmail.com"
    },
    {
    "userId":"nirani",
    "jobTitleName":"Developer",
    "firstName":"Neil",
    "lastName":"Irani",
    "preferredFullName":"Neil Irani",
    "employeeCode":"E2",
    "region":"CA",
    "phoneNumber":"408-1111111",
    "emailAddress":"neilrirani@gmail.com"
    }]
}

# get the main key, here 'Employees' with index '0'
emp = list(jsn.keys())[0]
# when you have several keys at this level, i.e. 'Employers' for example
# .. you need to handle all of them too (your task)

# get all the sub-keys of the main key[0] 
all_keys = jsn[emp][0].keys()

# build dataframe
result_df = pd.DataFrame()  # init a dataframe
for key in all_keys:
    col_vals = []
    for ea in jsn[emp]:
        col_vals.append(ea[key])
    # add a new column to the dataframe using sub-key as its header
    # it is possible that values here is a nested object(s)
    # .. such as dict, list, json
    result_df[key]=col_vals

print(result_df.to_string())

Output:

   userId lastName jobTitleName  phoneNumber             emailAddress employeeCode preferredFullName firstName region
0  rirani    Irani    Developer  408-1234567  romin.k.irani@gmail.com           E1       Romin Irani     Romin     CA
1  nirani    Irani    Developer  408-1111111     neilrirani@gmail.com           E2        Neil Irani      Neil     CA
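The code comments above leave the case of several top-level keys to the reader. One way that could be sketched, assuming each relevant top-level key holds a list of dicts: build one DataFrame per key. The helper name frames_by_top_level_key, the "Employers" records, and the "version" field are hypothetical, and pd.json_normalize assumes pandas 1.0 or later.

```python
import pandas as pd

def frames_by_top_level_key(doc):
    """Return {key: DataFrame} for every top-level key holding a list of dicts."""
    frames = {}
    for key, value in doc.items():
        is_record_list = (
            isinstance(value, list)
            and len(value) > 0
            and all(isinstance(v, dict) for v in value)
        )
        if is_record_list:
            # Nested dicts inside each record are flattened with "_".
            frames[key] = pd.json_normalize(value, sep="_")
    return frames

doc = {
    "Employees": [{"userId": "rirani"}, {"userId": "nirani"}],
    "Employers": [{"name": "Example Corp"}],  # hypothetical second key
    "version": 1,  # not a list of dicts, so it is skipped
}

frames = frames_by_top_level_key(doc)
for key, frame in frames.items():
    print(key, frame.shape)
```

Keeping one DataFrame per key avoids forcing unrelated record types into a single table; values that are not lists of dicts would need separate handling.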