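I am currently working on a project that analyzes multiple data sources for information. The other data sources are fine, but JSON and its sometimes deeply nested structure is giving me a lot of trouble. I have tried converting the JSON into a Python dictionary, but without much luck, since that approach starts to struggle as the structure gets more complex. For example, take this sample JSON file: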
{
"Employees": [
{
"userId": "rirani",
"jobTitleName": "Developer",
"firstName": "Romin",
"lastName": "Irani",
"preferredFullName": "Romin Irani",
"employeeCode": "E1",
"region": "CA",
"phoneNumber": "408-1234567",
"emailAddress": "romin.k.irani@gmail.com"
},
{
"userId": "nirani",
"jobTitleName": "Developer",
"firstName": "Neil",
"lastName": "Irani",
"preferredFullName": "Neil Irani",
"employeeCode": "E2",
"region": "CA",
"phoneNumber": "408-1111111",
"emailAddress": "neilrirani@gmail.com"
}
]
}
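After converting it to a dictionary, calling dict.keys() returns only "Employees".
I then turned to pandas DataFrames, where calling json_normalize(dict['Employees'], sep="_") does what I want. My problem is that it has to work for any JSON, and I cannot look at the data beforehand, so normalizing this way will not always work. Is there a way to write some kind of function that converts any JSON into a nice pandas DataFrame? I have been searching for an answer for about 2 weeks, but have had no luck with my specific problem. Thanks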
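Answer 0 (score: 0)
I have had to do this in the past (flatten a large nested JSON). This blog was really helpful. Would something like this work for you?
Note that, as others have said, making this work for every JSON is a tall task. I am only offering a way to get started if you have more JSON-formatted objects. I am assuming they are relatively close to what you posted as an example, hopefully with a similar structure.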
jsonStr = '''{
"Employees" : [
{
"userId":"rirani",
"jobTitleName":"Developer",
"firstName":"Romin",
"lastName":"Irani",
"preferredFullName":"Romin Irani",
"employeeCode":"E1",
"region":"CA",
"phoneNumber":"408-1234567",
"emailAddress":"romin.k.irani@gmail.com"
},
{
"userId":"nirani",
"jobTitleName":"Developer",
"firstName":"Neil",
"lastName":"Irani",
"preferredFullName":"Neil Irani",
"employeeCode":"E2",
"region":"CA",
"phoneNumber":"408-1111111",
"emailAddress":"neilrirani@gmail.com"
}]
}'''
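It flattens the entire JSON into a single row, which can then be put into a DataFrame. In this case it creates 1 row with 18 columns. It then iterates through those columns and uses the numeric values in the column names to reconstruct them into multiple rows. If you use a differently nested JSON, I think it should work in theory, but you will have to test it.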
import json
import re

import pandas as pd


def flatten_json(y):
    """Flatten nested dicts and lists into one dict with keys like 'Employees_0_userId'."""
    out = {}

    def flatten(x, name=''):
        if isinstance(x, dict):
            for a in x:
                flatten(x[a], name + a + '_')
        elif isinstance(x, list):
            for i, a in enumerate(x):
                flatten(a, name + str(i) + '_')
        else:
            # Leaf value: drop the trailing '_' from the accumulated name.
            out[name[:-1]] = x

    flatten(y)
    return out


jsonObj = json.loads(jsonStr)
flat = flatten_json(jsonObj)

results = pd.DataFrame()
for item in flat:
    # The first '_<number>_' in the key is the list index; use it as the row index.
    # Note: this assumes every flattened key contains such an index.
    row_idx = re.findall(r'_(\d+)_', item)[0]
    column = item.replace('_' + row_idx + '_', '_')
    results.loc[int(row_idx), column] = flat[item]

print(results)
Output:
Employees_userId ... Employees_emailAddress
0 rirani ... romin.k.irani@gmail.com
1 nirani ... neilrirani@gmail.com
[2 rows x 9 columns]
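The loop above raises an IndexError for any flattened key that has no list index in it (for example a top-level scalar). A minimal sketch of one way around that, assuming such values can simply be placed in row 0; the helper name flat_to_rows, the "source" field, and the shortened sample data are hypothetical and not part of the original answer:

```python
import re

import pandas as pd


def flatten_json(obj, prefix=""):
    """Recursively flatten nested dicts/lists into {'a_0_b': value} pairs."""
    out = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            out.update(flatten_json(value, f"{prefix}{key}_"))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            out.update(flatten_json(value, f"{prefix}{i}_"))
    else:
        out[prefix[:-1]] = obj  # leaf: strip the trailing '_'
    return out


def flat_to_rows(flat):
    """Group flattened keys into rows by the first '_<number>_' they contain."""
    rows = {}
    for key, value in flat.items():
        match = re.search(r"_(\d+)_", key)
        if match:
            row_idx = int(match.group(1))
            column = key.replace(match.group(0), "_", 1)
        else:
            # Assumption: keys without a list index go into row 0.
            row_idx, column = 0, key
        rows.setdefault(row_idx, {})[column] = value
    return pd.DataFrame.from_dict(rows, orient="index").sort_index()


# Shortened, partly hypothetical sample ("source" is not in the question's data).
sample = {
    "source": "hr",
    "Employees": [
        {"userId": "rirani", "region": "CA"},
        {"userId": "nirani", "region": "CA"},
    ],
}
flat = flatten_json(sample)
df = flat_to_rows(flat)
print(df)
```

This still only uses the first list index in each key, so JSON with lists nested inside lists, or with original key names that already look like `_1_`, would need extra handling.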
Answer 1 (score: 0)
d={
"Employees" : [
{
"userId":"rirani",
"jobTitleName":"Developer",
"firstName":"Romin",
"lastName":"Irani",
"preferredFullName":"Romin Irani",
"employeeCode":"E1",
"region":"CA",
"phoneNumber":"408-1234567",
"emailAddress":"romin.k.irani@gmail.com"
},
{
"userId":"nirani",
"jobTitleName":"Developer",
"firstName":"Neil",
"lastName":"Irani",
"preferredFullName":"Neil Irani",
"employeeCode":"E2",
"region":"CA",
"phoneNumber":"408-1111111",
"emailAddress":"neilrirani@gmail.com"
}]
}
import pandas as pd
df=pd.DataFrame([x.values() for x in d["Employees"]],columns=d["Employees"][0].keys())
print(df)
Output:
userId jobTitleName firstName ... region phoneNumber emailAddress
0 rirani Developer Romin ... CA 408-1234567 romin.k.irani@gmail.com
1 nirani Developer Neil ... CA 408-1111111 neilrirani@gmail.com
[2 rows x 9 columns]
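The one-liner above takes the column names from the first record and relies on every record listing its values in the same order. As a hedged alternative sketch, a list of dicts can be passed to pd.DataFrame directly, which aligns values by key, and pd.json_normalize (available as a top-level function since pandas 1.0) also expands nested dicts into separate columns. The nested "contact" field below is a hypothetical addition for illustration and is not in the question's data:

```python
import pandas as pd

# Two records with different key order and a hypothetical nested "contact" dict.
records = [
    {"userId": "rirani", "region": "CA", "contact": {"phone": "408-1234567"}},
    {"region": "CA", "userId": "nirani", "contact": {"phone": "408-1111111"}},
]

# Values are matched to columns by key, not by position.
by_key = pd.DataFrame(records)

# Nested dicts become columns joined with the separator, e.g. "contact_phone".
normalized = pd.json_normalize(records, sep="_")

print(by_key)
print(normalized)
```

In by_key the "contact" column still holds dict objects, while normalized has a flat "contact_phone" column instead.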
Answer 2 (score: 0)
For the specific JSON data given, my approach, which uses only the pandas package, is as follows:
import pandas as pd
# json as python's dict object
jsn = {
"Employees" : [
{
"userId":"rirani",
"jobTitleName":"Developer",
"firstName":"Romin",
"lastName":"Irani",
"preferredFullName":"Romin Irani",
"employeeCode":"E1",
"region":"CA",
"phoneNumber":"408-1234567",
"emailAddress":"romin.k.irani@gmail.com"
},
{
"userId":"nirani",
"jobTitleName":"Developer",
"firstName":"Neil",
"lastName":"Irani",
"preferredFullName":"Neil Irani",
"employeeCode":"E2",
"region":"CA",
"phoneNumber":"408-1111111",
"emailAddress":"neilrirani@gmail.com"
}]
}
# get the main key, here 'Employees' with index '0'
emp = list(jsn.keys())[0]
# when you have several keys at this level, i.e. 'Employers' for example
# .. you need to handle all of them too (your task)
# get all the sub-keys of the main key[0]
all_keys = jsn[emp][0].keys()
# build dataframe
result_df = pd.DataFrame() # init a dataframe
for key in all_keys:
    col_vals = []
    for ea in jsn[emp]:
        col_vals.append(ea[key])
    # add a new column to the dataframe using sub-key as its header
    # it is possible that values here is a nested object(s)
    # .. such as dict, list, json
    result_df[key] = col_vals
print(result_df.to_string())
Output:
userId lastName jobTitleName phoneNumber emailAddress employeeCode preferredFullName firstName region
0 rirani Irani Developer 408-1234567 romin.k.irani@gmail.com E1 Romin Irani Romin CA
1 nirani Irani Developer 408-1111111 neilrirani@gmail.com E2 Neil Irani Neil CA
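The code comments above leave handling several top-level keys as a task for the reader. One possible sketch, assuming each top-level key of interest holds a non-empty list of dicts and that one DataFrame per key is acceptable; the function name frames_by_top_level_key, the "Employers" records, and the "version" field are hypothetical:

```python
import pandas as pd


def frames_by_top_level_key(data):
    """Build one DataFrame per top-level key whose value is a list of dicts."""
    frames = {}
    for key, value in data.items():
        is_record_list = (
            isinstance(value, list)
            and len(value) > 0
            and all(isinstance(v, dict) for v in value)
        )
        if is_record_list:
            # json_normalize also expands nested dicts inside each record.
            frames[key] = pd.json_normalize(value, sep="_")
        # Anything else (scalars, empty lists, plain dicts) is skipped here.
    return frames


# Hypothetical input with two record lists and one scalar.
data = {
    "version": 1,
    "Employees": [
        {"userId": "rirani", "region": "CA"},
        {"userId": "nirani", "region": "CA"},
    ],
    "Employers": [{"name": "ExampleCorp"}],
}
frames = frames_by_top_level_key(data)
for name, frame in frames.items():
    print(name)
    print(frame.to_string())
```

Keeping separate DataFrames avoids forcing unrelated record types into one table; whether they should be merged afterwards depends on the data.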