I'm looking for a Python technique to build a nested JSON file from a flat table in a pandas DataFrame. For example, how could a pandas DataFrame table such as:
   teamname  member  firstname  lastname  orgname  phone         mobile
0  1         0       John       Doe       Anon     916-555-1234
1  1         1       Jane       Doe       Anon     916-555-4321  916-555-7890
2  2         0       Mickey     Moose     Moosers  916-555-0000  916-555-1111
3  2         1       Minny      Moose     Moosers  916-555-2222
be taken and exported to JSON that looks like:
{
    "teams": [
        {
            "teamname": "1",
            "members": [
                {
                    "firstname": "John",
                    "lastname": "Doe",
                    "orgname": "Anon",
                    "phone": "916-555-1234",
                    "mobile": ""
                },
                {
                    "firstname": "Jane",
                    "lastname": "Doe",
                    "orgname": "Anon",
                    "phone": "916-555-4321",
                    "mobile": "916-555-7890"
                }
            ]
        },
        {
            "teamname": "2",
            "members": [
                {
                    "firstname": "Mickey",
                    "lastname": "Moose",
                    "orgname": "Moosers",
                    "phone": "916-555-0000",
                    "mobile": "916-555-1111"
                },
                {
                    "firstname": "Minny",
                    "lastname": "Moose",
                    "orgname": "Moosers",
                    "phone": "916-555-2222",
                    "mobile": ""
                }
            ]
        }
    ]
}
I've tried doing this by creating a dict of dicts and dumping it to JSON. This is my current code:
# sheet_name is the current keyword; older pandas releases spelled it sheetname
data = pandas.read_excel(inputExcel, sheet_name='SCAT Teams')
memberDictTuple = []
for index, row in data.iterrows():
    dataRow = row
    # Pair every column after the first two with its value in this row
    rowDict = dict(zip(columnList[2:], dataRow[2:]))
    teamRowDict = {columnList[0]: int(dataRow.iloc[0])}
    memberId = tuple(row[1:2])
    memberId = memberId[0]
    teamName = tuple(row[0:1])
    teamName = teamName[0]
    # A new one-key dict is built and appended for every row
    memberDict1 = {int(memberId): rowDict}
    memberDict2 = {int(teamName): memberDict1}
    memberDictTuple.append(memberDict2)
memberDictTuple = tuple(memberDictTuple)
formattedJson = json.dumps(memberDictTuple, indent=4, sort_keys=True)
print(formattedJson)
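This produces the following output. Each item is nested at the correct level under "teamname" 1 or 2, but records that share the same teamname should be nested together. How do I fix this so that teamname 1 and teamname 2 each have 2 records nested within them?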
[
    {
        "1": {
            "0": {
                "email": "john.doe@wildlife.net",
                "firstname": "John",
                "lastname": "Doe",
                "mobile": "none",
                "orgname": "Anon",
                "phone": "916-555-1234"
            }
        }
    },
    {
        "1": {
            "1": {
                "email": "jane.doe@wildlife.net",
                "firstname": "Jane",
                "lastname": "Doe",
                "mobile": "916-555-7890",
                "orgname": "Anon",
                "phone": "916-555-4321"
            }
        }
    },
    {
        "2": {
            "0": {
                "email": "mickey.moose@wildlife.net",
                "firstname": "Mickey",
                "lastname": "Moose",
                "mobile": "916-555-1111",
                "orgname": "Moosers",
                "phone": "916-555-0000"
            }
        }
    },
    {
        "2": {
            "1": {
                "email": "minny.moose@wildlife.net",
                "firstname": "Minny",
                "lastname": "Moose",
                "mobile": "none",
                "orgname": "Moosers",
                "phone": "916-555-2222"
            }
        }
    }
]
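For reference, the separate entries in the output above come from appending a brand-new one-key dict for every row, so rows that share a team are never merged. A minimal sketch of the alternative, using hypothetical in-memory rows (the `rows` list and `nest_by_team` helper are illustrative names, not part of the original code): keep a single dict keyed by team and add each member to it.

```python
import json

# Hypothetical rows standing in for data.iterrows(): (teamname, member id, fields)
rows = [
    (1, 0, {"firstname": "John", "lastname": "Doe"}),
    (1, 1, {"firstname": "Jane", "lastname": "Doe"}),
    (2, 0, {"firstname": "Mickey", "lastname": "Moose"}),
    (2, 1, {"firstname": "Minny", "lastname": "Moose"}),
]


def nest_by_team(rows):
    """Collect every member under its team instead of one dict per row."""
    teams = {}
    for team, member, fields in rows:
        # setdefault returns the existing inner dict when the team was seen before
        teams.setdefault(team, {})[member] = fields
    return teams


nested = nest_by_team(rows)
print(json.dumps(nested, indent=4, sort_keys=True))
```

This still keys the records by team and member number rather than using the "teamname"/"members" layout asked for, so it only illustrates the merging step.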
Answer 0 (score: 1)
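Here is a solution that works and creates the desired JSON format. First, I grouped the dataframe by the appropriate columns. Then, instead of creating a dictionary for each column heading/record pair (and losing the data order), I created them as a list of tuples and converted that list into an Ordered Dict. Another Ordered Dict was created for the two columns that everything else is grouped by. The precise layering of lists and Ordered Dicts is necessary for the JSON conversion to produce the correct format. Also note that when dumping to JSON, sort_keys must be set to False, otherwise all of the Ordered Dicts are rearranged into alphabetical order.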
import pandas
import json
from collections import OrderedDict

inputExcel = 'E:\\teams.xlsx'
exportJson = 'E:\\teams.json'
# sheet_name is the current keyword; older pandas releases spelled it sheetname
data = pandas.read_excel(inputExcel, sheet_name='SCAT Teams')

# This creates a tuple of column headings for later use matching them with column data
columnList = tuple(str(col) for col in data.columns)

# This groups the dataframe by the 'teamname' and 'members' columns
grouped = data.groupby(['teamname', 'members']).first()

# This creates a reference to the first index level of the groups (the team numbers)
tm = grouped.index.levels[0]

# Create a list to add team records to at the end of the first 'for' loop
teamsList = []
for teamN in tm:
    teamN = int(teamN)  # added this in to prevent TypeError: 1 is not JSON serializable
    tempList = []  # Create a temporary list to add each record to
    for index, row in grouped.iterrows():
        # Select the record in each row of the grouped dataframe if its index matches the team number
        if index[0] == teamN:
            # To have the JSON records come out in the same order, pair the column
            # headings with the row values as tuples, then convert to an Ordered Dict
            rowDict = OrderedDict(zip(columnList[2:], row))
            tempList.append(rowDict)
    # Create another Ordered Dict to keep 'teamname' and the list of members in order
    t = OrderedDict([('teamname', str(teamN)), ('members', tempList)])
    # Append the Ordered Dict to the empty list of teams created earlier
    teamsList.append(t)

# Create a final dictionary with a single item: the list of teams
teams = {"teams": teamsList}

# Dump to JSON format
# sort_keys MUST be set to False, or all dictionaries will be alphabetized
formattedJson = json.dumps(teams, indent=1, sort_keys=False)
# NaN is how pandas marks missing values; it is not valid JSON, so replace it with "NULL"
formattedJson = formattedJson.replace("NaN", '"NULL"')
print(formattedJson)

# Export to JSON file
with open(exportJson, "w") as parsed:
    parsed.write(formattedJson)
print("\n\nExport to JSON Complete")
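The grouping idea in this answer can also be sketched more compactly. The version below is only an illustration under stated assumptions: the DataFrame is built in memory as a hypothetical stand-in for the 'SCAT Teams' sheet, the grouping columns are assumed to be named teamname and members as in the code above, and `build_teams` is an illustrative helper name. It relies on plain dicts keeping insertion order (Python 3.7+), and it writes missing values as empty strings to match the format requested in the question.

```python
import json

import pandas as pd

# Hypothetical in-memory stand-in for the 'SCAT Teams' sheet
df = pd.DataFrame({
    "teamname": [1, 1, 2, 2],
    "members": [0, 1, 0, 1],
    "firstname": ["John", "Jane", "Mickey", "Minny"],
    "lastname": ["Doe", "Doe", "Moose", "Moose"],
    "orgname": ["Anon", "Anon", "Moosers", "Moosers"],
    "phone": ["916-555-1234", "916-555-4321", "916-555-0000", "916-555-2222"],
    "mobile": [None, "916-555-7890", "916-555-1111", None],
})


def build_teams(frame):
    """Return {"teams": [{"teamname": ..., "members": [...]}, ...]}."""
    # Every column except the two grouping columns belongs to the member record
    member_cols = [c for c in frame.columns if c not in ("teamname", "members")]
    teams = []
    for teamname, group in frame.groupby("teamname"):
        records = [
            # Missing values (None/NaN) become "" so the dump stays valid JSON
            {key: ("" if pd.isna(value) else value) for key, value in rec.items()}
            for rec in group[member_cols].to_dict("records")
        ]
        teams.append({"teamname": str(teamname), "members": records})
    return {"teams": teams}


result = build_teams(df)
print(json.dumps(result, indent=4))
```

Because the missing values are replaced before serialization, no string replacement on the dumped JSON text is needed in this sketch.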
Answer 1 (score: 0)
Based on some input from @root, I took a different approach and came up with the following code, which seems to get most of the way there:
import pandas
import json
from collections import defaultdict

inputExcel = 'E:\\teamsMM.xlsx'
exportJson = 'E:\\teamsMM.json'
# sheet_name is the current keyword; older pandas releases spelled it sheetname
data = pandas.read_excel(inputExcel, sheet_name='SCAT Teams')
grouped = data.groupby(['teamname', 'members']).first()

# Two-level defaultdict: results[teamname][member] holds one row
results = defaultdict(lambda: defaultdict(dict))
for t in grouped.itertuples():
    # t.Index is the (teamname, members) pair from the grouped index
    for i, key in enumerate(t.Index):
        if i == 0:
            nested = results[key]
        elif i == len(t.Index) - 1:
            nested[key] = t  # the whole row tuple is stored here
        else:
            nested = nested[key]

formattedJson = json.dumps(results, indent=4)
formattedJson = '{\n"teams": [\n' + formattedJson + '\n]\n }'
with open(exportJson, "w") as parsed:
    parsed.write(formattedJson)
The resulting JSON file is:
{
    "teams": [
        {
            "1": {
                "0": [
                    [
                        1,
                        0
                    ],
                    "John",
                    "Doe",
                    "Anon",
                    "916-555-1234",
                    "none",
                    "john.doe@wildlife.net"
                ],
                "1": [
                    [
                        1,
                        1
                    ],
                    "Jane",
                    "Doe",
                    "Anon",
                    "916-555-4321",
                    "916-555-7890",
                    "jane.doe@wildlife.net"
                ]
            },
            "2": {
                "0": [
                    [
                        2,
                        0
                    ],
                    "Mickey",
                    "Moose",
                    "Moosers",
                    "916-555-0000",
                    "916-555-1111",
                    "mickey.moose@wildlife.net"
                ],
                "1": [
                    [
                        2,
                        1
                    ],
                    "Minny",
                    "Moose",
                    "Moosers",
                    "916-555-2222",
                    "none",
                    "minny.moose@wildlife.net"
                ]
            }
        }
    ]
}
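This format is very close to the desired end product. The remaining issues are: removing the redundant array [1, 0] that appears above each first name, and making the headers of each nested level "teamname": "1", "members": rather than "1": "0":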
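Also, I don't know why each record is stripped of its headers during the conversion. For example, why is the dictionary entry "firstname": "John" exported as just "John"?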
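A likely explanation for both remaining issues: itertuples() yields namedtuples, and the json module encodes a namedtuple like any other tuple, that is, as an array, so the field names are dropped and the Index pair shows up as the leading [1, 0]. A minimal sketch, using a hypothetical `Row` namedtuple as a stand-in for one row from grouped.itertuples():

```python
import json
from collections import namedtuple

# Hypothetical stand-in for one row yielded by grouped.itertuples()
Row = namedtuple("Row", ["Index", "firstname", "lastname"])
t = Row(Index=(1, 0), firstname="John", lastname="Doe")

# json treats a namedtuple as a plain tuple, so it becomes a JSON array
as_array = json.dumps(t)

# _asdict() keeps the field names; skipping "Index" drops the [1, 0] pair
record = {key: value for key, value in t._asdict().items() if key != "Index"}
as_object = json.dumps(record)

print(as_array)
print(as_object)
```

Storing `record` instead of `t` in the nested dict would keep the "firstname": "John" pairs in the exported file.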