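I have some CSV data that I need to convert into a specific JSON format. I wrote code that handles some of the nesting levels but does not produce the structure I need.
Here is my CSV data: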
title context answers question id
tit1 con1 text1 que1 id1
tit1 con1 text2 que2 id2
tit2 con2 text3 que3 id3
tit2 con2 text4 que4 id4
tit2 con3 text5 que5 id5
My code:
import json

import pandas as pd

df = pd.read_csv('processedOutput.csv')
finalList = []
finalDict = {}
grouped = df.groupby(['context'])
for key, value in grouped:
    dictionary = {}
    j = grouped.get_group(key).reset_index(drop=True)
    dictionary['context'] = j.at[0, 'context']
    dictList = []
    anotherDict = {}
    for i in j.index:
        anotherDict['answers'] = j.at[i, 'answers']
        anotherDict['question'] = j.at[i, 'question']
        anotherDict['id'] = j.at[i, 'id']
        dictList.append(anotherDict)
    dictionary['qas'] = dictList
    finalList.append(dictionary)

data = json.dumps(finalList)
The structure of the output is fine, but every entry in a group contains only that group's last row:
[{"context": "con1",
"qas": [
{"answers": "text2", "question": "que2", "id": "id2"},
{"answers": "text2", "question": "que2", "id": "id2"}
]
},
{"context": "con2",
"qas": [
{"answers": "text4", "question": "que4", "id": "id4"},
{"answers": "text4", "question": "que4", "id": "id4"}
]
},
{"context": "con3",
"qas": [
{"answers": "text5", "question": "que5", "id": "id5"}
]
}
]
I want to nest the data one level deeper, with all fields, like this:
[
{
"title": "tit1",
"paragraph": [
{
"context": "con1",
"qas": [
{"answers": "text1","question": "que1","id": "id1"},
{"answers": "text2","question": "que2","id": "id2"}
]}]
},
{
"title": "tit2",
"paragraph": [
{
"context": "con2",
"qas": [
{"answers": "text3","question": "que3","id": "id3"},
{"answers": "text4","question": "que4","id": "id4"}
]},
{"context": "con3",
"qas": [
{"answers": "text5","question":"que5", "id": "id5"}
]
}
]
}
]
I have been stuck on this for a long time; any suggestions would be great.
Answer 0 (score: 0)
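Your output data needs three levels of grouping: title, paragraph, and question/answer. I suggest using df.groupby(['title', 'context', 'answers']) to drive the loop.
Inside the loop, each group then makes up one question/answer dictionary (assuming the id column contains only unique values). Building the higher-level structures only takes some bookkeeping to detect level changes and append to the appropriate lists and dictionaries. We will use additional groupby levels to do this: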
...
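# Assumes df, finalList = [] and the json import from the question's code.
# Scalar group keys are used so that k1 and k2 are plain strings; with a
# one-element list, pandas 2.0+ yields one-element tuples instead.
g1 = df.groupby('title')
for k1, v1 in g1:
    l2_para_list = []
    l4_qas_list = []
    g2 = v1.groupby('context')
    for k2, v2 in g2:
        g3 = v2.groupby('answers')
        for _, v3 in g3:
            # Build a new dict for every question/answer row.
            qas_dict = {}
            qas_dict['answers'] = v3.answers.item()
            qas_dict['question'] = v3.question.item()
            qas_dict['id'] = v3.id.item()
            l4_qas_list.append(qas_dict)
        # One paragraph dict per context within the current title.
        l3_para_dict = {}
        l3_para_dict['context'] = k2
        l3_para_dict['qas'] = l4_qas_list
        l4_qas_list = []
        l2_para_list.append(l3_para_dict)
        l3_para_dict = {}
    # One top-level dict per title.
    l1_title_dict = {}
    l1_title_dict['title'] = k1
    l1_title_dict['paragraph'] = l2_para_list
    finalList.append(l1_title_dict)
    l1_title_dict = {}
    l2_para_list = []
print(json.dumps(finalList))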
...
Output (formatted for readability):
[{"title": "tit1", "paragraph":
[{"context": "con1",
"qas": [{"answers": "text1", "question": "que1", "id": "id1"},
{"answers": "text2", "question": "que2", "id": "id2"}]}]},
{"title": "tit2", "paragraph":
[{"context": "con2",
"qas": [{"answers": "text3", "question": "que3", "id": "id3"},
{"answers": "text4", "question": "que4", "id": "id4"}]},
{"context": "con3",
"qas": [{"answers": "text5", "question": "que5", "id": "id5"}]}]}]