I have a JSON file that looks like this:
{
"question": "yellow skin around wound from cat bite. why?",
"answer": "this may be the secondary result of a resolving bruise but a cat bite is a potentially serious and complicated wound and should be under the care of a physician.",
"tags": [
"wound care"
]
},
{
"question": "yellow skin around wound from cat bite. why?",
"answer": "see your doctor with all deliberate speed. or go to an urgent care center or a hospital emergency room. do it fast!",
"tags": [
"wound care"
]
},
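As you can see, the redundancy is only in the "question" key, while the answers differ from each other. The data was extracted from a forum, so the same question has several different answers. Is there a way to use Python to eliminate the redundant part or to group the answers together? Thanks.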
Answer 0 (score: 2)
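Some grouping is needed. There are many ways to do it, including functions from the itertools module, external modules such as pandas, and others. Here is one way using defaultdict from the standard library's collections module: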
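from collections import defaultdict
import json

# rawdata is assumed to be a string holding the JSON text as a list of objects
data = json.loads(rawdata)

# Group the remaining fields of each row under its question
questions = defaultdict(list)
for row in data:
    question = row.pop('question')
    questions[question].append(row)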
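The result is a dictionary, questions (a defaultdict, to be precise), keyed by question, whose values hold the answers and tags given for that question. One drawback is that this destructively modifies your original parsed JSON data, because pop removes the 'question' key from each row. There are several ways to remedy that, which I will omit for brevity.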
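One possible remedy, offered as a minimal sketch rather than as the method this answer had in mind, is to build a new dictionary for each row instead of popping from it. The helper name group_by_question and the short sample rows are hypothetical and only there for illustration.

```python
from collections import defaultdict


def group_by_question(rows):
    """Group rows by their 'question' field without modifying the input rows."""
    grouped = defaultdict(list)
    for row in rows:
        # Build a new dict that omits 'question' instead of popping it from the row.
        # This is a shallow copy: nested values such as the tags list are shared.
        rest = {key: value for key, value in row.items() if key != "question"}
        grouped[row["question"]].append(rest)
    return dict(grouped)


# Hypothetical sample data in the same shape as the parsed JSON
data = [
    {"question": "q1", "answer": "a1", "tags": ["wound care"]},
    {"question": "q1", "answer": "a2", "tags": ["wound care"]},
]
questions = group_by_question(data)
```

The mapping has the same shape as the questions dictionary described above, while every row in data still contains its 'question' key.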
Here is a simplified version of the resulting questions dictionary:
{'yellow skin ...why?': [{'answer': 'this may be the secondary result of a '
                                    'resolving bruise but a cat bite is a '
                                    'potentially serious and complicated wound '
                                    'and should be under the care of a '
                                    'physician.',
                          'tags': ['wound care']},
                         {'answer': 'see your doctor with all deliberate '
                                    'speed. or go to an urgent care center or '
                                    'a hospital emergency room. do it fast!',
                          'tags': ['wound care']}]}
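The itertools route mentioned at the start of this answer could look roughly like the sketch below. It assumes the rows are already parsed into a list of dicts; the sample rows and the name answers_by_question are made up for illustration. Note that itertools.groupby only merges adjacent items, so the rows are sorted by question first.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows in the same shape as the parsed JSON
rows = [
    {"question": "q1", "answer": "a1", "tags": ["t"]},
    {"question": "q2", "answer": "b1", "tags": ["t"]},
    {"question": "q1", "answer": "a2", "tags": ["t"]},
]

# groupby only groups consecutive items with equal keys, so sort by the key first.
# sorted() is stable, so answers keep their original relative order per question.
get_question = itemgetter("question")
rows_sorted = sorted(rows, key=get_question)

# Map each question to the list of its answers
answers_by_question = {
    question: [row["answer"] for row in group]
    for question, group in groupby(rows_sorted, key=get_question)
}
```

Compared with the defaultdict version, this needs a sort and so costs O(n log n), but it leaves the input rows untouched.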
Answer 1 (score: 0)
You can use pandas here:
from io import StringIO

import pandas as pd

a = '''[{
"question": "yellow skin around wound from cat bite. why?",
"answer": "this may be the secondary result of a resolving bruise but a cat bite is a potentially serious and complicated wound and should be under the care of a physician.",
"tags": [
"wound care"
]
},
{
"question": "yellow skin around wound from cat bite. why?",
"answer": "see your doctor with all deliberate speed. or go to an urgent care center or a hospital emergency room. do it fast!",
"tags": [
"wound care"
]
}]'''

# Recent pandas versions deprecate passing a literal JSON string, so wrap it in StringIO
df = pd.read_json(StringIO(a))
# Collect the answers for each question into a list, keyed by question
df.groupby(['question'])['answer'].apply(list).to_dict()