Question

I have a JSON file that looks like this:

{
  "question": "yellow skin around wound from cat bite. why?",
  "answer": "this may be the secondary result of a resolving bruise but a cat bite is a potentially serious and complicated wound and should be under the care of a physician.",
  "tags": [
    "wound care"
  ]
},
{
  "question": "yellow skin around wound from cat bite. why?",
  "answer": "see your doctor with all deliberate speed. or go to an urgent care center or a hospital emergency room. do it fast!",
  "tags": [
    "wound care"
  ]
},

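As you can see, the redundancy is only in the "question" key, while the answers differ from each other. The data was extracted from a forum, so it holds several different answers to the same question. Is there a way to use Python to eliminate the redundant part, or to group the answers together? Thanks.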

Answer 1

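Some grouping is required. There are many ways to do it, including functions from the itertools module, external modules such as pandas, and other sources. Here is one approach using defaultdict from the standard library's collections module: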

from collections import defaultdict
import json

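# rawdata is assumed to hold the JSON text as one array of records: [{...}, {...}]
data = json.loads(rawdata)
questions = defaultdict(list)
for row in data:
    # Remove the question from the record and use it as the grouping key.
    question = row.pop('question')
    questions[question].append(row)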

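The result is a dictionary, questions (a defaultdict, to be precise), keyed by question, whose values hold the answers and tags given for it. One drawback is that this destructively modifies your originally parsed JSON data, because pop removes the 'question' key from each row. You can remedy this in several ways, which I omit for brevity.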
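One possible remedy for the destructive pop, as a minimal sketch: build a new dict for each record instead of modifying the parsed one. The rawdata string below is an assumption; it is the two records from the question wrapped in a JSON array.

```python
from collections import defaultdict
import json

# Assumed input: the two records from the question, wrapped in [ ] to form valid JSON.
rawdata = '''[
  {"question": "yellow skin around wound from cat bite. why?",
   "answer": "this may be the secondary result of a resolving bruise but a cat bite is a potentially serious and complicated wound and should be under the care of a physician.",
   "tags": ["wound care"]},
  {"question": "yellow skin around wound from cat bite. why?",
   "answer": "see your doctor with all deliberate speed. or go to an urgent care center or a hospital emergency room. do it fast!",
   "tags": ["wound care"]}
]'''

data = json.loads(rawdata)
questions = defaultdict(list)
for row in data:
    # Copy every field except 'question' into a new dict,
    # so the records in `data` are left unchanged.
    rest = {key: value for key, value in row.items() if key != 'question'}
    questions[row['question']].append(rest)

# Serialize the grouped result back to JSON text if needed.
print(json.dumps(questions, indent=2))
```

The trade-off is one extra small dict per record, in exchange for keeping data usable afterwards.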

Here is a simplified view of the resulting questions dictionary:

{'yellow skin ...why?': [{'answer': 'this may be the secondary result of a '
                                    'resolving bruise but a cat bite is a '
                                    'potentially serious and complicated wound '
                                    'and should be under the care of a '
                                    'physician.',
                          'tags': ['wound care']},
                         {'answer': 'see your doctor with all deliberate '
                                    'speed. or go to an urgent care center or '
                                    'a hospital emergency room. do it fast!',
                          'tags': ['wound care']}]}
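For comparison, the itertools route mentioned above could look roughly like the following sketch. itertools.groupby only groups adjacent items, so the records are sorted by question first. The records list is an assumption: the two records from the question plus one made-up record (marked hypothetical) to show a second group.

```python
from itertools import groupby
from operator import itemgetter

records = [
    {"question": "yellow skin around wound from cat bite. why?",
     "answer": "this may be the secondary result of a resolving bruise but a cat bite is a potentially serious and complicated wound and should be under the care of a physician.",
     "tags": ["wound care"]},
    # Hypothetical record, not from the original data, added only to show a second group.
    {"question": "hypothetical other question?",
     "answer": "hypothetical answer",
     "tags": ["example"]},
    {"question": "yellow skin around wound from cat bite. why?",
     "answer": "see your doctor with all deliberate speed. or go to an urgent care center or a hospital emergency room. do it fast!",
     "tags": ["wound care"]},
]

by_question = itemgetter("question")

# groupby only merges consecutive items with equal keys, so sort first.
records_sorted = sorted(records, key=by_question)

grouped = {
    question: [{"answer": r["answer"], "tags": r["tags"]} for r in rows]
    for question, rows in groupby(records_sorted, key=by_question)
}

for question, answers in grouped.items():
    print(question, len(answers))
```

Because sorted is stable, answers to the same question keep their original relative order. The defaultdict version avoids the sort entirely, which is why it is usually the simpler choice here.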

Answer 2

You can use pandas here:

import pandas as pd
a='''[{
  "question": "yellow skin around wound from cat bite. why?",
  "answer": "this may be the secondary result of a resolving bruise but a cat bite is a potentially serious and complicated wound and should be under the care of a physician.",
  "tags": [
    "wound care"
  ]
},
{
  "question": "yellow skin around wound from cat bite. why?",
  "answer": "see your doctor with all deliberate speed. or go to an urgent care center or a hospital emergency room. do it fast!",
  "tags": [
    "wound care"
  ]
}]'''
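# Recent pandas versions deprecate passing literal JSON text to read_json,
# so wrap the string in a file-like StringIO object.
from io import StringIO
df = pd.read_json(StringIO(a))
# Group rows by question and collect each question's answers into a list.
df.groupby(['question'])['answer'].apply(list).to_dict()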

How to eliminate redundancy in a JSON file using Python
