Normalize/flatten a very deeply nested JSON (where names and properties are the same at every level)

Time: 2019-11-06 16:23:11

Tags: json pandas csv normalization flatten

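I am trying to flatten or normalize this very deeply nested JSON into a DataFrame using pandas.

The problem is that the names and properties are the same at every level.

I have not found any similar question for pandas. I did see two similar questions, but they are for R and JavaScript: Normalize deeply nested objects, Normalize deeply nested objects. Maybe you can get some inspiration from them.

My original file is 40 MB, so here is a sample: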

data = [
  {
    "id": "haha",
    "type": "table",
    "composition": [
      {
        "id": "AO",
        "type": "basket",
      },
      {
        "id": "KK",
        "type": "basket",
#         "isAutoDiv": false,
        "composition": [
          {
            "id": "600",
            "type": "apple",
            "num": 1.116066714
          },
          {
            "id": "605",
            "type": "apple",
            "num": 1.1166976714
          }
        ]
      }
    ]
  },
  {
    "id": "hoho",
    "type": "table",
    "composition": [
      {
        "id": "KT",
        "type": "basket"
      },
      {
        "id": "OT",
        "type": "basket"
      },
      {
        "id": "CL",
        "type": "basket",
#         "isAutoDiv": false,
        "composition": [
          {
            "id": "450",
            "type": "apple"
          },
          {
            "id": "630",
            "type": "apple"
          },
          {
            "id": "023",
            "type": "index",
            "composition": [
              {
                "id": "AAOAAOAOO",
                "type": "applejuice"
              },
              {
                "id": "MMNMMNNM",
                "type": "applejuice"
              },
            ]
          }
        ]
      }
    ]
  }
]

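Do you see? The names and properties are the same at every level.

I normalize it with the line below. But I don't know how to normalize objects nested inside other nested objects when they all have the same names and properties: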

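# pandas >= 1.0; on older versions use: from pandas.io.json import json_normalize
from pandas import json_normalize

df = json_normalize(data, record_path=['composition'], meta=['id', 'type'], record_prefix='compo_')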
                                   compo_composition compo_id compo_type id     type
0                                                NaN      AO    basket  haha    table
1   [{'id': '600', 'type': 'apple', 'num': 1.11606...     KK    basket  haha    table
2                                                NaN      KT    basket  hoho    table
3                                                NaN      OT    basket  hoho    table
4   [{'id': '450', 'type': 'apple'}, {'id': '630',...     CL    basket  hoho    table

You can see that the "compo_composition" column still contains nested objects.

Now I would like it to have these columns:

compo_compo_compo__id   compo_compo_compo_type   compo_compo__id   compo_compo_type   compo_id  compo_type  id  type

Thanks a lot. This has been frustrating me for days, and I can't find an answer anywhere.

1 Answer:

Answer 0 (score: 0)

You have to write your own custom parser. Assuming that (a) your JSON is many levels deep and (b) each element type on a path is unique (e.g. table > basket > index, not table > table > basket):

from collections import deque

import pandas as pd

# Work on a queue so the original list is left untouched
queue = deque(data)
compositions = []

while queue:
    item = queue.popleft()

    if 'composition' in item:
        # If a level has children, add that level's `id`
        # to the path and process its children.
        # Copy the parent's path first: siblings must not share one
        # dict, or ids added for one branch would leak into the others.
        path = dict(item.get('path', {}))
        path[item['type'] + '_id'] = item['id']

        children = [
            {'path': path, **child} for child in item.get('composition', [])
        ]
        queue.extend(children)
    else:
        # If a level has no children, it is a leaf and we are done
        compositions.append(item)

The final DataFrame:

df = pd.DataFrame([c['path'] for c in compositions]) \
        .join(pd.DataFrame(compositions)) \
        .drop(columns='path')

Result:

  table_id basket_id index_id         id        type       num
0     haha       NaN      NaN         AO      basket       NaN
1     hoho       NaN      NaN         KT      basket       NaN
2     hoho       NaN      NaN         OT      basket       NaN
3     haha        KK      NaN        600       apple  1.116067
4     haha        KK      NaN        605       apple  1.116698
5     hoho        CL      NaN        450       apple       NaN
6     hoho        CL      NaN        630       apple       NaN
7     hoho        CL      023  AAOAAOAOO  applejuice       NaN
8     hoho        CL      023   MMNMMNNM  applejuice       NaN
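
If assumption (b) does not hold for your data, or you prefer the level-based column names from the question (compo_id, compo_compo_id, ...), a recursive walk is one possible alternative. The following is only a minimal sketch and not part of the answer above: `iter_leaf_rows`, its `child_key` and `prefix` parameters, and the small `sample` list are hypothetical names made up for illustration, and it assumes every node is a dict whose children sit in a list under "composition".

```python
def iter_leaf_rows(nodes, child_key="composition", prefix="compo_", ancestors=()):
    """Yield one flat dict per leaf; each level's fields get one more prefix."""
    for node in nodes:
        # Fields of this node, without the nested child list (hypothetical helper logic)
        own = {k: v for k, v in node.items() if k != child_key}
        children = node.get(child_key) or []
        if children:
            # Not a leaf: remember this level and descend into its children
            yield from iter_leaf_rows(children, child_key, prefix, ancestors + (own,))
        else:
            # Leaf: merge all levels, prefixing keys by their depth from the root
            row = {}
            for depth, fields in enumerate(ancestors + (own,)):
                for key, value in fields.items():
                    row[prefix * depth + key] = value
            yield row


# Tiny illustrative input with the same shape as the data in the question
sample = [
    {
        "id": "haha",
        "type": "table",
        "composition": [
            {"id": "AO", "type": "basket"},
            {
                "id": "KK",
                "type": "basket",
                "composition": [
                    {"id": "600", "type": "apple", "num": 1.116066714},
                ],
            },
        ],
    }
]

rows = list(iter_leaf_rows(sample))
for row in rows:
    print(row)
```

Passing `list(iter_leaf_rows(data))` to `pd.DataFrame` should then give one row per leaf, with NaN in the deeper columns for branches that end early. Because the columns are named by depth rather than by `type`, this variant does not depend on the types along a path being unique.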