Reading nested JSON data into a pandas DataFrame

Date: 2016-06-23 22:04:35

Tags: python json pandas dataframe

I have now added the current problem to GitHub; the repo URL is below. I have also included a Jupyter notebook that explains the problem. Thank you all.

https://github.com/simongraham/dataExplore.git

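I am currently working with nutrition data for a project. The data is in raw JSON format, and I want to use Python and pandas to get a readable DataFrame. I know this is a simple task when the JSON is not nested. In that case I would use: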

nutrition = pd.read_json('data')

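However, my data is nested, and I am finding it hard to get it into a sensible DataFrame. The JSON format is shown below; the nutritionNutrients element is itself a nested element. It describes the content of various nutrients, such as alcohol and bcfa. I have included only one example record because the data file is large.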

  [
        {
            "vcNutritionPortionId": "478d1905-f264-4d9b-ab76-0ed4252193fd",
            "vcNutritionId": "2476378b-79ee-4857-a81d-489661a039a1",
            "vcUserId": "cc51145b-5a70-4344-9b55-1a4455f0a9d2",
            "vcPortionId": "1",
            "vcPortionName": "1 average pepper",
            "vcPortionSize": "20",
            "ftEnergyKcal": 5.2,
            "vcPortionUnit": "g",
            "dtConsumedDate": "2016-05-04T00:00:00",
            "nutritionNutrients": [
                {
                    "vcNutritionPortionId": "478d1905-f264-4d9b-ab76-0ed4252193fd",
                    "vcNutrient": "alcohol",
                    "ftValue": 0,
                    "vcUnit": "g",
                    "nPercentRI": 0,
                    "vcTrafficLight": ""
                },
                {
                    "vcNutritionPortionId": "478d1905-f264-4d9b-ab76-0ed4252193fd",
                    "vcNutrient": "bcfa",
                    "ftValue": 0,
                    "vcUnit": "g",
                    "nPercentRI": 0,
                    "vcTrafficLight": ""
                },
                {
                    "vcNutritionPortionId": "478d1905-f264-4d9b-ab76-0ed4252193fd",
                    "vcNutrient": "biotin",
                    "ftValue": 0,
                    "vcUnit": "µg",
                    "nPercentRI": 0,
                    "vcTrafficLight": ""
                },
                ...
            ]
        }
    ]
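
To illustrate the difficulty, here is a minimal sketch. The sample is trimmed from the record above to two fields and two nutrients, and reading from an in-memory string through `StringIO` is only for the demonstration. `pd.read_json` loads the record, but each cell of `nutritionNutrients` still holds a list of dicts rather than separate columns:

```python
from io import StringIO

import pandas as pd

# Trimmed copy of the sample record (only two nutrients kept).
raw = """
[
    {
        "vcPortionName": "1 average pepper",
        "nutritionNutrients": [
            {"vcNutrient": "alcohol", "ftValue": 0, "vcUnit": "g"},
            {"vcNutrient": "bcfa", "ftValue": 0, "vcUnit": "g"}
        ]
    }
]
"""

nutrition = pd.read_json(StringIO(raw))

# The nested element is not expanded: the cell is a plain Python list.
cell = nutrition.loc[0, "nutritionNutrients"]
print(type(cell), len(cell))
```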

Any help would be greatly appreciated.

Thanks.

.... ....

Now that I have found out how to solve this with json_normalize, I am back with the same problem, but this time my data is nested twice. That is:

[
{
...,
"nutritionPortions": [
    {
        "vcNutritionPortionId": "478d1905-f264-4d9b-ab76-0ed4252193fd",
        "vcNutritionId": "2476378b-79ee-4857-a81d-489661a039a1",
        "vcUserId": "cc51145b-5a70-4344-9b55-1a4455f0a9d2",
        "vcPortionId": "1",
        "vcPortionName": "1 average pepper",
        "vcPortionSize": "20",
        "ftEnergyKcal": 5.2,
        "vcPortionUnit": "g",
        "dtConsumedDate": "2016-05-04T00:00:00",
        "nutritionNutrients": [
            {
                "vcNutritionPortionId": "478d1905-f264-4d9b-ab76-0ed4252193fd",
                "vcNutrient": "alcohol",
                "ftValue": 0,
                "vcUnit": "g",
                "nPercentRI": 0,
                "vcTrafficLight": ""
            },
            {
                "vcNutritionPortionId": "478d1905-f264-4d9b-ab76-0ed4252193fd",
                "vcNutrient": "bcfa",
                "ftValue": 0,
                "vcUnit": "g",
                "nPercentRI": 0,
                "vcTrafficLight": ""
            },
            {
                "vcNutritionPortionId": "478d1905-f264-4d9b-ab76-0ed4252193fd",
                "vcNutrient": "biotin",
                "ftValue": 0,
                "vcUnit": "µg",
                "nPercentRI": 0,
                "vcTrafficLight": ""
            },
            ...
        ]
    }
]
}
]

When I have a JSON that contains only the nutrition data, I can use:

nutrition = (pd.io
   .json
   .json_normalize(data, 'nutritionNutrients',
        ['vcNutritionId','vcUserId','vcPortionId','vcPortionName','vcPortionSize',
         'ftEnergyKcal','vcPortionUnit','dtConsumedDate'])
)

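However, my data does not contain only nutrition information. For example, it also contains activity information, so the nutrition information is nested at the top level under "nutritionPortions". Let us assume that all the other columns are not nested and that they are represented by "Activity" and "Wellbeing".

If I use the code: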

nutrition = (pd.io
   .json
   .json_normalize(data, ['nutritionPortions'])
)

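I am back at the original problem where "nutritionNutrients" is nested, and from there I have not managed to obtain the corresponding DataFrame.

Thanks

1 answer:

Answer 0: (score: 4)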

UPDATE: this works for your kaidoData.json file:

df = (pd.io
        .json
        .json_normalize(data[0]['nutritionPortions'], 'nutritionNutrients',
            ['vcNutritionId','vcUserId','vcPortionId','vcPortionName','vcPortionSize',
             'dtCreatedDate','dtUpdatedDate','nProcessingStatus',
             'vcPortionUnit','dtConsumedDate'
            ]
        )
)
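
Indexing `data[0]` flattens only the first top-level record. As a minimal sketch, assuming pandas 1.0 or later (where the function is exposed as `pd.json_normalize`), `record_path` can also be given as a list of keys so that every top-level record is walked. The sample values below, including the `Activity` value, are made up for illustration and are not taken from kaidoData.json:

```python
import pandas as pd

# Hypothetical sample with the same shape as the question's data:
# top-level records -> "nutritionPortions" -> "nutritionNutrients".
data = [
    {
        "Activity": "walking",  # made-up non-nested field
        "nutritionPortions": [
            {
                "vcPortionName": "1 average pepper",
                "vcPortionSize": "20",
                "nutritionNutrients": [
                    {"vcNutrient": "alcohol", "ftValue": 0, "vcUnit": "g"},
                    {"vcNutrient": "biotin", "ftValue": 0, "vcUnit": "µg"},
                ],
            }
        ],
    }
]

# Walk both nesting levels. A plain string in `meta` is read from the
# top-level record; a path such as ["nutritionPortions", "vcPortionName"]
# is read from each portion.
df = pd.json_normalize(
    data,
    record_path=["nutritionPortions", "nutritionNutrients"],
    meta=[
        "Activity",
        ["nutritionPortions", "vcPortionName"],
        ["nutritionPortions", "vcPortionSize"],
    ],
)
print(df)
```

The portion-level meta columns come out with dotted names such as `nutritionPortions.vcPortionName`, which also keeps them from clashing with identically named keys inside the nutrient records.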

PS: I don't know what is wrong with 'ftEnergyKcal'; it raises:

KeyError: 'ftEnergyKcal'

It is probably missing in some of the portions.

OLD answer:

Use json_normalize():
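
One possible way around this `KeyError`, sketched here under the assumption of pandas 1.0 or later: `json_normalize` accepts `errors='ignore'`, which fills a `meta` key with NaN where it is missing instead of raising. The two portions below are invented for illustration; the second one deliberately lacks `ftEnergyKcal`:

```python
import pandas as pd

# Invented portions: the second has no "ftEnergyKcal" key.
portions = [
    {
        "vcPortionName": "1 average pepper",
        "ftEnergyKcal": 5.2,
        "nutritionNutrients": [{"vcNutrient": "alcohol", "ftValue": 0}],
    },
    {
        "vcPortionName": "portion without energy",
        "nutritionNutrients": [{"vcNutrient": "bcfa", "ftValue": 0}],
    },
]

# errors="ignore" turns a missing meta key into NaN instead of a KeyError.
df = pd.json_normalize(
    portions,
    record_path="nutritionNutrients",
    meta=["vcPortionName", "ftEnergyKcal"],
    errors="ignore",
)
print(df)
```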

(pd.io
   .json
   .json_normalize(l, 'nutritionNutrients',
        ['vcNutritionId','vcUserId','vcPortionId','vcPortionName','vcPortionSize',
         'ftEnergyKcal','vcPortionUnit','dtConsumedDate'])
)

Demo:

In [107]: (pd.io
   .....:    .json
   .....:    .json_normalize(l, 'nutritionNutrients',
   .....:         ['vcNutritionId','vcUserId','vcPortionId','vcPortionName','vcPortionSize',
   .....:          'ftEnergyKcal','vcPortionUnit','dtConsumedDate'])
   .....: )
Out[107]:
   ftValue  nPercentRI vcNutrient vcNutritionPortionId vcTrafficLight        ...        vcPortionSize  \
0        0           0    alcohol  478d1905-f264-4d...                       ...                   20
1        0           0       bcfa  478d1905-f264-4d...                       ...                   20
2        0           0     biotin  478d1905-f264-4d...                       ...                   20

         vcNutritionId vcPortionId ftEnergyKcal     vcPortionName
0  2476378b-79ee-48...           1          5.2  1 average pepper
1  2476378b-79ee-48...           1          5.2  1 average pepper
2  2476378b-79ee-48...           1          5.2  1 average pepper

[3 rows x 14 columns]

where l is your list (the parsed JSON)
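
For completeness, a minimal sketch of how `l` might be produced, assuming pandas 1.0 or later (`pd.json_normalize`). The JSON text is a trimmed copy of the question's sample, held in a string so the example is self-contained; for a file on disk, `json.load` on the open file would play the same role:

```python
import json

import pandas as pd

# Trimmed copy of the question's sample record.
raw = """
[
    {
        "vcPortionName": "1 average pepper",
        "ftEnergyKcal": 5.2,
        "nutritionNutrients": [
            {"vcNutrient": "alcohol", "ftValue": 0, "vcUnit": "g"},
            {"vcNutrient": "bcfa", "ftValue": 0, "vcUnit": "g"},
            {"vcNutrient": "biotin", "ftValue": 0, "vcUnit": "µg"}
        ]
    }
]
"""

# Parse the JSON text into a list of dicts.
l = json.loads(raw)

# One row per nutrient, with the portion-level fields repeated on each row.
df = pd.json_normalize(l, "nutritionNutrients", ["vcPortionName", "ftEnergyKcal"])
print(df)
```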