Converting (flattening) a complex JSON into a DataFrame

Time: 2020-07-15 15:54:46

Tags: python json pandas dataframe

I have a complex, nested JSON that I need to convert into a DataFrame (Python). I can get the first part, but I am struggling with the second.

import json

import pandas as pd  # needed for pd.DataFrame further down
import requests

url = 'url'

headers = {'api-key':'key'}

resp = requests.get(url, headers = headers)
print(resp.status_code)

r = resp.content
r

responses = json.loads(r.decode('utf-8'))
responses

Output (responses):

{'count': 855,
 'requestAt': '2020-07-15T13:13:26.646+00:00',
 'data': {'00b3dc3a-b71e-4547-8910-44691a09cd53': {'registerId': '00b3dc3a-b71e-4547-8910-44691a09cd53',
   'count': 10,
   'milho_germoplasma': {'feedbackScore': 'good',
    'firstVisitAt': '2020-06-11T11:10:42.929-03:00',
    'lastVisitAt': '2020-06-15T15:36:43.027-03:00',
    'videosCompletedAt': '2020-06-11T11:19:58.753-03:00',
    'videosState': [{'completedAt': '2020-06-11T11:19:58.753-03:00',
      'completedCount': 1,
      'duration': 544.811,
      'firstPlayAt': '2020-06-11T11:10:50.170-03:00',
      'percent': 0.281,
      'playCount': 3,
      'seconds': 152.85,
      'updatedAt': '2020-06-15T15:38:13.711-03:00',
      'videoSrc': 'https://vimeo.com/420453289/b7c455699a'}],
    'visitsCount': 3,
    'stationId': 'milho_germoplasma'},
   'milho_plantio': {'feedbackScore': 'good',
    'firstVisitAt': '2020-06-11T10:37:42.509-03:00',
    'lastVisitAt': '2020-06-11T12:28:21.105-03:00',
    'videosCompletedAt': '2020-06-11T10:49:43.082-03:00',
    'videosState': [{'completedAt': '2020-06-11T10:49:43.082-03:00',
      'completedCount': 1,
      'duration': 700.459,
      'firstPlayAt': '2020-06-11T10:37:50.465-03:00',
      'percent': 0.042,
      'playCount': 2,
      'seconds': 29.18,
      'updatedAt': '2020-06-11T10:50:18.717-03:00',
      'videoSrc': 'https://player.vimeo.com/video/412760474'}],
    'visitsCount': 2,
    'stationId': 'milho_plantio'}}}}

I tried adapting some answers from StackOverflow, but I can only get part of it working without errors:

response_list = []
for register_id in responses['data']:

    # keep only the top-level keys of interest
    data = {k: v for k, v in responses['data'][register_id].items() if k in ['registerId', 'count']}

    response_list.append(data)

print(pd.DataFrame(response_list))

Output:

+--------------------------------------+-------+
|             registerId               | count |
+--------------------------------------+-------+
| 00b3dc3a-b71e-4547-8910-44691a09cd53 |    10 |
+--------------------------------------+-------+

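I need to go one level deeper into this JSON and convert it into a DataFrame (each milho_germoplasma / milho_plantio / whatever block should create a new row for the same registerId, using the same inner fields):

Expected output: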

+--------------------------------------+-------+---------------+-------------------------------+----------------------------------+-------------------+
|              registerId              | count | feedbackScore |         firstVisitAt          |           lastVisitAt            |  …(last column)   |
+--------------------------------------+-------+---------------+-------------------------------+----------------------------------+-------------------+
| 00b3dc3a-b71e-4547-8910-44691a09cd53 |    10 | good          | 2020-06-11T11:10:42.929-03:00 | 2020-06-15T15:36:43.027-03:00    | milho_germoplasma |
| 00b3dc3a-b71e-4547-8910-44691a09cd53 |    10 | good          | 2020-06-11T10:37:42.509-03:00 | 2020-06-11T12:28:21.105-03:00    | milho_plantio     |
+--------------------------------------+-------+---------------+-------------------------------+----------------------------------+-------------------+

1 answer:

Answer 0: (score: 1)

Unpacking nested JSON is not trivial; in the general case you can use a recursive approach to solve it.
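As a rough illustration of what such a recursive approach might look like, here is a minimal sketch. The helper names `flatten` and `station_rows` are hypothetical (they are not part of the original answer), and the sample is a trimmed-down version of the JSON in the question with a second, made-up `videosState` entry to show how lists expand into rows.

```python
def flatten(obj, prefix=""):
    """Recursively flatten nested dicts/lists into a list of flat row dicts."""
    if isinstance(obj, list):
        # Each list element contributes its own row(s); an empty list adds no columns.
        return [row for item in obj for row in flatten(item, prefix)] or [{}]
    if not isinstance(obj, dict):
        # Scalar leaf: a single row with a single column.
        return [{prefix: obj}]
    rows = [{}]
    for key, value in obj.items():
        name = f"{prefix}_{key}" if prefix else key
        # Combine every row built so far with every row produced by the child.
        rows = [{**row, **child} for row in rows for child in flatten(value, name)]
    return rows


def station_rows(responses):
    """Hypothetical helper: one flat row per register, station and video entry."""
    rows = []
    for register in responses["data"].values():
        # Top-level scalars (registerId, count) are repeated on every row.
        shared = {k: v for k, v in register.items() if not isinstance(v, dict)}
        for value in register.values():
            if isinstance(value, dict):  # assumption: station blocks are the dict values
                rows.extend({**shared, **row} for row in flatten(value))
    return rows


# Trimmed sample in the shape of the question's JSON (second video entry is invented).
sample = {"data": {"00b3dc3a": {
    "registerId": "00b3dc3a",
    "count": 10,
    "milho_plantio": {
        "feedbackScore": "good",
        "videosState": [{"playCount": 2}, {"playCount": 5}],
        "stationId": "milho_plantio",
    },
}}}

rows = station_rows(sample)
print(rows)
```

The resulting list of dicts can be passed straight to `pd.DataFrame(rows)`. Unlike the fixed-structure version, this sketch does not care how deep the nesting goes, at the cost of being harder to follow.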

If you have a fixed JSON structure (like the one you showed), here is a simpler approach.

import pandas as pd

def unpack(data):
    # Flatten one station record into a dict of column name -> list of values.
    f = {}
    for k, v in data.items():
        if isinstance(v, (int, float, str)):
            # Scalar field: becomes a one-element column.
            f.setdefault(k, []).append(v)
        elif isinstance(v, (list, tuple)):
            # List of dicts (e.g. videosState): prefix inner keys with the outer key.
            for ele in v:
                if isinstance(ele, dict):
                    for k2, val in ele.items():
                        f.setdefault(f'{k}_{k2}', []).append(val)
    return f

dfs = []  # created once, so frames from every register id are kept
for _id, data in responses['data'].items():
    data = dict(data)  # shallow copy, so the pops below do not modify responses
    registerID = data.pop('registerId', None)
    count = data.pop('count', 0)
    # Every remaining key is a station block (milho_germoplasma, milho_plantio, ...).
    for specie in data:
        f = unpack(data[specie])
        aux_df = pd.DataFrame(f)
        aux_df['registerID'] = registerID
        aux_df['count'] = count
        dfs.append(aux_df)

df = pd.concat(dfs)
print(df)

Result:

  feedbackScore                   firstVisitAt                    lastVisitAt  \
0          good  2020-06-11T11:10:42.929-03:00  2020-06-15T15:36:43.027-03:00   
0          good  2020-06-11T10:37:42.509-03:00  2020-06-11T12:28:21.105-03:00   

               videosCompletedAt        videosState_completedAt  \
0  2020-06-11T11:19:58.753-03:00  2020-06-11T11:19:58.753-03:00   
0  2020-06-11T10:49:43.082-03:00  2020-06-11T10:49:43.082-03:00   

   videosState_completedCount  videosState_duration  \
0                           1               544.811   
0                           1               700.459   

         videosState_firstPlayAt  videosState_percent  videosState_playCount  \
0  2020-06-11T11:10:50.170-03:00                0.281                      3   
0  2020-06-11T10:37:50.465-03:00                0.042                      2   

   videosState_seconds          videosState_updatedAt  \
0               152.85  2020-06-15T15:38:13.711-03:00   
0                29.18  2020-06-11T10:50:18.717-03:00   

                       videosState_videoSrc  visitsCount          stationId  \
0    https://vimeo.com/420453289/b7c455699a            3  milho_germoplasma   
0  https://player.vimeo.com/video/412760474            2      milho_plantio   

                             registerID  count  
0  00b3dc3a-b71e-4547-8910-44691a09cd53     10  
0  00b3dc3a-b71e-4547-8910-44691a09cd53     10
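
For comparison, a similar table could probably be built with `pd.json_normalize`, using its `record_path`, `meta`, `record_prefix` and `errors` parameters. The sketch below is not part of the original answer: the function name `stations_to_frame` is hypothetical, it assumes pandas 1.0 or newer (where `json_normalize` is available at the top level), and it assumes every station block has a `videosState` list. The sample is a trimmed version of the JSON from the question.

```python
import pandas as pd


def stations_to_frame(responses):
    """Hypothetical helper: one row per register, station and videosState entry."""
    records = []
    for register in responses["data"].values():
        for value in register.values():
            if isinstance(value, dict):  # assumption: station blocks are the dict values
                records.append({
                    "registerId": register.get("registerId"),
                    "count": register.get("count", 0),
                    **value,
                })
    if not records:
        return pd.DataFrame()
    # Every key except the nested list is carried along as metadata.
    meta = [key for key in records[0] if key != "videosState"]
    return pd.json_normalize(
        records,
        record_path="videosState",      # expand each video entry into its own row
        meta=meta,                      # repeat the station-level fields on each row
        record_prefix="videosState_",   # same column naming as the loop-based version
        errors="ignore",                # tolerate stations missing a metadata key
    )


# Trimmed sample in the shape of the question's JSON.
sample = {"data": {"00b3dc3a-b71e-4547-8910-44691a09cd53": {
    "registerId": "00b3dc3a-b71e-4547-8910-44691a09cd53",
    "count": 10,
    "milho_germoplasma": {
        "feedbackScore": "good",
        "videosState": [{"playCount": 3, "seconds": 152.85}],
        "visitsCount": 3,
        "stationId": "milho_germoplasma",
    },
    "milho_plantio": {
        "feedbackScore": "good",
        "videosState": [{"playCount": 2, "seconds": 29.18}],
        "visitsCount": 2,
        "stationId": "milho_plantio",
    },
}}}

df = stations_to_frame(sample)
print(df)
```

One difference worth noting: a station whose `videosState` list is empty produces no rows here, so this variant only fits data where each station has at least one video entry.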