I have a complex, nested JSON that I need to convert to a DataFrame (Python). I can get the first part, but I am struggling with the second.
import json
import requests
import pandas as pd  # needed for pd.DataFrame further down

url = 'url'
headers = {'api-key': 'key'}
resp = requests.get(url, headers=headers)
print(resp.status_code)
# decode the raw bytes and parse them into a Python dict
r = resp.content
responses = json.loads(r.decode('utf-8'))
responses
Output (responses):
{'count': 855,
'requestAt': '2020-07-15T13:13:26.646+00:00',
'data': {'00b3dc3a-b71e-4547-8910-44691a09cd53': {'registerId': '00b3dc3a-b71e-4547-8910-44691a09cd53',
'count': 10,
'milho_germoplasma': {'feedbackScore': 'good',
'firstVisitAt': '2020-06-11T11:10:42.929-03:00',
'lastVisitAt': '2020-06-15T15:36:43.027-03:00',
'videosCompletedAt': '2020-06-11T11:19:58.753-03:00',
'videosState': [{'completedAt': '2020-06-11T11:19:58.753-03:00',
'completedCount': 1,
'duration': 544.811,
'firstPlayAt': '2020-06-11T11:10:50.170-03:00',
'percent': 0.281,
'playCount': 3,
'seconds': 152.85,
'updatedAt': '2020-06-15T15:38:13.711-03:00',
'videoSrc': 'https://vimeo.com/420453289/b7c455699a'}],
'visitsCount': 3,
'stationId': 'milho_germoplasma'},
'milho_plantio': {'feedbackScore': 'good',
'firstVisitAt': '2020-06-11T10:37:42.509-03:00',
'lastVisitAt': '2020-06-11T12:28:21.105-03:00',
'videosCompletedAt': '2020-06-11T10:49:43.082-03:00',
'videosState': [{'completedAt': '2020-06-11T10:49:43.082-03:00',
'completedCount': 1,
'duration': 700.459,
'firstPlayAt': '2020-06-11T10:37:50.465-03:00',
'percent': 0.042,
'playCount': 2,
'seconds': 29.18,
'updatedAt': '2020-06-11T10:50:18.717-03:00',
'videoSrc': 'https://player.vimeo.com/video/412760474'}],
'visitsCount': 2,
'stationId': 'milho_plantio'}}}}
I tried adapting some answers from StackOverflow, but I could only get part of it working without errors:
response_list = []
for id in responses['data']:
    # get the keys of interest
    data = {k: v for k, v in responses['data'][id].items() if k in ['registerId', 'count']}
    response_list.append({**data})
print(pd.DataFrame(response_list))
Output:
+--------------------------------------+-------+
| registerId | count |
+--------------------------------------+-------+
| 00b3dc3a-b71e-4547-8910-44691a09cd53 | 10 |
+--------------------------------------+-------+
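I need to go one level deeper into this JSON and convert that to a DataFrame as well (each milho_germoplasma / milho_plantio / whatever key should create a new row for the same registerId, using its inner data):
Expected output: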
+--------------------------------------+-------+---------------+-------------------------------+----------------------------------+-------------------+
| registerId | count | feedbackScore | firstVisitAt | lastVisitAt | …(last column) |
+--------------------------------------+-------+---------------+-------------------------------+----------------------------------+-------------------+
| 00b3dc3a-b71e-4547-8910-44691a09cd53 | 10 | good | 2020-06-11T11:10:42.929-03:00 | 2020-06-15T15:36:43.027-03:00 | milho_germoplasma |
| 00b3dc3a-b71e-4547-8910-44691a09cd53 | 10 | good | 2020-06-11T10:37:42.509-03:00 | 2020-06-11T12:28:21.105-03:00 | milho_plantio |
+--------------------------------------+-------+---------------+-------------------------------+----------------------------------+-------------------+
Answer 0 (score: 1)
Unpacking nested JSON is not trivial; in the general case you can use a recursive approach.
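For reference, a recursive approach could look roughly like the sketch below. The `flatten` helper and its key-naming scheme (parent key, then list index, joined by underscores) are illustrative assumptions, not part of the original answer, and the `station` dict is a trimmed copy of one station from the question.

```python
def flatten(obj, prefix=""):
    """Recursively flatten nested dicts/lists into one single-level dict."""
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            name = f"{prefix}_{key}" if prefix else key
            flat.update(flatten(value, name))
    elif isinstance(obj, (list, tuple)):
        # list items get their position in the key so they cannot collide
        for i, item in enumerate(obj):
            name = f"{prefix}_{i}" if prefix else str(i)
            flat.update(flatten(item, name))
    else:
        # scalar leaf: store it under the accumulated key
        flat[prefix] = obj
    return flat

# trimmed copy of one station from the question
station = {
    'feedbackScore': 'good',
    'videosState': [{'completedCount': 1, 'duration': 544.811}],
    'visitsCount': 3,
    'stationId': 'milho_germoplasma',
}
row = flatten(station)
print(row)
```

Each flattened dict is one row, so a list of them can be passed straight to pd.DataFrame. Note that empty dicts and empty lists simply produce no keys in this sketch.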
If you have a fixed JSON structure (as you've shown), the code below is a simpler approach.
import pandas as pd

def unpack(data):
    # flatten one station dict into {column name: [values]}
    f = {}
    for k, v in data.items():
        if isinstance(v, (int, float, str)):
            # scalar field: keep it under its own key
            if k in f:
                f[k].append(v)
            else:
                f[k] = [v]
        elif isinstance(v, (list, tuple)):
            # list of dicts (e.g. videosState): prefix inner keys with the parent key
            for ele in v:
                if isinstance(ele, dict):
                    for k2, val in ele.items():
                        key = f'{k}_{k2}'
                        if key in f:
                            f[key].append(val)
                        else:
                            f[key] = [val]
    return f

# `responses` is the parsed JSON from the question
dfs = []  # collected across all register ids
for _id in responses['data']:
    data = responses['data'].get(_id)
    # pop the top-level fields so that only the station dicts remain in `data`
    # (note: this mutates `responses`)
    registerID = data.pop('registerId', None)
    count = data.pop('count', 0)
    for specie in data.keys():
        f = unpack(data.get(specie))
        aux_df = pd.DataFrame(f)
        aux_df['registerID'] = registerID
        aux_df['count'] = count
        dfs.append(aux_df)
df = pd.concat(dfs)
print(df)
Result:
feedbackScore firstVisitAt lastVisitAt \
0 good 2020-06-11T11:10:42.929-03:00 2020-06-15T15:36:43.027-03:00
0 good 2020-06-11T10:37:42.509-03:00 2020-06-11T12:28:21.105-03:00
videosCompletedAt videosState_completedAt \
0 2020-06-11T11:19:58.753-03:00 2020-06-11T11:19:58.753-03:00
0 2020-06-11T10:49:43.082-03:00 2020-06-11T10:49:43.082-03:00
videosState_completedCount videosState_duration \
0 1 544.811
0 1 700.459
videosState_firstPlayAt videosState_percent videosState_playCount \
0 2020-06-11T11:10:50.170-03:00 0.281 3
0 2020-06-11T10:37:50.465-03:00 0.042 2
videosState_seconds videosState_updatedAt \
0 152.85 2020-06-15T15:38:13.711-03:00
0 29.18 2020-06-11T10:50:18.717-03:00
videosState_videoSrc visitsCount stationId \
0 https://vimeo.com/420453289/b7c455699a 3 milho_germoplasma
0 https://player.vimeo.com/video/412760474 2 milho_plantio
registerID count
0 00b3dc3a-b71e-4547-8910-44691a09cd53 10
0 00b3dc3a-b71e-4547-8910-44691a09cd53 10
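As a possible alternative to the hand-written unpacking, pandas' own pd.json_normalize (available since pandas 1.0) can expand the videosState list and carry the station-level fields along as metadata. The sketch below is not from the original answer; it uses a trimmed copy of the response from the question and assumes every station has a videosState list and the same set of keys. The names `records` and `meta` are illustrative.

```python
import pandas as pd

# Trimmed copy of the response from the question (fewer fields per station).
responses = {
    'count': 855,
    'data': {
        '00b3dc3a-b71e-4547-8910-44691a09cd53': {
            'registerId': '00b3dc3a-b71e-4547-8910-44691a09cd53',
            'count': 10,
            'milho_germoplasma': {
                'feedbackScore': 'good',
                'videosState': [{'playCount': 3, 'seconds': 152.85}],
                'visitsCount': 3,
                'stationId': 'milho_germoplasma',
            },
            'milho_plantio': {
                'feedbackScore': 'good',
                'videosState': [{'playCount': 2, 'seconds': 29.18}],
                'visitsCount': 2,
                'stationId': 'milho_plantio',
            },
        }
    },
}

# One record per station, carrying the parent's registerId and count.
records = [
    {'registerId': register.get('registerId'),
     'count': register.get('count'),
     **station}
    for register in responses['data'].values()
    for station in register.values()
    if isinstance(station, dict)  # skips the scalar registerId / count entries
]

# Station-level fields: everything except the video list (assumes uniform keys).
meta = [key for key in records[0] if key != 'videosState']

# One row per video; station fields are repeated on every row.
df = pd.json_normalize(
    records,
    record_path='videosState',
    meta=meta,
    record_prefix='videosState_',
    errors='ignore',  # fill missing meta keys with NaN instead of raising
)
print(df[['registerId', 'count', 'feedbackScore', 'stationId', 'videosState_playCount']])
```

Unlike the unpack function above, this yields one row per video, so a station with several entries in videosState becomes several rows rather than causing a column-length mismatch. A station with an empty videosState list produces no rows at all, which may or may not be what you want.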