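How do I read deeply nested JSON into a DataFrame?

Asked: 2020-05-05 14:12:08

Tags: python json dataframe

I have a lot of JSON requests that I want to visualize. The requests are stored in .blob files. The problem is that they are deeply nested, and I cannot work out an efficient piece of code that writes all the data into a DataFrame.

Here is my current code. It works, but it is not efficient.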

import glob
import json
import os

import pandas as pd

path_to_blob = '/mnt/data/'
read_files = glob.iglob(os.path.join(path_to_blob, "**/*.blob"), recursive=True)

frames = []
for file in read_files:
    # Each .blob file holds one JSON document per line
    with open(file, encoding="utf8") as fh:
        data = [json.loads(line) for line in fh]

    all_data = pd.json_normalize(data)
    request_data = pd.json_normalize(data, record_path=['request'])
    # Put the flattened request columns next to the top-level columns
    dataset = pd.concat([request_data, all_data], axis=1)

    frames.append(dataset)
dataframe = pd.concat(frames)

Here is one of the requests:

{"request":[{"id":"12345678","name":"GET navigation/Index","count":123,"responseCode":123,"success":true,"url":"http://server1.test.com/12345678","urlData":{"base":"/navigation/123456","host":"server1.test.com","hashTag":"","protocol":"http"},"durationMetric":{"value":12345.0,"count":123.0,"min":12345.0,"max":12345.0,"stdDev":0.0,"sampledValue":12345.0}}],"internal":{"data":{"id":"12345678","documentVersion":"123.0"}},"context":{"data":{"eventTime":"2020-5-5","isSynthetic":false,"samplingRate":123.0},"cloud":{},"device":{"type":"PC","roleName":"ROLENAME","roleInstance":"SERVERNAME","screenResolution":{}},"session":{"isFirst":false},"operation":{"id":"12345678=","parentId":"12345678=","name":"GET navigation/url"},"location":{"clientip":"0.0.0.0","continent":"Europe","country":"Netherlands"},"custom":{"dimensions":[{"_MS.ProcessedByMetricExtractors":"(Name:'Requests', Ver:'123.0')"},{"InstanceKey":"12345678"}]}}}

I recently read about dask, and using it seems sensible because the dataset is 1.2 TB. Can someone tell me how to get this nested JSON request into a DataFrame?

Thanks!

1 answer:

Answer 0 (score: 0)

Python's standard library is easily overlooked for this problem. What you are looking for is

json.loads()

This code:

import json
from pprint import pprint

with open("test.json", "r") as rf:
    jx = rf.read()
    jx = json.loads(jx)
pprint(jx)

gives you back a dictionary:

{'context': {'cloud': {},
             'custom': {'dimensions': [{'_MS.ProcessedByMetricExtractors': "(Name:'Requests', "
                                                                           "Ver:'123.0')"},
                                       {'InstanceKey': '12345678'}]},
             'data': {'eventTime': '2020-5-5',
                      'isSynthetic': False,
                      'samplingRate': 123.0},
             'device': {'roleInstance': 'SERVERNAME',
                        'roleName': 'ROLENAME',
                        'screenResolution': {},
                        'type': 'PC'},
             'location': {'clientip': '0.0.0.0',
                          'continent': 'Europe',
                          'country': 'Netherlands'},
             'operation': {'id': '12345678=',
                           'name': 'GET navigation/url',
                           'parentId': '12345678='},
             'session': {'isFirst': False}},
 'internal': {'data': {'documentVersion': '123.0', 'id': '12345678'}},
 'request': [{'count': 123,
              'durationMetric': {'count': 123.0,
                                 'max': 12345.0,
                                 'min': 12345.0,
                                 'sampledValue': 12345.0,
                                 'stdDev': 0.0,
                                 'value': 12345.0},
              'id': '12345678',
              'name': 'GET navigation/Index',
              'responseCode': 123,
              'success': True,
              'url': 'http://server1.test.com/12345678',
              'urlData': {'base': '/navigation/123456',
                          'hashTag': '',
                          'host': 'server1.test.com',
                          'protocol': 'http'}}]}
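
The dictionary above is still nested, so one more step is needed to get a flat table. The following is a minimal sketch of one way to do that with pd.json_normalize, assuming pandas 1.0 or later and assuming every line in a .blob file is one JSON document (as the question's code implies). The helper names iter_records and requests_to_frame, the choice of meta fields, and the shortened sample record are illustrative and not part of the original post.

```python
import io
import json

import pandas as pd


def iter_records(lines):
    """Yield one parsed JSON object per non-empty line (hypothetical helper)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def requests_to_frame(records):
    """Return one row per entry in 'request', plus selected context fields."""
    return pd.json_normalize(
        list(records),
        record_path="request",  # explode the list stored under "request"
        meta=[  # nested paths copied onto every request row
            ["internal", "data", "documentVersion"],
            ["context", "data", "eventTime"],
            ["context", "device", "type"],
            ["context", "location", "country"],
        ],
        sep=".",
    )


# Shortened copy of the sample request from the question, used as test input.
sample = {
    "request": [{
        "id": "12345678",
        "name": "GET navigation/Index",
        "responseCode": 123,
        "success": True,
        "urlData": {"base": "/navigation/123456", "host": "server1.test.com"},
        "durationMetric": {"value": 12345.0, "count": 123.0},
    }],
    "internal": {"data": {"id": "12345678", "documentVersion": "123.0"}},
    "context": {
        "data": {"eventTime": "2020-5-5", "isSynthetic": False},
        "device": {"type": "PC"},
        "location": {"country": "Netherlands"},
    },
}

# Simulate a line-delimited file; a real run would pass an open file handle.
fake_file = io.StringIO(json.dumps(sample) + "\n\n")
df = requests_to_frame(iter_records(fake_file))
print(df.T)
```

Nested dictionaries inside each request, such as urlData and durationMetric, become dotted column names like urlData.host. Listing the meta paths explicitly keeps the frame narrow, which matters at 1.2 TB; a key missing from a record raises a KeyError unless errors="ignore" is passed. The same per-line flattening could in principle be mapped over a dask bag to spread the work across files, but that is not shown or tested here.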