How do I parse this nested JSON object?

Time: 2019-04-07 06:28:56

Tags: python json pandas parsing

I have a dataset whose format looks like this:

[{'session_id': ['X061RFWB06K9V'],
  'unix_timestamp': [1442503708],
  'cities': ['New York NY, Newark NJ'],
  'user': [[{'user_id': 2024,
     'joining_date': '2015-03-22',
     'country': 'UK'}]]},
 {'session_id': ['5AZ2X2A9BHH5U'],
  'unix_timestamp': [1441353991],
  'cities': ['New York NY, Jersey City NJ, Philadelphia PA'],
  'user': [[{'user_id': 2853,
     'joining_date': '2015-03-28',
     'country': 'DE'}]]},
 {'session_id': ['SHTB4IYAX4PX6'],
  'unix_timestamp': [1440843490],
  'cities': ['San Antonio TX'],
  'user': [[{'user_id': 10958,
     'joining_date': '2015-03-06',
     'country': 'UK'}]]}]

I am using pandas to process it. When I use read_json, I get the following:

          cities                  session_id    unix_timestamp                  user
0   [New York NY, Newark NJ]    [X061RFWB06K9V] [1442503708]    [[{'user_id': 2024, 'joining_date': '2015-03-2...
1   [New York NY, Jersey City NJ, Philadelphia PA]  [5AZ2X2A9BHH5U] [1441353991]    [[{'user_id': 2853, 'joining_date': '2015-03-2...
2   [San Antonio TX]    [SHTB4IYAX4PX6] [1440843490]    [[{'user_id': 10958, 'joining_date': '2015-03-...

How can I process this data so that it ends up in a cleaner format? Here is the data definition:

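Columns:

  • session_id: the ID of the session.
  • unix_timestamp: the Unix timestamp of the session start time
  • cities: the unique cities searched for within the same session
  • user
    • user_id: the ID of the user
    • joining_date: when the user created their account
    • country: where the user is based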

I tried using json_normalize, but I keep getting this error:

AttributeError: 'int' object has no attribute 'values'

as well as other kinds of errors. Please help.

1 answer:

Answer 0: (score: 1)

You can use a function that flattens the structure completely, then rebuild the DataFrame from the flattened keys:

import re
import pandas as pd
import numpy as np

jsonData = [{'session_id': ['X061RFWB06K9V'],
  'unix_timestamp': [1442503708],
  'cities': ['New York NY, Newark NJ'],
  'user': [[{'user_id': 2024,
     'joining_date': '2015-03-22',
     'country': 'UK'}]]},
 {'session_id': ['5AZ2X2A9BHH5U'],
  'unix_timestamp': [1441353991],
  'cities': ['New York NY, Jersey City NJ, Philadelphia PA'],
  'user': [[{'user_id': 2853,
     'joining_date': '2015-03-28',
     'country': 'DE'}]]},
 {'session_id': ['SHTB4IYAX4PX6'],
  'unix_timestamp': [1440843490],
  'cities': ['San Antonio TX'],
  'user': [[{'user_id': 10958,
     'joining_date': '2015-03-06',
     'country': 'UK'}]]} ]



def flatten_json(y):
    out = {}
    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x
    flatten(y)
    return out

flat = flatten_json(jsonData)


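# Flattened keys look like '0_session_id_0' or '0_user_0_0_user_id':
# the leading number is the record (row) index, the rest is the column path.
results = pd.DataFrame()
columns_list = list(flat.keys())
for item in columns_list:
    # First run of digits followed by '_' is the row index
    row_idx = re.findall(r'(\d+)\_', item )[0]
    # Drop the row index prefix, then the '_0' markers left by single-element lists
    column = item.replace(row_idx+'_', '',1)
    column = column.replace('_0', '')
    row_idx = int(row_idx)
    value = flat[item]

    # Setting with enlargement: creates the row/column if it does not exist yet
    results.loc[row_idx, column] = value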

# If you don't want to expand/split the `cities` column, remove line below
results = results.join(results['cities'].str.split(',', expand=True).add_prefix('cities_').fillna(np.nan))

print (results)

Output:

print (results.to_string())
      session_id  unix_timestamp                                        cities  user_user_id user_joining_date user_country        cities_0         cities_1          cities_2
0  X061RFWB06K9V    1.442504e+09                        New York NY, Newark NJ        2024.0        2015-03-22           UK     New York NY        Newark NJ               NaN
1  5AZ2X2A9BHH5U    1.441354e+09  New York NY, Jersey City NJ, Philadelphia PA        2853.0        2015-03-28           DE     New York NY   Jersey City NJ   Philadelphia PA
2  SHTB4IYAX4PX6    1.440843e+09                                San Antonio TX       10958.0        2015-03-06           UK  San Antonio TX              NaN               NaN
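
Note that in the output above the integer columns (unix_timestamp, user_user_id) come back as floats, because the frame is filled cell by cell. As an alternative, here is a minimal sketch that unwraps the single-element lists directly and builds the DataFrame in one step, which keeps the integer dtypes. It assumes every wrapper list holds exactly one element, as in the sample data; the names unwrap, records, rows and session_start are illustrative and do not come from the original answer.

```python
import pandas as pd

# Same three records as in the question.
records = [
    {'session_id': ['X061RFWB06K9V'],
     'unix_timestamp': [1442503708],
     'cities': ['New York NY, Newark NJ'],
     'user': [[{'user_id': 2024, 'joining_date': '2015-03-22', 'country': 'UK'}]]},
    {'session_id': ['5AZ2X2A9BHH5U'],
     'unix_timestamp': [1441353991],
     'cities': ['New York NY, Jersey City NJ, Philadelphia PA'],
     'user': [[{'user_id': 2853, 'joining_date': '2015-03-28', 'country': 'DE'}]]},
    {'session_id': ['SHTB4IYAX4PX6'],
     'unix_timestamp': [1440843490],
     'cities': ['San Antonio TX'],
     'user': [[{'user_id': 10958, 'joining_date': '2015-03-06', 'country': 'UK'}]]},
]


def unwrap(value):
    """Strip single-element list wrappers, e.g. [[x]] -> x (hypothetical helper)."""
    while isinstance(value, list) and len(value) == 1:
        value = value[0]
    return value


rows = []
for record in records:
    row = {}
    for key, value in record.items():
        value = unwrap(value)
        if isinstance(value, dict):
            # Promote nested fields to top-level columns: user -> user_user_id, ...
            for sub_key, sub_value in value.items():
                row[f"{key}_{sub_key}"] = sub_value
        else:
            row[key] = value
    rows.append(row)

df = pd.DataFrame(rows)

# Optional clean-up, assuming the timestamps are in seconds and
# the cities string is always separated by ", ".
df["session_start"] = pd.to_datetime(df["unix_timestamp"], unit="s")
df["user_joining_date"] = pd.to_datetime(df["user_joining_date"])
df["cities"] = df["cities"].str.split(", ")

print(df.dtypes)
print(df.to_string())
```

Splitting on ", " rather than "," avoids the leading space that would otherwise be left on every city after the first. If a record can contain more than one user or more than one cities string, the unwrap step no longer applies as written and the records would need to be expanded into several rows instead.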