Converting a Pandas DataFrame to nested JSON

Asked: 2018-09-28 18:48:08

Tags: python pandas split-apply-combine

I have a pandas DataFrame in the following format:

# df name: cust_sim_data_product_agg:
yearmo  region  products    revenue
0   201711  CN  ['Auto', 'Flood', 'Home', 'Liability', 'Life',...   690
1   201711  CN  ['Auto', 'Flood', 'Home', 'Liability', 'Life']  610
2   201711  CN  ['Auto', 'Flood', 'Home', 'Liability']  560
3   201711  CN  ['Auto', 'Flood', 'Home', 'Life', 'Liability',...   690
4   201711  CN  ['Auto', 'Flood', 'Home', 'Life', 'Mortgage', ...   690

I would like to aggregate it into nested JSON of the following form:

[
  {
    yearmo: '201711',
    data: [
      {
        name: 'SE',
        value: 18090, # sum of all the values in the level below
        children: [
          {
            name: "['Auto', 'Flood', 'Home',...]", # this is products from the dataframe
            value: 690 # this is the revenue value
          },
          {
            name: "['Flood', 'Home', 'Life'...]",
            value: 690
          },
          ...
        ]
      },
      {
        name: 'NE',
        value: 16500, # sum of all the values in the level below
        children: [
          {
            name: "['Auto', 'Home',...]",
            value: 210
          },
          {
            name: "['Life'...]",
            value: 450
          },
          ...
        ]
      }
    ]
  },
  {
    yearmo: '201712',
    data: [
      {
        name: 'SE',
        value: 24050,
        children: [ ... ] # same format as above
      },
      {
        name: 'NE',
        value: 22400,
        children: [ ... ] # same format as above
      }
    ]
  }
]

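So each yearmo gets one element at the top level of the JSON. Within data, there is one entry per region, whose value is the sum of the values at the level directly below it. children is a list of dicts, each mapping the row-level data from the pandas DataFrame: products -> name and revenue -> value.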
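One way to produce this shape is sketched below with two nested groupby loops instead of apply(). This is only an illustration under assumptions: the sample rows, the helper name to_nested, and the choice to keep products as plain strings are all made up for the example and are not taken from the real data.

```python
import json
import pandas as pd

# Hypothetical sample data; the product lists are stored as plain strings here.
df = pd.DataFrame({
    "yearmo": [201711, 201711, 201711, 201712],
    "region": ["SE", "SE", "NE", "SE"],
    "products": ["['Auto', 'Home']", "['Life']", "['Auto']", "['Flood']"],
    "revenue": [690, 610, 210, 450],
})

def to_nested(frame):
    """Build one dict per yearmo, each holding one entry per region."""
    result = []
    for yearmo, by_month in frame.groupby("yearmo"):
        regions = []
        for region, by_region in by_month.groupby("region"):
            # Row level: products -> name, revenue -> value.
            children = [
                {"name": str(p), "value": int(r)}
                for p, r in zip(by_region["products"], by_region["revenue"])
            ]
            regions.append({
                "name": region,
                # Region level: sum of the revenue of the rows below it.
                "value": int(by_region["revenue"].sum()),
                "children": children,
            })
        result.append({"yearmo": str(yearmo), "data": regions})
    return result

nested = to_nested(df)
print(json.dumps(nested, indent=2))
```

The int() and str() casts turn NumPy scalars into built-in types, which json.dumps needs. Note that groupby sorts its keys by default, so regions come out alphabetically unless sort=False is passed.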

My best attempt so far is this:

def roll_yearmo_rev(d):
    x1 = [{'name': n, 'value': v}  for n,v in zip(d.products, d.revenue)]
    x2 = {'children': x1, 'value': sum(d.revenue)}
    return x2

def roll_yearmo(d):
    x1 = [{'name': n, 'children': c} for n,c in zip(d.region, d.children)]
    x2 = {'children': x1, 'value': sum(d.value)}
    return x2

cust_sim_data_product_agg_dict = cust_sim_data_product_agg.groupby(['yearmo', 'region'])\
    .apply(roll_yearmo_rev)
cust_sim_data_product_agg_dict = cust_sim_data_product_agg_dict.reset_index()
cust_sim_data_product_agg_dict.columns = ['yearmo' , 'region', 'children']


cust_sim_data_product_agg_dict = cust_sim_data_product_agg_dict.groupby(['yearmo'])\
    .apply(roll_yearmo)
cust_sim_data_product_agg_dict = cust_sim_data_product_agg_dict.reset_index()

This fails, because the last roll-up raises the following error:

AttributeError: 'DataFrame' object has no attribute 'value'
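A likely reading of this error: after the first roll-up, reset_index() and the column rename, the frame only has the columns yearmo, region and children. The per-region subtotal is stored inside each dict in the children column under the key 'value', so there is no value column for d.value to find. The sketch below shows a second roll-up that reads the subtotal from those dicts. The frame rolled and the function roll_yearmo_fixed are hypothetical stand-ins for the intermediate result, not the original data.

```python
import pandas as pd

# Hypothetical intermediate frame, shaped like the result of the first
# roll-up after reset_index() and the column rename.
rolled = pd.DataFrame({
    "yearmo": [201711, 201711],
    "region": ["SE", "NE"],
    "children": [
        {"children": [{"name": "['Auto']", "value": 690}], "value": 690},
        {"children": [{"name": "['Life']", "value": 450}], "value": 450},
    ],
})

def roll_yearmo_fixed(d):
    # Each cell of d.children is a dict with 'children' and 'value' keys,
    # so the subtotal is read from the dict rather than from a column.
    return [
        {"name": n, "value": c["value"], "children": c["children"]}
        for n, c in zip(d.region, d.children)
    ]

# Iterate over the groups directly instead of calling apply().
data_by_yearmo = {
    str(yearmo): roll_yearmo_fixed(group)
    for yearmo, group in rolled.groupby("yearmo")
}
```

Iterating over the groups keeps the result as plain Python dicts and lists, which avoids having pandas try to reassemble the nested objects into a new frame.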

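The whole thing feels very messy to me. I read about split-apply-combine, which inspired the use of groupby() and apply(), but I could really use a second opinion on the approach, because I'm fairly sure there is a better way. Any advice would be much appreciated.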

0 answers:

No answers yet