I have a pandas DataFrame in the following format:
# df name: cust_sim_data_product_agg:
   yearmo region                                           products  revenue
0  201711     CN  ['Auto', 'Flood', 'Home', 'Liability', 'Life',...      690
1  201711     CN     ['Auto', 'Flood', 'Home', 'Liability', 'Life']      610
2  201711     CN             ['Auto', 'Flood', 'Home', 'Liability']      560
3  201711     CN  ['Auto', 'Flood', 'Home', 'Life', 'Liability',...      690
4  201711     CN  ['Auto', 'Flood', 'Home', 'Life', 'Mortgage', ...      690
I would like to roll it up into nested JSON of the following form:
{
    yearmo: '201711',
    data: [
        {
            name: 'SE',
            value: 18090,  # sum of all the values in the level below
            children: [
                {
                    name: "['Auto', 'Flood', 'Home',...]",  # this is products from the dataframe
                    value: 690  # this is the revenue value
                },
                {
                    name: "['Flood', 'Home', 'Life'...]",
                    value: 690
                },
                ...
            ]
        },
        {
            name: 'NE',
            value: 16500,  # sum of all the values in the level below
            children: [
                {
                    name: "['Auto', 'Home',...]",
                    value: 210
                },
                {
                    name: "['Life'...]",
                    value: 450
                },
                ...
            ]
        }
    ]
},
{
    yearmo: '201712',
    data: [
        {
            name: 'SE',
            value: 24050,
            children: [ ... ]  # same format as above
        },
        {
            name: 'NE',
            value: 22400,
            children: [ ... ]  # same format as above
        }
    ]
}
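So each yearmo gets one element at the top level of the JSON. Inside data there is one entry per region, whose value is the sum of the values in the level directly below it. children is a list of dicts, each mapping products -> name and revenue -> value from the row-level data in the pandas DataFrame.
My best attempt so far is this: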
def roll_yearmo_rev(d):
    x1 = [{'name': n, 'value': v} for n, v in zip(d.products, d.revenue)]
    x2 = {'children': x1, 'value': sum(d.revenue)}
    return x2

def roll_yearmo(d):
    x1 = [{'name': n, 'children': c} for n, c in zip(d.region, d.children)]
    x2 = {'children': x1, 'value': sum(d.value)}
    return x2

cust_sim_data_product_agg_dict = cust_sim_data_product_agg.groupby(['yearmo', 'region'])\
    .apply(roll_yearmo_rev)
cust_sim_data_product_agg_dict = cust_sim_data_product_agg_dict.reset_index()
cust_sim_data_product_agg_dict.columns = ['yearmo', 'region', 'children']
cust_sim_data_product_agg_dict = cust_sim_data_product_agg_dict.groupby(['yearmo'])\
    .apply(roll_yearmo)
cust_sim_data_product_agg_dict = cust_sim_data_product_agg_dict.reset_index()
This fails, because the last roll-up raises the following error:
AttributeError: 'DataFrame' object has no attribute 'value'
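A possible reading of the error, sketched under the assumption that the first apply() returned one dict per group (which the successful three-column rename suggests): after that step the frame only has the columns yearmo, region and children, and the total is stored inside each children cell, so d.value in roll_yearmo has nothing to resolve to. The frame below is a hand-built stand-in with made-up numbers, not the real intermediate result.

```python
import pandas as pd

# Hand-built stand-in for the frame after the first roll-up and the column
# rename (made-up numbers). Each 'children' cell is the dict returned by
# roll_yearmo_rev, so the total sits inside the cell, not in its own column.
step1 = pd.DataFrame({
    "yearmo": [201711, 201711],
    "region": ["SE", "NE"],
    "children": [
        {"children": [{"name": ["Auto", "Home"], "value": 690}], "value": 690},
        {"children": [{"name": ["Life"], "value": 450}], "value": 450},
    ],
})

print(list(step1.columns))      # there is no 'value' column
print(hasattr(step1, "value"))  # so attribute access d.value cannot resolve

# The per-region totals can still be reached through the dicts themselves.
region_totals = step1["children"].map(lambda cell: cell["value"])
print(int(region_totals.sum()))
```

If that reading is right, the second roll-up would have to pull the totals out of the dicts (as region_totals does here) instead of reading a value attribute.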
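The whole thing feels messy to me. I read about split-apply-combine, which inspired the use of groupby() and apply(), but I could really use a second opinion on the approach, because I'm pretty sure there is a better way. Any suggestions would be greatly appreciated.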
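One alternative that may be simpler, shown only as a sketch: iterate over the groups directly and build plain dicts, instead of chaining apply() calls. The toy rows below are made up for illustration, region_node and build_nested are hypothetical helper names, and the sketch assumes that each products cell is a list and that the result should be a list with one element per yearmo.

```python
import json
import pandas as pd

# Toy data in the same shape as cust_sim_data_product_agg.
# The rows are made up for illustration; they are not the real data.
df = pd.DataFrame({
    "yearmo": [201711, 201711, 201711, 201712],
    "region": ["SE", "SE", "NE", "SE"],
    "products": [["Auto", "Home"], ["Life"], ["Auto"], ["Flood", "Home"]],
    "revenue": [690, 610, 210, 560],
})

def region_node(region, rows):
    """One region entry: its product rows as children, plus their revenue total."""
    children = [
        {"name": p, "value": int(v)}  # int() so json.dumps accepts NumPy integers
        for p, v in zip(rows["products"], rows["revenue"])
    ]
    return {
        "name": region,
        "value": sum(c["value"] for c in children),
        "children": children,
    }

def build_nested(frame):
    """One top-level element per yearmo, each holding one entry per region."""
    return [
        {
            "yearmo": str(yearmo),
            "data": [region_node(r, rows) for r, rows in by_month.groupby("region")],
        }
        for yearmo, by_month in frame.groupby("yearmo")
    ]

nested = build_nested(df)
print(json.dumps(nested, indent=2))
```

Because every level is built from ordinary dicts and lists, each region's value is computed from its own children, and the result can be passed straight to json.dumps. Note that groupby() sorts the group keys by default, so regions come out in alphabetical order unless sort=False is passed.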