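I'm trying to aggregate a dataset more than once, but I can't figure out the right way to do the aggregation with pandas. Given a dataset like this: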
donations = [
{
"amount": 100,
"organization": {
"name": "Org 1",
"total_budget": 8000,
"states": [
{
"name": "Maine",
"code": "ME"
},
{
"name": "Massachusetts",
"code": "MA"
}
]
}
},
{
"amount": 5000,
"organization": {
"name": "Org 2",
"total_budget": 10000,
"states": [
{
"name": "Massachusetts",
"code": "MA"
}
]
}
},
{
"amount": 5000,
"organization": {
"name": "Org 1",
"total_budget": 8000,
"states": [
{
"name": "Maine",
"code": "ME"
},
{
"name": "Massachusetts",
"code": "MA"
}
]
}
}
]
My desired output is a single aggregation, by state, of the total_budget and amount columns. I've gotten close with the following:
import pandas as pd

# Flatten to one row per (donation, state), carrying the donation and organization fields along
df = pd.json_normalize(donations, record_path=['organization', 'states'], meta=['amount', ['organization', 'total_budget'], ['organization', 'name']], record_prefix='states.')
grouped_df = df.groupby(['states.code', 'states.name', 'organization.name', 'organization.total_budget']).sum()
This gives me a breakdown by state that still includes the organization name:
MA  Massachusetts  Org 1   8000  5100
                   Org 2  10000  5000
ME  Maine          Org 1   8000  5100
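I know I need to keep the initial aggregation as it is to produce correct results, but I'm not sure what the last step should be to get the desired result, which groups these results by state: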
MA Massachusetts 18000 10100
ME Maine 8000 5100
Answer 0 (score: 0)
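I don't know whether this will work for your actual data; it is only checked against the sample you posted. The approach is to split the data frame by the value you want to aggregate, dropping duplicate rows from the budget slice so that each organization's budget is counted once per state. Then group and sum each slice by state, and merge the two resulting data frames.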
# Split the frame into one slice per value to aggregate
df_a = df[['states.code', 'states.name', 'organization.name', 'amount']]
# Keep one budget row per (state, organization) so budgets are not double-counted
df_o = df[['states.code', 'states.name', 'organization.name', 'organization.total_budget']].drop_duplicates()
amounts = df_a.groupby(['states.code', 'states.name'])['amount'].sum().reset_index()
budgets = df_o.groupby(['states.code', 'states.name'])['organization.total_budget'].sum().reset_index()
result = budgets.merge(amounts, on=['states.code', 'states.name'], how='inner')
print(result)
states.code states.name organization.total_budget amount
0 MA Massachusetts 18000 10100
1 ME Maine 8000 5100
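
An alternative that may be closer to what the question describes (keep the first aggregation, then group again by state) is a two-stage groupby. This is only a sketch against the sample data: the names flat, per_org and per_state are my own, and taking "first" for the budget assumes every row of an organization carries the same total_budget. The cast to int64 is there because json_normalize returns the meta columns as object dtype.

```python
import pandas as pd

# Sample data from the question, written compactly
org1 = {"name": "Org 1", "total_budget": 8000,
        "states": [{"name": "Maine", "code": "ME"},
                   {"name": "Massachusetts", "code": "MA"}]}
org2 = {"name": "Org 2", "total_budget": 10000,
        "states": [{"name": "Massachusetts", "code": "MA"}]}
donations = [
    {"amount": 100, "organization": org1},
    {"amount": 5000, "organization": org2},
    {"amount": 5000, "organization": org1},
]

# One row per (donation, state)
flat = pd.json_normalize(
    donations,
    record_path=["organization", "states"],
    meta=["amount", ["organization", "total_budget"], ["organization", "name"]],
    record_prefix="states.",
)
# Meta columns come back as object dtype; make them numeric
flat = flat.astype({"amount": "int64", "organization.total_budget": "int64"})

# Stage 1: per state and organization, sum the donations and take the budget once
per_org = flat.groupby(
    ["states.code", "states.name", "organization.name"], as_index=False
).agg({"organization.total_budget": "first", "amount": "sum"})

# Stage 2: collapse the organizations within each state
per_state = per_org.groupby(
    ["states.code", "states.name"], as_index=False
)[["organization.total_budget", "amount"]].sum()

print(per_state)
```

On the sample data this should give the same two rows as the merge above. If an organization's budget can differ between rows in the real data, "first" would silently pick one of them, so that assumption is worth checking before relying on it.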