I have a dataframe like the one shown below, and I need to write a function that gives me the following result.
Input parameters:
'INDIA'
'Student'
My input dataframe is:
Card Name Country Age Code Amount
0 AAA INDIA Young House 100
1 AAA Australia Old Hardware 200
2 AAA INDIA Student House 300
3 AAA US Young Hardware 600
4 AAA INDIA Student Electricity 200
5 BBB Australia Young Electricity 100
6 BBB INDIA Student Electricity 200
7 BBB Australia Young House 450
8 BBB INDIA Old House 150
9 CCC Australia Old Hardware 200
10 CCC Australia Young House 350
11 CCC INDIA Old Electricity 400
12 CCC US Young House 200
The expected output is:
Code Total Amount Frequency Average
0 Electricity 400 2 200
1 House 300 1 300
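That is, the top 10 codes for the given country (= INDIA) and age (= Student), ranked by the sum of Amount (in this example only 2 codes match, so we only get the top 2). The result should also have a new column "Frequency" holding the number of records in each group, and an "Average" column equal to Total Amount / Frequency.
I tried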
df.groupby(['Country','Age','Code']).agg({'Amount': sum})['Amount'].groupby(level=0, group_keys=False).nlargest(10)
which produces
Country Age Code
Australia Young House 800
Old Hardware 400
Young Electricity 100
INDIA Old Electricity 400
Student Electricity 400
House 300
Old House 150
Young House 100
US Young Hardware 600
House 200
Name: Amount, dtype: int64
Unfortunately, this is not the expected output.
Answer 0 (score: 3)
Given
>>> df
Card Name Country Age Code Amount
0 AAA INDIA Young House 100
1 AAA Australia Old Hardware 200
2 AAA INDIA Student House 300
3 AAA US Young Hardware 600
4 AAA INDIA Student Electricity 200
5 BBB Australia Young Electricity 100
6 BBB INDIA Student Electricity 200
7 BBB Australia Young House 450
8 BBB INDIA Old House 150
9 CCC Australia Old Hardware 200
10 CCC Australia Young House 350
11 CCC INDIA Old Electricity 400
12 CCC US Young House 200
you can first filter the dataframe:
>>> country = 'INDIA'
>>> age = 'Student'
>>> tmp = df[df.Country.eq(country) & df.Age.eq(age)].loc[:, ['Code', 'Amount']]
>>> tmp
Code Amount
2 House 300
4 Electricity 200
6 Electricity 200
...and then group:
>>> result = tmp.groupby('Code')['Amount'].agg([['Total Amount', 'sum'], ['Frequency', 'size'], ['Average', 'mean']]).reset_index()
>>> result
Code Total Amount Frequency Average
0 Electricity 400 2 200
1 House 300 1 300
If I have understood the top-10-by-total-amount condition correctly, you can then call
result.nlargest(10, 'Total Amount')
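Since the question asks for a function, the steps above can be wrapped up roughly as follows. This is only a sketch: the function name `top_codes` and its parameter `n` are made up for illustration, and it uses pandas named aggregation (keyword arguments to `agg`, which assumes pandas 0.25 or later) instead of the list-of-pairs form shown above.

```python
import pandas as pd


def top_codes(df, country, age, n=10):
    """Return up to n codes with the largest total Amount for one Country/Age pair.

    `top_codes` and `n` are hypothetical names, not part of the original answer.
    """
    # Step 1: keep only the rows matching both input parameters.
    subset = df[df["Country"].eq(country) & df["Age"].eq(age)]

    # Step 2: aggregate Amount per Code. Named aggregation maps each
    # output column name to an aggregation function.
    summary = (
        subset.groupby("Code")["Amount"]
        .agg(**{"Total Amount": "sum", "Frequency": "size", "Average": "mean"})
        .reset_index()
    )

    # Step 3: rank by total and keep at most n rows.
    return summary.nlargest(n, "Total Amount").reset_index(drop=True)


# Sample data rebuilt from the question.
df = pd.DataFrame({
    "Card Name": ["AAA"] * 5 + ["BBB"] * 4 + ["CCC"] * 4,
    "Country": ["INDIA", "Australia", "INDIA", "US", "INDIA", "Australia", "INDIA",
                "Australia", "INDIA", "Australia", "Australia", "INDIA", "US"],
    "Age": ["Young", "Old", "Student", "Young", "Student", "Young", "Student",
            "Young", "Old", "Old", "Young", "Old", "Young"],
    "Code": ["House", "Hardware", "House", "Hardware", "Electricity", "Electricity",
             "Electricity", "House", "House", "Hardware", "House", "Electricity", "House"],
    "Amount": [100, 200, 300, 600, 200, 100, 200, 450, 150, 200, 350, 400, 200],
})

print(top_codes(df, "INDIA", "Student"))
```

For the sample data this returns the Electricity row (total 400, frequency 2) followed by the House row (total 300, frequency 1). Note that `mean` produces a float column, so Average is 200.0 and 300.0 rather than integers.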