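I have a large DataFrame with two columns named currency and amount_in_euros. The currency column contains values such as EUR, GBR, etc., and amount_in_euros contains float values. I want to compute the sum for each currency (EUR, GBR, etc.) and put the currency with the largest sum in a new Series. I have to do this for each customer. How can I achieve this in pandas?
Input: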
Customer currency amount_in_euros
1 EUR 10
1 GBR 6
1 GBR 18
1 EUR 2
1 EUR 3
2 IND 12
.
.
.
Output:
Customer currency amount_in_euros max
1 EUR 10 GBR
1 GBR 6 GBR
1 GBR 18 GBR
1 EUR 2 GBR
1 EUR 3 GBR
2 IND 12 IND
.
.
.
So far I have tried:
df = pd.read_csv('analysis.csv')
res = pd.DataFrame()
for u, v in df.groupby(['Customer']):
    temp = v[['currency','amount_in_euros']].groupby(['currency'])['amount_in_euros'].sum().reset_index().sort_values('amount_in_euros', ascending=False)
    v['max'] = temp['currency'].iloc[0]
    res = res.append(v)
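The code above works fine for me, but it takes a long time because of the append operation. Please help me solve this. Thanks in advance.
Answer 0 (score: 4)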
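Use: aggregate with sum by the Customer and currency columns, sort with sort_values and keep the row with the max total per customer with drop_duplicates, build a Series with set_index, and finally map it onto the Customer column: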
df1 = df.groupby(['Customer', 'currency'], as_index=False)['amount_in_euros'].sum()
s = (df1.sort_values(['Customer','amount_in_euros'])
        .drop_duplicates('Customer', keep='last')
        .set_index('Customer')['currency'])
df['max'] = df['Customer'].map(s)
print (df)
Customer currency amount_in_euros max
0 1 EUR 10 GBR
1 1 GBR 6 GBR
2 1 GBR 18 GBR
3 1 EUR 2 GBR
4 1 EUR 3 GBR
5 2 IND 12 IND
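For comparison, here is a minimal self-contained sketch of the same idea using idxmax instead of sort_values plus drop_duplicates. It rebuilds the sample data from the question inline (an assumption, since the real data comes from analysis.csv), and the names totals, idx and top are only illustrative. One difference to be aware of: on ties idxmax keeps the first matching row, while the keep='last' variant above keeps the last one.

```python
import pandas as pd

# Sample data copied from the question (the real frame is read from a CSV).
df = pd.DataFrame({
    "Customer": [1, 1, 1, 1, 1, 2],
    "currency": ["EUR", "GBR", "GBR", "EUR", "EUR", "IND"],
    "amount_in_euros": [10, 6, 18, 2, 3, 12],
})

# One row per (Customer, currency) pair with the summed amount.
totals = df.groupby(["Customer", "currency"], as_index=False)["amount_in_euros"].sum()

# Row label of the largest total within each customer.
idx = totals.groupby("Customer")["amount_in_euros"].idxmax()

# Series indexed by Customer whose values are the top currency.
top = totals.loc[idx].set_index("Customer")["currency"]

# Broadcast the per-customer result back to every original row.
df["max"] = df["Customer"].map(top)
print(df)
```

Because nothing is appended inside a Python loop, the work stays in vectorized groupby operations, which is usually what removes the slowdown described in the question.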
Edit:
A similar solution for putting the first, second, and third largest values in new columns:
print (df)
Customer currency amount_in_euros
0 1 EUR 10
1 1 GBR 6
2 1 GBR 18
3 1 EUR 2
4 1 USD 1
5 1 USD 2
6 1 EUR 3
7 2 IND 12
8 2 USD 2
df1 = df.groupby(['Customer', 'currency'], as_index=False)['amount_in_euros'].sum()
df2 = df1.sort_values(['Customer','amount_in_euros'])
df2 = (df2.set_index(['Customer',
                      df2.groupby(['Customer']).cumcount(ascending=False)])['currency']
          .unstack()
          .add_prefix('max_'))
print (df2)
max_0 max_1 max_2
Customer
1 GBR EUR USD
2 IND USD None
df = df.join(df2, on='Customer')
print (df)
Customer currency amount_in_euros max_0 max_1 max_2
0 1 EUR 10 GBR EUR USD
1 1 GBR 6 GBR EUR USD
2 1 GBR 18 GBR EUR USD
3 1 EUR 2 GBR EUR USD
4 1 USD 1 GBR EUR USD
5 1 USD 2 GBR EUR USD
6 1 EUR 3 GBR EUR USD
7 2 IND 12 IND USD None
8 2 USD 2 IND USD None
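If only the top few currencies per customer are wanted, the ranking step can be capped. The following is a rough sketch under that assumption, not part of the answer above: the cap k and the helper column rank are hypothetical names, and the sample data is the frame from the edit rebuilt inline.

```python
import pandas as pd

# Sample data copied from the edit above.
df = pd.DataFrame({
    "Customer": [1, 1, 1, 1, 1, 1, 1, 2, 2],
    "currency": ["EUR", "GBR", "GBR", "EUR", "USD", "USD", "EUR", "IND", "USD"],
    "amount_in_euros": [10, 6, 18, 2, 1, 2, 3, 12, 2],
})

k = 2  # hypothetical cap: keep only the k largest currencies per customer

# Sum per (Customer, currency), then order each customer's totals descending.
totals = df.groupby(["Customer", "currency"], as_index=False)["amount_in_euros"].sum()
ranked = totals.sort_values(["Customer", "amount_in_euros"], ascending=[True, False])

# Position of each currency within its customer: 0 is the largest total.
ranked["rank"] = ranked.groupby("Customer").cumcount()

# Keep ranks below k and spread them into columns max_0, max_1, ...
top_k = (ranked[ranked["rank"] < k]
         .pivot(index="Customer", columns="rank", values="currency")
         .add_prefix("max_"))

out = df.join(top_k, on="Customer")
print(out)
```

Customers with fewer than k currencies would get missing values in the trailing columns, the same way max_2 is None for customer 2 in the output above.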