I have a CSV like this:
client1,client2,client3,client4,client5,client6,amount
,,,Comp1,,,4.475000
,,,Comp2,,,16.305584
,,,Comp3,,,4.050000
Comp2,Comp1,,Comp4,,,21.000000
,,,Comp4,,,30.000000
,Comp1,,Comp2,,,5.137500
,,,Comp3,,,52.650000
,,,Comp1,,,2.650000
Comp3,,,Comp3,,,29.000000
Comp5,,,Comp2,,,20.809000
Comp5,,,Comp2,,,15.100000
Comp5,,,Comp2,,,52.404000
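After reading it into a pandas DataFrame df, I want to aggregate in two steps.
Step 1:
First, I sum the amounts for each combination of clients: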
client1  client2  client3  client4  client5  client6  amount
                           Comp1                      7.125000
                           Comp2                      16.305584
                           Comp3                      56.700000
                           Comp4                      30.000000
         Comp1             Comp2                      5.137500
Comp2    Comp1             Comp4                      21.000000
Comp3                      Comp3                      29.000000
Comp5                      Comp2                      88.313000
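For reference, here is a minimal sketch of how a step 1 table like the one above could be produced. It assumes the CSV text is pasted inline as `CSV_TEXT` (a name chosen only for this example) and that empty cells are kept as empty strings via `keep_default_na=False`, so that `groupby` does not discard rows with missing clients. The original post does not say how the file was actually loaded.

```python
import io
import pandas as pd

# The data from the question, pasted inline (assumption: no file on disk)
CSV_TEXT = """\
client1,client2,client3,client4,client5,client6,amount
,,,Comp1,,,4.475000
,,,Comp2,,,16.305584
,,,Comp3,,,4.050000
Comp2,Comp1,,Comp4,,,21.000000
,,,Comp4,,,30.000000
,Comp1,,Comp2,,,5.137500
,,,Comp3,,,52.650000
,,,Comp1,,,2.650000
Comp3,,,Comp3,,,29.000000
Comp5,,,Comp2,,,20.809000
Comp5,,,Comp2,,,15.100000
Comp5,,,Comp2,,,52.404000
"""

client_cols = [f"client{i}" for i in range(1, 7)]

# keep_default_na=False leaves empty cells as "" instead of NaN,
# so every row keeps a valid group key
df = pd.read_csv(io.StringIO(CSV_TEXT), keep_default_na=False)

# Sum the amounts for each distinct combination of client columns
step1 = df.groupby(client_cols, as_index=False)["amount"].sum()
print(step1)
```

This yields eight groups, one per distinct combination of client names.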
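Then I want to aggregate by each client name, so that when several clients are involved, as in the fifth group, the amount is split equally among them: there, the 5.1375 must be divided equally between Comp1 and Comp2. I tried this: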
df.groupby(['client1','client2','client3','client4','client5','client6']).apply(lambda x: x['amount'].sum()/len(x) if x.any().nunique()>=1 else x['amount'].sum())
   client1  client2  client3  client4  client5  client6  0
0                             Comp1                      3.562500
1                             Comp2                      16.305584
2                             Comp3                      28.350000
3                             Comp4                      30.000000
4           Comp1             Comp2                      5.137500
5  Comp2    Comp1             Comp4                      21.000000
6  Comp3                      Comp3                      29.000000
7  Comp5                      Comp2                      29.437667
The expected output is:
Client Amount
Comp1 4.475+21/3+5.1375/2+2.65 = 16.69375
Comp2 16.305584+21/3+5.1375/2+20.809/2+15.10/2+52.404/2 = 70.030834
Comp3 4.05+52.65+29 = 85.7
Comp4 21/3+30 = 37
Comp5 20.809/2+15.10/2+52.404/2 = 44.1565
I tried using sum(axis=0), but that did not work.
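To make the splitting rule concrete, here is a minimal pure-Python sketch (no pandas) that divides each row's amount equally among the clients named in that row and accumulates the shares per client. `CSV_TEXT` and `totals` are names introduced only for this illustration. Note that for Comp2 the rule includes the 5.1375/2 share from the row that names both Comp1 and Comp2.

```python
import csv
import io
from collections import defaultdict

# The data from the question, pasted inline (assumption: no file on disk)
CSV_TEXT = """\
client1,client2,client3,client4,client5,client6,amount
,,,Comp1,,,4.475000
,,,Comp2,,,16.305584
,,,Comp3,,,4.050000
Comp2,Comp1,,Comp4,,,21.000000
,,,Comp4,,,30.000000
,Comp1,,Comp2,,,5.137500
,,,Comp3,,,52.650000
,,,Comp1,,,2.650000
Comp3,,,Comp3,,,29.000000
Comp5,,,Comp2,,,20.809000
Comp5,,,Comp2,,,15.100000
Comp5,,,Comp2,,,52.404000
"""

totals = defaultdict(float)
for row in csv.DictReader(io.StringIO(CSV_TEXT)):
    amount = float(row.pop("amount"))
    # Non-empty client cells in this row (a name may appear twice)
    clients = [name for name in row.values() if name]
    for name in clients:
        # Each named client receives an equal share of the row's amount
        totals[name] += amount / len(clients)

for name in sorted(totals):
    print(name, round(totals[name], 6))
```

A row that names the same client twice (Comp3 here) gives that client two half shares, so it still receives the full amount.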
Answer 0 (score: 3)
We can use a little math here:
import pandas as pd

cols = ['amount']
# Divide each amount by the number of non-null client fields in its row
df['new'] = df['amount'] / df.drop(columns=cols).notnull().sum(axis=1)
# Index the client columns by that share, unstack to long form, and drop the NaNs
x = df.drop(columns=cols).set_index('new').unstack().dropna()
# Build a dataframe from the shares (held in the index) and the client names (the values)
ndf = pd.DataFrame({'amount': x.index.droplevel(0).values, 'clients': x.values})
# Group by client and sum the shares
ndf.groupby('clients').sum()
Output:
            amount
clients
Comp1    16.360417
Comp2    69.697501
Comp3    85.700000
Comp4    36.666667
Comp5    44.156500
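The same divide-then-reshape idea can also be sketched with `DataFrame.melt`, which some readers may find easier to follow than the `set_index`/`unstack` step. This is an alternative formulation, not the answer author's code; `CSV_TEXT`, `share`, and `long_df` are names made up for the example. Run on the CSV exactly as posted in the question, it gives 16.693750 for Comp1, 70.030834 for Comp2, and 37.000000 for Comp4, since the 21.0 row contributes 7.0 to each of its three clients.

```python
import io
import pandas as pd

# The data from the question, pasted inline (assumption: no file on disk)
CSV_TEXT = """\
client1,client2,client3,client4,client5,client6,amount
,,,Comp1,,,4.475000
,,,Comp2,,,16.305584
,,,Comp3,,,4.050000
Comp2,Comp1,,Comp4,,,21.000000
,,,Comp4,,,30.000000
,Comp1,,Comp2,,,5.137500
,,,Comp3,,,52.650000
,,,Comp1,,,2.650000
Comp3,,,Comp3,,,29.000000
Comp5,,,Comp2,,,20.809000
Comp5,,,Comp2,,,15.100000
Comp5,,,Comp2,,,52.404000
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))  # empty cells become NaN
client_cols = [c for c in df.columns if c.startswith("client")]

# Per-client share: the row's amount divided by how many clients it names
df["share"] = df["amount"] / df[client_cols].notna().sum(axis=1)

# Long form: one line per (row, client column) pair, then drop the empty cells
long_df = df.melt(id_vars="share", value_vars=client_cols, value_name="client")
result = long_df.dropna(subset=["client"]).groupby("client")["share"].sum()
print(result)
```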
Answer 1 (score: 2)
I would organize it like this:
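d = df.drop(columns='amount')   # new df without `amount`
a = df.amount                   # separate series of `amount`
c = d.count(axis=1)             # count of non-null values per row
# Repeat each per-client share once per client in the row, then group by the
# client names in row order (dropna() guards against a stack() that keeps NaNs)
a.div(c).repeat(c).groupby(d.stack().dropna().values).sum()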
Comp1 16.693750
Comp2 70.030834
Comp3 85.700000
Comp4 37.000000
Comp5 44.156500
dtype: float64
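To see why the repeated shares and the stacked names line up, here is a small walk-through on just two of the rows (the 21.0 row and the 5.1375 row). The two-row frame is built by hand for illustration and omits the always-empty columns; `shares` and `labels` are names introduced for this sketch.

```python
import pandas as pd

# Two rows from the question, with the always-empty columns left out
d = pd.DataFrame({
    "client1": ["Comp2", None],
    "client2": ["Comp1", "Comp1"],
    "client4": ["Comp4", "Comp2"],
})
a = pd.Series([21.0, 5.1375], name="amount")

c = d.count(axis=1)                     # clients per row: 3 and 2
shares = a.div(c).repeat(c)             # 7.0 three times, then 2.56875 twice
labels = d.stack().dropna().to_numpy()  # Comp2, Comp1, Comp4, Comp1, Comp2

# Both sequences are in row-major order, so position i of `shares`
# belongs to the client at position i of `labels`
result = shares.groupby(labels).sum()
print(result)
```

Here Comp1 and Comp2 each end up with 7.0 + 2.56875 = 9.56875 and Comp4 with 7.0, and the shares still add up to the original 26.1375.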