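I have a DataFrame like this:
Member Category Total
1001 1 5
1001 2 4
1001 3 9
1003 1 7
1003 2 5
1003 3 2
1005 1 2
1005 3 5
I need to get:
Member Category Total Average
1001 1 5 0.27
1001 2 4 0.22
1001 3 9 0.5
1003 1 7 0.5
1003 2 5 0.35
1003 3 2 0.15
1005 1 2 0.28
1005 3 5 0.72
That is, each row's Total as a fraction of its member's overall total. For example, member 1001 has a total of 18, and category 1 accounts for 5 of those, roughly 27% of the total, so the value is 0.27.
However, what I tried not only does not work, it is also far too slow because my dataset is very large.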
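For reference, the arithmetic behind the expected column can be checked by hand. The following is a minimal sketch in plain Python for member 1001 only; the names `totals`, `member_sum` and `shares` are illustrative and not part of the original data.

```python
# Totals for member 1001, copied from the table in the question
totals = {1: 5, 2: 4, 3: 9}

# Overall total for this member
member_sum = sum(totals.values())

# Share of each category in the member's total
shares = {cat: total / member_sum for cat, total in totals.items()}

for cat, share in shares.items():
    print(cat, round(share, 6))
```

The exact share for category 1 is 5/18, which the expected table shows shortened to two decimals.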
Answer 0 (score: 1)
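# group_keys=False keeps the original row index so the result aligns with df
df['ave'] = df.groupby('Member', group_keys=False)['Total'].apply(lambda x: x / x.sum())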
df
Out[318]:
Member Category Total ave
0 1001 1 5 0.277778
1 1001 2 4 0.222222
2 1001 3 9 0.500000
3 1003 1 7 0.500000
4 1003 2 5 0.357143
5 1003 3 2 0.142857
6 1005 1 2 0.285714
7 1005 3 5 0.714286
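A quick way to sanity-check this result is to confirm that the shares of each member add up to 1. The sketch below rebuilds the sample data from the question and passes `group_keys=False`, which is an addition on my part: as far as I know, recent pandas versions may otherwise prepend the group key to the index of the result, and the assignment would then fail to align. The `check` variable is illustrative only.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'Member':   [1001, 1001, 1001, 1003, 1003, 1003, 1005, 1005],
    'Category': [1, 2, 3, 1, 2, 3, 1, 3],
    'Total':    [5, 4, 9, 7, 5, 2, 2, 5],
})

# group_keys=False (an assumption, see above) keeps the original row index
df['ave'] = (df.groupby('Member', group_keys=False)['Total']
               .apply(lambda x: x / x.sum()))

# Within each member the shares should sum to 1
check = df.groupby('Member')['ave'].sum()
print(np.allclose(check, 1.0))
```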
Answer 1 (score: 1)
Use transform with sum to get the per-member total, then divide the Total column by it:
df['Average'] = df['Total'] / df.groupby('Member')['Total'].transform('sum')
print (df)
Member Category Total Average
0 1001 1 5 0.277778
1 1001 2 4 0.222222
2 1001 3 9 0.500000
3 1003 1 7 0.500000
4 1003 2 5 0.357143
5 1003 3 2 0.142857
6 1005 1 2 0.285714
7 1005 3 5 0.714286
Details:
print (df.groupby('Member')['Total'].transform('sum'))
0 18
1 18
2 18
3 14
4 14
5 14
6 7
7 7
Name: Total, dtype: int64
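The output above is the key point: `transform('sum')` returns one value per row, indexed like the original DataFrame, whereas a plain `sum()` returns one value per group. A minimal sketch of the difference, rebuilding the sample data from the question (the names `grouped`, `per_group` and `per_row` are illustrative):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'Member':   [1001, 1001, 1001, 1003, 1003, 1003, 1005, 1005],
    'Category': [1, 2, 3, 1, 2, 3, 1, 3],
    'Total':    [5, 4, 9, 7, 5, 2, 2, 5],
})

grouped = df.groupby('Member')['Total']

per_group = grouped.sum()            # one value per member, indexed by Member
per_row = grouped.transform('sum')   # one value per row, indexed like df

print(per_group.shape, per_row.shape)

# per_row shares df's index, so the division lines up row by row
df['Average'] = df['Total'] / per_row
print(df)
```

Because the broadcast sum already has the right shape, no merge or explicit loop over groups is needed.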
Alternative solution:
df['Average'] = df['Total'] / df['Member'].map(df.groupby('Member')['Total'].sum())
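To make the one-liner above easier to follow, here it is split into two steps: first a lookup Series of sums keyed by Member, then `map` to look up each row's member in it. This is only a sketch on the sample data from the question; `sums` and `member_total` are illustrative names.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'Member':   [1001, 1001, 1001, 1003, 1003, 1003, 1005, 1005],
    'Category': [1, 2, 3, 1, 2, 3, 1, 3],
    'Total':    [5, 4, 9, 7, 5, 2, 2, 5],
})

# Lookup table: Member -> sum of Total for that member
sums = df.groupby('Member')['Total'].sum()

# map replaces each Member value with the matching entry of the lookup table
member_total = df['Member'].map(sums)

df['Average'] = df['Total'] / member_total
print(df)
```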
Timings:
import numpy as np
import pandas as pd

np.random.seed(123)
N = 100000
# 100,000 rows of random integers in [0, 100) for Member, Category and Total
df = pd.DataFrame(np.random.randint(100, size=(N, 3)), columns=['Member','Category','Total'])
df = df.sort_values(['Member','Category']).reset_index(drop=True)
# Wen's solution (groupby + apply)
In [395]: %timeit df.groupby('Member').Total.apply(lambda x : x/sum(x))
10 loops, best of 3: 31.2 ms per loop
# transform solution
In [396]: %timeit df['Total'] / df.groupby('Member')['Total'].transform('sum')
100 loops, best of 3: 5.11 ms per loop
# alternative, slightly slower solution (map)
In [397]: %timeit df['Total'] / df['Member'].map(df.groupby('Member')['Total'].sum())
100 loops, best of 3: 9.92 ms per loop
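Going by the figures above, transform is roughly 6 times faster than apply (31.2 ms vs 5.11 ms) and about twice as fast as map (9.92 ms vs 5.11 ms) on this data. `%timeit` is an IPython magic, so outside IPython the comparison can be repeated with the standard `timeit` module. The sketch below is one way to do that, not the setup used for the numbers above: the wrapper function names are illustrative, `group_keys=False` is my addition so that the apply variant keeps the original index, and absolute timings will vary with the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Same synthetic data as in the timings above
np.random.seed(123)
N = 100000
df = pd.DataFrame(np.random.randint(100, size=(N, 3)),
                  columns=['Member', 'Category', 'Total'])
df = df.sort_values(['Member', 'Category']).reset_index(drop=True)


def with_apply():
    # group_keys=False is an assumption added here, see the note above
    return (df.groupby('Member', group_keys=False)['Total']
              .apply(lambda x: x / x.sum()))


def with_transform():
    return df['Total'] / df.groupby('Member')['Total'].transform('sum')


def with_map():
    return df['Total'] / df['Member'].map(df.groupby('Member')['Total'].sum())


for func in (with_apply, with_transform, with_map):
    # Best of 3 repeats, 10 calls each, reported per call
    best = min(timeit.repeat(func, number=10, repeat=3)) / 10
    print(f'{func.__name__}: {best * 1000:.2f} ms per call')
```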