I have a DataFrame like the following:
Class| Student| V1| V2| V3| wb
A| Max| 10| 12| 14| 1
A| Ann| 9| 6| 7| 0.9
B| Tom| 6| 7| 10| 0.3
B| Dick| 3| 8| 7| 0.7
C| Dibs| 5| 2| 3| 0.8
C| Mock| 6| 4| 3| 0.6
D| Sunny| 3| 4| 5| 0.9
D| Lock| 8| 3| 6| 1
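I want to compute the weighted average of V1, V2 and V3, grouped by Class, using wb as the weight. The result should look something like this: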
Class V1_M V2_M V3_M
A 9 8 3
B 5 3 3
C 4 4 3
So far I can do this by handling each column separately, but that feels very inefficient.
This is the code for one variable:
import pandas as pd
import numpy as np
def wtdavg(frame, var, wb):
    d = frame[var]
    w = frame[wb]
    return (d * w).sum() / w.sum()
df = pd.read_csv('Sample.csv')
Matrix = df.groupby(['Class']).apply(wtdavg,var='V2',wb='wb')
print(Matrix)
I am a beginner with one week of pandas experience. Thanks in advance.
Max
Answer 0: (score: 1)
# use apply to calculate the weighted mean for all 3 columns in one go
df2 = df.groupby('Class').apply(lambda x: pd.Series([sum(x.V1*x.wb)/sum(x.wb), sum(x.V2*x.wb)/sum(x.wb), sum(x.V3*x.wb)/sum(x.wb)]))
#rename columns
df2.columns=['V1_M','V2_M','V3_M']
df2
Out[858]:
V1_M V2_M V3_M
Class
A 9.526316 9.157895 10.684211
B 3.900000 7.700000 7.900000
C 5.428571 2.857143 3.000000
D 5.631579 3.473684 5.526316
Update
#put all your variable names in a list (can be copied over from df.columns)
var_cols = ['V1', 'V2', 'V3']
df2 = df.groupby('Class').apply(lambda x: pd.Series([sum(x[v]*x.wb)/sum(x.wb) for v in var_cols]))
df2.columns = [e+'_M' for e in var_cols]
V1_M V2_M V3_M
Class
A 9.526316 9.157895 10.684211
B 3.900000 7.700000 7.900000
C 5.428571 2.857143 3.000000
D 5.631579 3.473684 5.526316
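As an alternative to calling apply with a lambda, the same numbers can probably be obtained with plain column arithmetic: weight the columns once, sum per class, and divide by the per-class weight total. This is only a sketch; the DataFrame is rebuilt inline from the table in the question as a stand-in for Sample.csv, and the names weighted_sums and weight_totals are illustrative.

```python
import pandas as pd

# Sample data copied from the question (stand-in for Sample.csv)
df = pd.DataFrame({
    'Class': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'],
    'Student': ['Max', 'Ann', 'Tom', 'Dick', 'Dibs', 'Mock', 'Sunny', 'Lock'],
    'V1': [10, 9, 6, 3, 5, 6, 3, 8],
    'V2': [12, 6, 7, 8, 2, 4, 4, 3],
    'V3': [14, 7, 10, 7, 3, 3, 5, 6],
    'wb': [1, 0.9, 0.3, 0.7, 0.8, 0.6, 0.9, 1],
})

var_cols = ['V1', 'V2', 'V3']

# Multiply every value column by its row weight, then sum within each class
weighted_sums = df[var_cols].mul(df['wb'], axis=0).groupby(df['Class']).sum()

# Total weight per class
weight_totals = df.groupby('Class')['wb'].sum()

# Divide each class row by its weight total (aligned on the Class index)
df2 = weighted_sums.div(weight_totals, axis=0).add_suffix('_M')
print(df2)
```

Because no Python function is called per group, this form tends to scale better when there are many classes, though for a table this small the difference is negligible.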
Answer 1: (score: 1)
A more general solution:
1. Create weighted averages for all columns other than Student and Class (wb is used as the weight):
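# errors='ignore' keeps the drop working when pandas leaves the grouping column out of each group
df2 = df.drop('Student', axis=1) \
        .groupby('Class') \
        .apply(lambda x: x.drop(['Class', 'wb'], axis=1, errors='ignore').mul(x.wb, 0).sum() / (x.wb).sum()) \
        .add_suffix('_M') \
        .reset_index()
print(df2)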
Class V1_M V2_M V3_M
0 A 9.526316 9.157895 10.684211
1 B 3.900000 7.700000 7.900000
2 C 5.428571 2.857143 3.000000
3 D 5.631579 3.473684 5.526316
Or you can name the columns to average explicitly:
df2 = df.groupby('Class') \
.apply(lambda x: x[['V1', 'V2', 'V3']].mul(x.wb, 0).sum() / (x.wb).sum()) \
.add_suffix('_M') \
.reset_index()
print (df2)
Class V1_M V2_M V3_M
0 A 9.526316 9.157895 10.684211
1 B 3.900000 7.700000 7.900000
2 C 5.428571 2.857143 3.000000
3 D 5.631579 3.473684 5.526316
More generally, use filter to select all columns whose names start with V:
df2 = df.groupby('Class') \
.apply(lambda x: x.filter(regex='^V').mul(x.wb, 0).sum() / (x.wb).sum()) \
.add_suffix('_M') \
.reset_index()
print (df2)
Class V1_M V2_M V3_M
0 A 9.526316 9.157895 10.684211
1 B 3.900000 7.700000 7.900000
2 C 5.428571 2.857143 3.000000
3 D 5.631579 3.473684 5.526316
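If you prefer not to spell out the multiply-and-divide step, numpy.average accepts a weights argument and can be used inside apply. The following is a minimal sketch, not part of the answer above: weighted_means is a hypothetical helper name, and the DataFrame is rebuilt inline from the question's table in place of Sample.csv.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question (stand-in for Sample.csv)
df = pd.DataFrame({
    'Class': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'],
    'Student': ['Max', 'Ann', 'Tom', 'Dick', 'Dibs', 'Mock', 'Sunny', 'Lock'],
    'V1': [10, 9, 6, 3, 5, 6, 3, 8],
    'V2': [12, 6, 7, 8, 2, 4, 4, 3],
    'V3': [14, 7, 10, 7, 3, 3, 5, 6],
    'wb': [1, 0.9, 0.3, 0.7, 0.8, 0.6, 0.9, 1],
})

def weighted_means(group, value_cols, weight_col):
    # Hypothetical helper: weighted mean of each value column within one group
    values = np.average(
        group[value_cols].to_numpy(),
        axis=0,
        weights=group[weight_col].to_numpy(),
    )
    return pd.Series(values, index=value_cols)

value_cols = ['V1', 'V2', 'V3']

# Selecting the needed columns first means each group holds only those columns
df2 = (
    df.groupby('Class')[value_cols + ['wb']]
      .apply(weighted_means, value_cols=value_cols, weight_col='wb')
      .add_suffix('_M')
      .reset_index()
)
print(df2)
```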
Answer 2: (score: 0)
import pandas as pd
import numpy as np
def wtdavg(frame, var, wb):
    d = frame[var]
    w = frame[wb]
    return (d * w).sum() / w.sum()
df = pd.read_csv('Sample.csv')
temp_df = pd.DataFrame()
# Note: this loop takes the plain (unweighted) mean of each int64 column per class
for column in df.columns:
    if df[column].dtype == np.int64:
        temp_S = pd.DataFrame(df[column].groupby(df['Class']).mean())
        frames = [temp_df, temp_S]
        temp_df = pd.concat(frames, axis='columns')
print(temp_df)
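Note that the loop above calls mean(), so it produces plain per-class averages and never uses wtdavg or the wb column. A possible weighted variant that keeps the same column-by-column structure is sketched below. It assumes the numeric columns other than wb are the ones to average, and it rebuilds the DataFrame inline from the question's table instead of reading Sample.csv; the names weight_col, value_cols and result are illustrative.

```python
import pandas as pd

# Sample data copied from the question (stand-in for Sample.csv)
df = pd.DataFrame({
    'Class': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'],
    'Student': ['Max', 'Ann', 'Tom', 'Dick', 'Dibs', 'Mock', 'Sunny', 'Lock'],
    'V1': [10, 9, 6, 3, 5, 6, 3, 8],
    'V2': [12, 6, 7, 8, 2, 4, 4, 3],
    'V3': [14, 7, 10, 7, 3, 3, 5, 6],
    'wb': [1, 0.9, 0.3, 0.7, 0.8, 0.6, 0.9, 1],
})

def wtdavg(frame, var, wb):
    # Weighted mean of column `var` using column `wb` as weights (from the question)
    d = frame[var]
    w = frame[wb]
    return (d * w).sum() / w.sum()

weight_col = 'wb'

# Assumption: every numeric column except the weight column should be averaged
value_cols = df.select_dtypes(include='number').columns.drop(weight_col)

# Group once, keeping only the columns wtdavg needs
grouped = df.groupby('Class')[list(value_cols) + [weight_col]]

# One weighted-mean Series per value column, assembled into a single DataFrame
result = pd.DataFrame({
    f'{col}_M': grouped.apply(wtdavg, var=col, wb=weight_col)
    for col in value_cols
})
print(result)
```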