Weighted average of multiple columns in a Pandas DataFrame

Time: 2017-05-13 04:00:47

Tags: python pandas

I have a DataFrame like the following:

Class | Student | V1 | V2 | V3 | wb
A     | Max     | 10 | 12 | 14 | 1
A     | Ann     | 9  | 6  | 7  | 0.9
B     | Tom     | 6  | 7  | 10 | 0.3
B     | Dick    | 3  | 8  | 7  | 0.7
C     | Dibs    | 5  | 2  | 3  | 0.8
C     | Mock    | 6  | 4  | 3  | 0.6
D     | Sunny   | 3  | 4  | 5  | 0.9
D     | Lock    | 8  | 3  | 6  | 1

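I want to compute the weighted average of V1, V2 and V3, grouped by Class and using wb as the weight. The result should be laid out like this: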

Class  V1_M  V2_M  V3_M
A      9     8     3
B      5     3     3
C      4     4     3

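So far I can do this by processing the DataFrame one column at a time, but that feels very inefficient.

Here is the code for a single variable: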

import pandas as pd
import numpy as np

def wtdavg(frame, var, wb):
  d = frame[var]
  w = frame[wb]
  return (d * w).sum() / w.sum()

df = pd.read_csv('Sample.csv')
Matrix = df.groupby(['Class']).apply(wtdavg,var='V2',wb='wb')
print(Matrix)
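
For readers without Sample.csv, here is a minimal sketch that rebuilds the table above inline and runs the same single-column function. The inline DataFrame is an assumption standing in for the CSV file, and the column selection before apply is only there to hand the function just the columns it uses.

```python
import pandas as pd

# Inline copy of the table from the question (assumed to match Sample.csv).
df = pd.DataFrame({
    'Class':   ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'],
    'Student': ['Max', 'Ann', 'Tom', 'Dick', 'Dibs', 'Mock', 'Sunny', 'Lock'],
    'V1': [10, 9, 6, 3, 5, 6, 3, 8],
    'V2': [12, 6, 7, 8, 2, 4, 4, 3],
    'V3': [14, 7, 10, 7, 3, 3, 5, 6],
    'wb': [1, 0.9, 0.3, 0.7, 0.8, 0.6, 0.9, 1],
})

def wtdavg(frame, var, wb):
    # Weighted mean: sum(value * weight) / sum(weight)
    d = frame[var]
    w = frame[wb]
    return (d * w).sum() / w.sum()

# One weighted mean per Class for the single column V2.
Matrix = df.groupby('Class')[['V2', 'wb']].apply(wtdavg, var='V2', wb='wb')
print(Matrix)
```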

I'm a beginner with one week of pandas experience. Thanks in advance.

Max

3 answers:

Answer 0 (score: 1)

# use apply to calculate the weighted mean for all 3 columns in one go
df2 = df.groupby('Class').apply(lambda x: pd.Series([sum(x.V1*x.wb)/sum(x.wb), sum(x.V2*x.wb)/sum(x.wb), sum(x.V3*x.wb)/sum(x.wb)]))
#rename columns
df2.columns=['V1_M','V2_M','V3_M']

df2
Out[858]: 
           V1_M      V2_M       V3_M
Class                               
A      9.526316  9.157895  10.684211
B      3.900000  7.700000   7.900000
C      5.428571  2.857143   3.000000
D      5.631579  3.473684   5.526316

Update:

#put all your variable names in a list (can be copied over from df.columns)
var_cols = ['V1', 'V2', 'V3']
df2 = df.groupby('Class').apply(lambda x: pd.Series([sum(x[v]*x.wb)/sum(x.wb) for v in var_cols]))
df2.columns = [e+'_M' for e in var_cols]
           V1_M      V2_M       V3_M
Class                               
A      9.526316  9.157895  10.684211
B      3.900000  7.700000   7.900000
C      5.428571  2.857143   3.000000
D      5.631579  3.473684   5.526316
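
As a variation on the approach above, the per-group arithmetic can be delegated to numpy.average, which accepts a weights argument. This is only a sketch, not the answerer's code: the helper name weighted_means and the inline DataFrame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Inline copy of the question's table (assumed to match Sample.csv).
df = pd.DataFrame({
    'Class':   ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'],
    'Student': ['Max', 'Ann', 'Tom', 'Dick', 'Dibs', 'Mock', 'Sunny', 'Lock'],
    'V1': [10, 9, 6, 3, 5, 6, 3, 8],
    'V2': [12, 6, 7, 8, 2, 4, 4, 3],
    'V3': [14, 7, 10, 7, 3, 3, 5, 6],
    'wb': [1, 0.9, 0.3, 0.7, 0.8, 0.6, 0.9, 1],
})

var_cols = ['V1', 'V2', 'V3']

def weighted_means(group):
    # Hypothetical helper: with axis=0, np.average applies the same
    # row weights to every value column of the group.
    vals = np.average(group[var_cols].to_numpy(), axis=0,
                      weights=group['wb'].to_numpy())
    return pd.Series(vals, index=[c + '_M' for c in var_cols])

# Select only the needed columns, then compute one row per Class.
result = df.groupby('Class')[var_cols + ['wb']].apply(weighted_means)
print(result)
```

Returning a Series with named entries means the output columns are labelled directly, so no separate renaming step is needed.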

Answer 1 (score: 1)

A more general solution:

Compute the weighted average for all columns except Student and Class:

df2 = df.drop('Student', axis=1) \
        .groupby('Class') \
        .apply(lambda x: x.drop(['Class', 'wb'], axis=1).mul(x.wb, 0).sum() / (x.wb).sum()) \
        .add_suffix('_M') \
        .reset_index()
print (df2)
  Class      V1_M      V2_M       V3_M
0     A  9.526316  9.157895  10.684211
1     B  3.900000  7.700000   7.900000
2     C  5.428571  2.857143   3.000000
3     D  5.631579  3.473684   5.526316

Alternatively, you can list the columns to average explicitly:

df2 = df.groupby('Class') \
        .apply(lambda x: x[['V1', 'V2', 'V3']].mul(x.wb, 0).sum() / (x.wb).sum()) \
        .add_suffix('_M') \
        .reset_index()
print (df2)
  Class      V1_M      V2_M       V3_M
0     A  9.526316  9.157895  10.684211
1     B  3.900000  7.700000   7.900000
2     C  5.428571  2.857143   3.000000
3     D  5.631579  3.473684   5.526316

More generally, use filter to select every column whose name starts with V:

df2 = df.groupby('Class') \
        .apply(lambda x: x.filter(regex='^V').mul(x.wb, 0).sum() / (x.wb).sum()) \
        .add_suffix('_M') \
        .reset_index()
print (df2)
  Class      V1_M      V2_M       V3_M
0     A  9.526316  9.157895  10.684211
1     B  3.900000  7.700000   7.900000
2     C  5.428571  2.857143   3.000000
3     D  5.631579  3.473684   5.526316
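
The same result can probably also be reached without apply, by weighting the columns first and then using two grouped sums. The sketch below is an alternative, not part of the answer above; the inline DataFrame and the intermediate names weighted, num and den are assumptions for illustration.

```python
import pandas as pd

# Inline copy of the question's table (assumed to match Sample.csv).
df = pd.DataFrame({
    'Class':   ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'],
    'Student': ['Max', 'Ann', 'Tom', 'Dick', 'Dibs', 'Mock', 'Sunny', 'Lock'],
    'V1': [10, 9, 6, 3, 5, 6, 3, 8],
    'V2': [12, 6, 7, 8, 2, 4, 4, 3],
    'V3': [14, 7, 10, 7, 3, 3, 5, 6],
    'wb': [1, 0.9, 0.3, 0.7, 0.8, 0.6, 0.9, 1],
})

var_cols = ['V1', 'V2', 'V3']

# Multiply every value column by the row weight.
weighted = df[var_cols].mul(df['wb'], axis=0)

# Numerator: per-class sum of value * weight. Denominator: per-class sum of weights.
num = weighted.groupby(df['Class']).sum()
den = df.groupby('Class')['wb'].sum()

# Divide row-wise (aligned on the Class index) and tidy up the labels.
df2 = num.div(den, axis=0).add_suffix('_M').reset_index()
print(df2)
```

Because this uses only built-in grouped sums, it avoids calling a Python lambda once per group, which tends to matter when there are many groups.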

Answer 2 (score: 0)

import pandas as pd
import numpy as np

def wtdavg(frame, var, wb):
  d = frame[var]
  w = frame[wb]
  return (d * w).sum() / w.sum()

df = pd.read_csv('Sample.csv')
temp_df = pd.DataFrame()
for column in df.columns:
    if df[column].dtype == np.int64:
        temp_S = pd.DataFrame( df[column].groupby(df['Class']).mean())
        frames = [temp_df, temp_S]
        temp_df = pd.concat(frames, axis = 'columns')
print(temp_df)
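
Note that the loop above takes the plain group mean of each integer column and never calls wtdavg. A possible adaptation that keeps the loop but applies the weights is sketched below; the inline DataFrame, the results dictionary and the use of pd.api.types.is_integer_dtype in place of the np.int64 comparison are assumptions, not part of the original answer.

```python
import pandas as pd

# Inline copy of the question's table (assumed to match Sample.csv).
df = pd.DataFrame({
    'Class':   ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'],
    'Student': ['Max', 'Ann', 'Tom', 'Dick', 'Dibs', 'Mock', 'Sunny', 'Lock'],
    'V1': [10, 9, 6, 3, 5, 6, 3, 8],
    'V2': [12, 6, 7, 8, 2, 4, 4, 3],
    'V3': [14, 7, 10, 7, 3, 3, 5, 6],
    'wb': [1, 0.9, 0.3, 0.7, 0.8, 0.6, 0.9, 1],
})

def wtdavg(frame, var, wb):
    # Weighted mean: sum(value * weight) / sum(weight)
    d = frame[var]
    w = frame[wb]
    return (d * w).sum() / w.sum()

results = {}
for column in df.columns:
    # Only the integer columns (V1, V2, V3 here) are averaged; wb is float.
    if pd.api.types.is_integer_dtype(df[column]):
        results[column + '_M'] = (
            df.groupby('Class')[[column, 'wb']]
              .apply(wtdavg, var=column, wb='wb')
        )

# Each dictionary value is a Series indexed by Class; combine them column-wise.
temp_df = pd.DataFrame(results)
print(temp_df)
```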