Weighted average with pandas

Date: 2018-08-30 06:13:42

Tags: pandas

I'm using pandas to compute a weighted average for many columns. In some cases the weights can sum to zero, so I use np.ma.average:

import pandas as pd
import numpy as np

df = pd.DataFrame.from_dict(dict([('ID', [1, 1, 1]),('HeightA', [1, 2, 3]), ('WeightA', [0, 0, 0]),('HeightB', [2, 4, 6]), ('WeightB', [1, 2, 4])]))

>>df
   ID  HeightA  WeightA  HeightB  WeightB
0   1        1        0        2        1
1   1        2        0        4        2
2   1        3        0        6        4


wmA = lambda x: np.ma.average(x, weights=df.loc[x.index, "WeightA"])
wmB = lambda x: np.ma.average(x, weights=df.loc[x.index, "WeightB"])
f = {'HeightA':wmA,'HeightB':wmB}
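# Select several columns with a list (double brackets); tuple-style selection was removed in pandas 1.0
df2 = df.groupby(['ID'])[['HeightA','HeightB']].agg(f)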

This works, but I have many height and weight columns and don't want to write a lambda for each one, so I tried:

def givewm(data,weightcolumn):
    return np.ma.average(data, weights=data.loc[data.index, weightcolumn])

f = {'HeightA':givewm(df,'WeightA'),'HeightB':givewm(df,'WeightB')}
df2 = df.groupby(['ID'])[['HeightA','HeightB']].agg(f)

This raises the error: builtins.TypeError: Axis must be specified when shapes of a and weights differ.

How can I write a function that takes the name of the weight column as input and returns the weighted mean?

1 answer:

Answer 0: (score: 1)

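Use a nested function, a solution taken from GitHub. givewm receives the weight column name and returns the function that agg then calls on each group: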

df = (pd.DataFrame.from_dict(dict([('ID', [1, 1, 1]),
                                  ('HeightA', [1, 2, 3]), 
                                  ('WeightA', [10, 20, 30]),
                                  ('HeightB', [2, 4, 6]), 
                                  ('WeightB', [1, 2, 4])])))


print (df)
   ID  HeightA  WeightA  HeightB  WeightB
0   1        1       10        2        1
1   1        2       20        4        2
2   1        3       30        6        4


def givewm(weightcolumn):
    def f1(x):
        return np.ma.average(x, weights=df.loc[x.index, weightcolumn])
    return f1

f = {'HeightA':givewm('WeightA'),'HeightB':givewm('WeightB')}
df2 = df.groupby('ID').agg(f)
print (df2)
     HeightA   HeightB
ID                    
1   2.333333  4.857143
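For context on the error in the question, here is a minimal sketch of what probably goes wrong there. The variable names `data_shape`, `weights_shape` and `error_message` are made up for illustration, and `.to_numpy()` is used only to make the shapes explicit. In the question, `givewm(df, 'WeightA')` runs immediately while the dict is being built, so `np.ma.average` receives the whole DataFrame rather than one column of one group:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1],
                   'HeightA': [1, 2, 3], 'WeightA': [10, 20, 30],
                   'HeightB': [2, 4, 6], 'WeightB': [1, 2, 4]})

# The question's givewm(df, 'WeightA') passes the whole frame as the data
# and a single column as the weights, so the two shapes do not match.
data_shape = df.shape                # (3, 5)
weights_shape = df['WeightA'].shape  # (3,)

try:
    np.ma.average(df.to_numpy(), weights=df['WeightA'].to_numpy())
    error_message = None
except TypeError as exc:
    # 2-D data with 1-D weights and no axis given is rejected
    error_message = str(exc)

print(data_shape, weights_shape)
print(error_message)
```

Because the call happens while the dict is built, `agg` never receives a function at all. The nested version above avoids this by returning `f1`, which `agg` later calls with one column of one group at a time.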

Verifying the solution:

wmA = lambda x: np.ma.average(x, weights=df.loc[x.index, "WeightA"])
wmB = lambda x: np.ma.average(x, weights=df.loc[x.index, "WeightB"])
f = {'HeightA':wmA,'HeightB':wmB}

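# Select several columns with a list (double brackets); tuple-style selection was removed in pandas 1.0
df2 = df.groupby(['ID'])[['HeightA','HeightB']].agg(f)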
print (df2)
     HeightA   HeightB
ID                    
1   2.333333  4.857143
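Since the question mentions many height/weight pairs, the dict itself could also be generated instead of typed out. The following is only a sketch under an assumption that is not stated in the question: every `Height<suffix>` column has a matching `Weight<suffix>` column. Passing the frame in as the `frame` parameter is likewise an illustrative variation on the answer's `givewm`, which reads the global `df`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1],
                   'HeightA': [1, 2, 3], 'WeightA': [10, 20, 30],
                   'HeightB': [2, 4, 6], 'WeightB': [1, 2, 4]})

def givewm(frame, weightcolumn):
    """Return a function that agg can call on one column of one group."""
    def f1(x):
        # x.index identifies the rows of the current group
        return np.ma.average(x, weights=frame.loc[x.index, weightcolumn])
    return f1

# Assumed naming convention: 'HeightA' pairs with 'WeightA', and so on
height_cols = [c for c in df.columns if c.startswith('Height')]
f = {h: givewm(df, 'Weight' + h[len('Height'):]) for h in height_cols}

df2 = df.groupby('ID').agg(f)
print(df2)
```

If the real column names follow a different pattern, only the line that derives the weight column name from the height column name would need to change.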