Im使用熊猫计算许多列的加权平均值。在某些情况下,重量可能总计为零,因此我使用np.ma.average:
import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict(dict([('ID', [1, 1, 1]),('HeightA', [1, 2, 3]), ('WeightA', [0, 0, 0]),('HeightB', [2, 4, 6]), ('WeightB', [1, 2, 4])]))
>>df
ID HeightA WeightA HeightB WeightB
0 1 1 0 2 1
1 1 2 0 4 2
2 1 3 0 6 4
wmA = lambda x: np.ma.average(x, weights=df.loc[x.index, "WeightA"])
wmB = lambda x: np.ma.average(x, weights=df.loc[x.index, "WeightB"])
f = {'HeightA':wmA,'HeightB':wmB}
df2 = df.groupby(['ID'])['HeightA','HeightB'].agg(f)
这行得通,但是我有很多列的身高和体重,所以我不想为每个列写一个lambda函数,所以我尝试:
def givewm(data,weightcolumn):
return np.ma.average(data, weights=data.loc[data.index, weightcolumn])
f = {'HeightA':givewm(df,'WeightA'),'HeightB':givewm(df,'WeightB')}
df2 = df.groupby(['ID'])['HeightA','HeightB'].agg(f)
给出错误:builtins.TypeError:当a的形状和权重不同时,必须指定轴。
我该如何编写一个以权重列名称作为输入返回加权均值的函数?
答案 0 :(得分:1)
使用双重嵌套函数,来自github的解决方案:
df = (pd.DataFrame.from_dict(dict([('ID', [1, 1, 1]),
('HeightA', [1, 2, 3]),
('WeightA', [10, 20, 30]),
('HeightB', [2, 4, 6]),
('WeightB', [1, 2, 4])])))
print (df)
ID HeightA WeightA HeightB WeightB
0 1 1 10 2 1
1 1 2 20 4 2
2 1 3 30 6 4
def givewm(weightcolumn):
def f1(x):
return np.ma.average(x, weights=df.loc[x.index, weightcolumn])
return f1
f = {'HeightA':givewm('WeightA'),'HeightB':givewm('WeightB')}
df2 = df.groupby('ID').agg(f)
print (df2)
HeightA HeightB
ID
1 2.333333 4.857143
验证解决方案:
wmA = lambda x: np.ma.average(x, weights=df.loc[x.index, "WeightA"])
wmB = lambda x: np.ma.average(x, weights=df.loc[x.index, "WeightB"])
f = {'HeightA':wmA,'HeightB':wmB}
df2 = df.groupby(['ID'])['HeightA','HeightB'].agg(f)
print (df2)
HeightA HeightB
ID
1 2.333333 4.857143