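I want to use the .agg() function in pandas to compute, per group, the mean of one column and the weighted mean of another column in a dataset. I know of a few solutions, but they are not very concise.
One solution is posted here (pandas and groupby: how to calculate weighted averages within an agg), but it still seems inflexible because the weights column is hard-coded in the lambda function definition. I am looking for syntax closer to this: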
(
    df
    .groupby(['group'])
    .agg(avg_x=('x', 'mean'),
         wt_avg_y=('y', 'weighted_mean', weights='weight'))
)
Here is a fully working example, in which the code seems unnecessarily complicated:
import pandas as pd
import numpy as np

# sample dataset
df = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'x': [1, 2, 3, 4],
    'y': [5, 6, 7, 8],
    'weights': [0.75, 0.25, 0.75, 0.25]
})
df
#>>>   group  x  y  weights
#>>> 0     a  1  5     0.75
#>>> 1     a  2  6     0.25
#>>> 2     b  3  7     0.75
#>>> 3     b  4  8     0.25

# aggregation logic
summary = pd.concat(
    [
        df.groupby(['group']).x.mean(),
        df.groupby(['group']).apply(lambda x: np.average(x['y'], weights=x['weights']))
    ], axis=1
)
# manipulation to format the output of the aggregation
summary = summary.reset_index().rename(columns={'x': 'avg_x', 0: 'wt_avg_y'})

# final output
summary
#>>>   group  avg_x  wt_avg_y
#>>> 0     a   1.50      5.25
#>>> 1     b   3.50      7.25
Answer 0 (score: 0)
How about this?
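grouped = df.groupby('group')

def wavg(group):
    # add the plain mean of x and the weighted mean of y as new columns
    group['mean_x'] = group['x'].mean()
    group['wavg_y'] = np.average(group['y'], weights=group.loc[:, "weights"])
    # the whole group is returned, so the result has one row per input row
    return group

grouped.apply(wavg)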
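Because wavg returns the whole group, its result repeats the two statistics on every row instead of giving one row per group. The following is a minimal sketch of one way to collapse that result, assuming the sample df from the question. The names per_row and summary, the use of assign instead of in-place assignment, and the group_keys=False argument are choices made for this illustration and are not part of the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'x': [1, 2, 3, 4],
    'y': [5, 6, 7, 8],
    'weights': [0.75, 0.25, 0.75, 0.25],
})

def wavg(group):
    # assign returns a new frame, so the group handed in by pandas is not modified
    return group.assign(
        mean_x=group['x'].mean(),
        wavg_y=np.average(group['y'], weights=group['weights']),
    )

# select the value columns explicitly so the function never sees the grouping column;
# group_keys=False keeps the original row index on the result
per_row = df.groupby('group', group_keys=False)[['x', 'y', 'weights']].apply(wavg)

# per_row still has one row per input row; keep the first row of each group
summary = (
    per_row.join(df['group'])
    .drop_duplicates('group')[['group', 'mean_x', 'wavg_y']]
    .reset_index(drop=True)
)
print(summary)
```

If a per-row broadcast is what is actually wanted (for example to compare each row with its group average), per_row can be used directly and the last step skipped.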
Answer 1 (score: 0)
Since the weights in each group sum to 1, you can assign a new column and group by as usual:
(df.assign(wt_avg_y=df['y'] * df['weights'])
   .groupby('group')
   .agg({'x': 'mean', 'wt_avg_y': 'sum', 'weights': 'sum'})
   .assign(wt_avg_y=lambda x: x['wt_avg_y'] / x['weights'])
)
Output:
         x  wt_avg_y  weights
group
a      1.5      5.25      1.0
b      3.5      7.25      1.0
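The pipeline above computes sum(w * y) / sum(w) per group, which is the definition of a weighted mean, so the final division should also make it valid when the weights do not sum to 1. The sketch below checks this against np.average on a hypothetical variant of the sample data with unnormalized weights (3 and 1 instead of 0.75 and 0.25). The names manual and reference are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'y': [5, 6, 7, 8],
    'weights': [3, 1, 3, 1],  # hypothetical weights that do not sum to 1
})

# per-group sum of w * y and sum of w, then their ratio
sums = (
    df.assign(wy=df['y'] * df['weights'])
    .groupby('group')[['wy', 'weights']]
    .sum()
)
manual = sums['wy'] / sums['weights']

# reference values computed group by group with np.average
reference = pd.Series({
    key: np.average(sub['y'], weights=sub['weights'])
    for key, sub in df.groupby('group')
})

print(manual)
print(np.allclose(manual.to_numpy(), reference.to_numpy()))
```

The same ratio reproduces the 5.25 and 7.25 from the question because 3:1 and 0.75:0.25 are the same proportions.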
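Answer 2 (score: 0)
Using the .apply() method on the whole DataFrame is the simplest solution I could arrive at that does not hard-code the column names in the function definition.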
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'x': [1, 2, 3, 4],
    'y': [5, 6, 7, 8],
    'weights': [0.75, 0.25, 0.75, 0.25]
})

summary = (
    df
    .groupby(['group'])
    .apply(
        lambda x: pd.Series([
            np.mean(x['x']),
            np.average(x['y'], weights=x['weights'])
        ], index=['avg_x', 'wt_avg_y'])
    )
    .reset_index()
)

# final output
summary
#>>>   group  avg_x  wt_avg_y
#>>> 0     a   1.50      5.25
#>>> 1     b   3.50      7.25
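For something closer to the named-aggregation syntax the question asks for, one option is a small factory that takes the weights column as a parameter and returns a function usable inside .agg(). This is only a sketch: weighted_mean is a hypothetical helper written for this illustration, not a pandas function, and it assumes the DataFrame index is unique, since it looks up each group's weights by index label.

```python
import numpy as np
import pandas as pd

def weighted_mean(frame, weights_col):
    """Hypothetical helper: build an aggregator weighted by frame[weights_col]."""
    def _agg(values):
        # values is one group's Series; its index labels select the matching weights
        # (assumes frame has a unique index)
        return np.average(values, weights=frame.loc[values.index, weights_col])
    return _agg

df = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'x': [1, 2, 3, 4],
    'y': [5, 6, 7, 8],
    'weights': [0.75, 0.25, 0.75, 0.25],
})

summary = (
    df
    .groupby('group')
    .agg(
        avg_x=('x', 'mean'),
        wt_avg_y=('y', weighted_mean(df, 'weights')),
    )
    .reset_index()
)
print(summary)
```

The weights column name now appears only at the call site, so the same helper can be reused with a different column or a different frame.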
Answer 3 (score: 0)
Try:
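# normalize the weights within each group (this overwrites df["weights"])
df["weights"] = df["weights"].div(df.join(df.groupby("group")["weights"].sum(), on="group", rsuffix="_2").iloc[:, -1])
# weight y in place; the per-group sum of y is then the weighted mean
df["y"] = df["y"].mul(df["weights"])
res = df.groupby("group", as_index=False).agg({"x": "mean", "y": "sum"})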
Output:
  group    x     y
0     a  1.5  5.25
1     b  3.5  7.25
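The snippet above overwrites the weights and y columns of df. A possible non-mutating variant, sketched here under the assumption that the original df should stay intact, normalizes the weights with groupby(...).transform('sum') and uses assign to work on a copy. The name w_norm is illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'x': [1, 2, 3, 4],
    'y': [5, 6, 7, 8],
    'weights': [0.75, 0.25, 0.75, 0.25],
})

# each row's weight divided by the total weight of its group
w_norm = df['weights'] / df.groupby('group')['weights'].transform('sum')

# assign returns a new frame, so df itself keeps its original y values
res = (
    df.assign(y=df['y'] * w_norm)
    .groupby('group', as_index=False)
    .agg({'x': 'mean', 'y': 'sum'})
)
print(res)
```

As in the answer, the resulting y column holds the weighted mean and x holds the plain mean.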