I have a dataframe whose "Medal" column contains Gold, Silver, or Bronze. There is also a Height column and a Year column. It looks like this:
Medal Year Height 0
Bronze 1896 -2.352063 1
Bronze 1896 -0.435173 1
Bronze 1896 0.220606 1
Bronze 1896 0.304680 1
Bronze 1896 0.607347 1
Bronze 1900 -1.847618 1
Bronze 1900 -1.410432 1
Bronze 1900 -0.334284 1
Bronze 1900 -0.182950 1
Bronze 1900 -0.031617 3
Bronze 1900 0.136532 2
Silver 2016 1.078162 9
Silver 2016 1.179051 2
Silver 2016 1.279940 1
Silver 2016 1.380829 4
Silver 2016 1.481718 3
Silver 2016 1.582607 3
Silver 2016 1.683495 8
Silver 2016 1.784384 4
Silver 2016 1.885273 3
Silver 2016 2.087051 1
Silver 2016 2.187940 1
Silver 2016 2.288829 1
Silver 2016 2.591496 1
Silver 2016 2.692385 1
Silver 2016 2.995052 1
What I want is simple:
Medal Year Height
Bronze 1896 [Mean of heights having Bronze and 1896]
Bronze 1900 [Mean of heights having Bronze and 1900]
Silver 2016 [Mean of heights having Silver and 2016]
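The [0] column holds the frequency of each height, so before computing the mean we have to multiply each height by it.
I tried using np.einsum, but could not get it to work for my case. There are some similar questions, but none of the answers fit my requirements.
Any hints would be helpful.
PS: I have already normalized the heights column, which is why some values are negative.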
Answer 0: (score: 4)
One way is to create 2 series via pandas groupby and divide one by the other:
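group_cols = ['Medal', 'Year']
# total number of observations per group (column 0 holds the frequency)
observations = df.groupby(group_cols)[0].sum()
# frequency-weighted sum of heights per group
total_height = df.assign(total=df['Height']*df[0]).groupby(group_cols)['total'].sum()
# weighted mean = weighted sum / number of observations
res = total_height / observations
print(res.reset_index())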
    Medal  Year         0
0  Bronze  1896 -0.330921
1  Bronze  1900 -0.399675
2  Silver  2016  1.608415
Much more concise (thanks to @piRSquared):
# give the frequency column a readable name
df = df.rename(columns={0: 'Count'})
res = df.assign(Total=df['Height']*df['Count'])\
        .groupby(['Medal', 'Year']).sum()\
        .eval('Total / Count')\
        .rename('Mean').reset_index()
print(res)
    Medal  Year      Mean
0  Bronze  1896 -0.330921
1  Bronze  1900 -0.399675
2  Silver  2016  1.608415
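What this answer computes is a frequency-weighted mean, so the result can be cross-checked with np.average and its weights argument. The sketch below is not part of the answer; it assumes the frequency column has already been renamed to 'Count' and rebuilds only the two Bronze groups from the question as sample data.

```python
import numpy as np
import pandas as pd

# Sample data: the Bronze rows from the question, with column 0 renamed to 'Count'
df = pd.DataFrame({
    'Medal': ['Bronze'] * 11,
    'Year': [1896] * 5 + [1900] * 6,
    'Height': [-2.352063, -0.435173, 0.220606, 0.304680, 0.607347,
               -1.847618, -1.410432, -0.334284, -0.182950, -0.031617, 0.136532],
    'Count': [1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 2],
})

# For each (Medal, Year) group, average Height using Count as the weights
res = (
    df.groupby(['Medal', 'Year'])[['Height', 'Count']]
      .apply(lambda g: np.average(g['Height'], weights=g['Count']))
      .rename('Mean')
      .reset_index()
)
print(res)
```

This is usually slower than the vectorized sum-and-divide version on large frames, because apply calls the lambda once per group; it is mainly useful as a readable sanity check.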
Answer 1: (score: 3)
Use pandas.Index.repeat to repeat the index according to column '0', reindex with loc, and then groupby.
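The repeat-then-group idea can be sketched as follows. This is a minimal reconstruction and not necessarily the answerer's exact code: it uses only the Bronze rows from the question as sample data and assumes the counts in column 0 are non-negative integers, which Index.repeat requires.

```python
import pandas as pd

# Sample data: the Bronze rows from the question; column 0 holds the counts
df = pd.DataFrame({
    'Medal': ['Bronze'] * 11,
    'Year': [1896] * 5 + [1900] * 6,
    'Height': [-2.352063, -0.435173, 0.220606, 0.304680, 0.607347,
               -1.847618, -1.410432, -0.334284, -0.182950, -0.031617, 0.136532],
    0: [1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 2],
})

# Repeat each index label as many times as its count, then select those rows
# with loc so every observation appears once per occurrence
expanded = df.loc[df.index.repeat(df[0])]

# A plain mean over the expanded rows equals the frequency-weighted mean
res = expanded.groupby(['Medal', 'Year'])['Height'].mean().reset_index()
print(res)
```

Expanding the rows is easy to read, but it materializes one row per observation, so with large counts the sum-and-divide approach is likely to use far less memory.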