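How can we replace multiple rows with the mean of those rows, for every change in the other columns?

Time: 2018-11-12 13:57:32

Tags: python pandas numpy dataframe pandas-groupby

I have a dataframe with a "Medal" column that contains Gold, Silver, or Bronze. There is also a Height column and a Year column. It looks like this: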

Medal   Year    Height      0
Bronze  1896    -2.352063   1
Bronze  1896    -0.435173   1
Bronze  1896    0.220606    1
Bronze  1896    0.304680    1
Bronze  1896    0.607347    1
Bronze  1900    -1.847618   1
Bronze  1900    -1.410432   1
Bronze  1900    -0.334284   1
Bronze  1900    -0.182950   1
Bronze  1900    -0.031617   3
Bronze  1900    0.136532    2
Silver  2016    1.078162    9
Silver  2016    1.179051    2
Silver  2016    1.279940    1
Silver  2016    1.380829    4
Silver  2016    1.481718    3
Silver  2016    1.582607    3
Silver  2016    1.683495    8
Silver  2016    1.784384    4
Silver  2016    1.885273    3
Silver  2016    2.087051    1
Silver  2016    2.187940    1
Silver  2016    2.288829    1
Silver  2016    2.591496    1
Silver  2016    2.692385    1
Silver  2016    2.995052    1

What I want is simple:

Medal   Year    Height      
Bronze  1896    [Mean of heights having Bronze and 1896] 
Bronze  1900    [Mean of heights having Bronze and 1900]
Silver  2016    [Mean of heights having Silver and 2016]

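The [0] column represents a frequency, so each height has to be multiplied by it before the mean is computed (that is, a frequency-weighted mean).

I tried using np.einsum but could not get it to work for my case. There are some similar questions, but none of the answers fit what I need. Any hints would be helpful.

PS: I have already normalized the Height column, which is why it contains negative values.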

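2 answers:

Answer 0: (score: 4)

One way is to build two series with pandas groupby (the total frequency per group and the total frequency-weighted height per group) and divide the second by the first: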

group_cols = ['Medal', 'Year']
observations = df.groupby(group_cols)[0].sum()
total_height = df.assign(total=df['Height']*df[0]).groupby(group_cols)['total'].sum()

res = total_height / observations

print(res.reset_index())

    Medal  Year         0
0  Bronze  1896 -0.330921
1  Bronze  1900 -0.399675
2  Silver  2016  1.608415

Much more concisely (thanks to @piRSquared):

df = df.rename(columns={0: 'Count'})

res = df.assign(Total=df['Height']*df['Count'])\
        .groupby(['Medal', 'Year']).sum()\
        .eval('Total / Count')\
        .rename('Mean').reset_index()

print(res)

    Medal  Year      Mean
0  Bronze  1896 -0.330921
1  Bronze  1900 -0.399675
2  Silver  2016  1.608415
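
As a cross-check on the results above, the same frequency-weighted mean can be sketched with np.average and its weights argument. This is a different technique from the two snippets in this answer, not part of the original answer. It assumes the frequency column has already been renamed to 'Count', and it uses only the Bronze rows from the question as sample data; the helper name weighted_mean is hypothetical.

```python
import numpy as np
import pandas as pd

# Sample data: the Bronze 1896 and Bronze 1900 rows from the question,
# with the frequency column already renamed from 0 to 'Count'.
df = pd.DataFrame({
    'Medal': ['Bronze'] * 11,
    'Year': [1896] * 5 + [1900] * 6,
    'Height': [-2.352063, -0.435173, 0.220606, 0.304680, 0.607347,
               -1.847618, -1.410432, -0.334284, -0.182950, -0.031617,
               0.136532],
    'Count': [1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 2],
})


def weighted_mean(group):
    # Hypothetical helper: mean of Height weighted by Count within one group.
    return np.average(group['Height'], weights=group['Count'])


# Select only the columns the helper needs, then apply it per (Medal, Year).
res = (df.groupby(['Medal', 'Year'])[['Height', 'Count']]
         .apply(weighted_mean)
         .rename('Mean')
         .reset_index())

print(res)
```

For the two Bronze groups this should agree with the Mean values printed above. The apply-based version is easier to read but is generally slower than the vectorized sum-and-divide approach on large frames.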

Answer 1: (score: 3)

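  • Use pandas.Index.repeat to repeat each index label as many times as its value in column 0
  • Use loc to reindex the dataframe with the repeated index
  • Then use groupby and take the mean of Height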
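
The steps above can be sketched as follows. This is a minimal reconstruction under stated assumptions, not necessarily the answerer's exact code: the frequency column keeps the integer label 0 as in the question, the sample data is limited to the Bronze rows, and the variable name expanded is introduced here for illustration.

```python
import pandas as pd

# Sample data: the Bronze rows from the question; the frequency column
# keeps its original integer label 0.
df = pd.DataFrame({
    'Medal': ['Bronze'] * 11,
    'Year': [1896] * 5 + [1900] * 6,
    'Height': [-2.352063, -0.435173, 0.220606, 0.304680, 0.607347,
               -1.847618, -1.410432, -0.334284, -0.182950, -0.031617,
               0.136532],
    0: [1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 2],
})

# Repeat each index label according to its frequency, then use loc to
# pull the rows, so a row with frequency 3 appears three times.
expanded = df.loc[df.index.repeat(df[0])]

# A plain mean over the expanded rows equals the frequency-weighted mean.
res = expanded.groupby(['Medal', 'Year'], as_index=False)['Height'].mean()

print(res)
```

Expanding the rows keeps the code short, but the intermediate frame grows with the sum of the frequencies, so the sum-and-divide approach may be preferable when the counts are large.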