Groupby to find the highest value within a subset

Date: 2019-02-23 20:02:47

Tags: python pandas

My data looks like this:

In [16]: game_df.head(9)
Out[16]: 
   team_id  game_id game_date  w  l  wins  losses  winning%  
0        1        1  11/16/18  1  0    20      10  0.666667
1        1        3  11/18/18  0  1    20      11  0.645161
2        1        6  11/21/18  0  1    20      12  0.625000
3        2        4  11/19/18  1  0    16      14  0.533333
4        2        8  11/23/18  1  0    17      14  0.548387
5        2        9  11/24/18  0  1    17      15  0.531250
6        3        2  11/17/18  0  1    24       8  0.750000
7        3        5  11/20/18  1  0    25       8  0.757576
8        3        7  11/22/18  1  0    26       8  0.764706

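I need to take the winning% column and, for each row, compare it with the latest winning% of every team_id as of that row's date (that date included), using only the highest of those values. The new column is the row's own winning% minus that highest value.

So I would like to get back something like this: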

In [16]: game_df.head(9)
Out[16]: 
   team_id  game_id game_date  w  l  wins  losses  winning%    w%_bac
0        1        1  11/16/18  1  0    20      10  0.666667        --
1        1        3  11/18/18  0  1    20      11  0.645161  -0.10483
2        1        6  11/21/18  0  1    20      12  0.625000  -0.13257
3        2        4  11/19/18  1  0    16      14  0.533333  -0.21667
4        2        8  11/23/18  1  0    17      14  0.548387  -0.21632
5        2        9  11/24/18  0  1    17      15  0.531250  -0.23346
6        3        2  11/17/18  0  1    24       8  0.750000   0.00000
7        3        5  11/20/18  1  0    25       8  0.757576   0.00000
8        3        7  11/22/18  1  0    26       8  0.764706   0.00000

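For example, in game 9 on 11/24/18, team 2 lost and its winning percentage dropped from 0.548387 to 0.531250, so it fell further behind the other two teams. At that moment those teams stood at 0.625000 (team 1) and 0.764706 (team 3). The %back for team 2 would therefore be 0.531250 - 0.764706 = -0.233456.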
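To make the arithmetic concrete, here is a minimal check that uses only the three values quoted above. The names `latest`, `leader` and `pct_back_team2` are just illustrative.

```python
# Latest winning% of each team as of 11/24/18, taken from the table above.
latest = {1: 0.625000, 2: 0.531250, 3: 0.764706}

leader = max(latest.values())        # best winning% at that moment (team 3)
pct_back_team2 = latest[2] - leader  # negative because team 2 trails the leader

print(round(pct_back_team2, 6))
```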

Finally, I need to compute the standing of each team_id at that moment; for example, on 11/24/18 the team_id order would be 3, 1, 2.

Thanks

1 Answer:

Answer 0: (score: 0)

import pandas as pd

# df starts from the question's game_df; sort chronologically, parsing the
# MM/DD/YY strings so the order does not depend on string sorting
df = game_df.sort_values(by='game_date', key=lambda s: pd.to_datetime(s, format='%m/%d/%y'))

# add a helper column per team with its latest winning% as of each row's date,
# filling NaN forward (but not back)
helper_cols = []
for team_id in df['team_id'].unique():
    col = str(team_id) + 'win_%'
    team_pct = df.loc[df.team_id == team_id].set_index('game_date')['winning%']
    df[col] = team_pct.reindex(df['game_date']).ffill().values
    helper_cols.append(col)
# teams that have not played yet are still NaN; treat them as 0
df[helper_cols] = df[helper_cols].fillna(0)
# greatest deficit for each row: own winning% minus the best latest winning%
df['w%_bac'] = df['winning%'] - df[helper_cols].max(axis=1)
# drop helper columns
df = df.drop(columns=helper_cols)

df

   team_id  game_id game_date  w  l  wins  losses  winning%  w%_bac
0        1        1  11/16/18  1  0    20      10     0.667   0.000
6        3        2  11/17/18  0  1    24       8     0.750   0.000
1        1        3  11/18/18  0  1    20      11     0.645  -0.105
3        2        4  11/19/18  1  0    16      14     0.533  -0.217
7        3        5  11/20/18  1  0    25       8     0.758   0.000
2        1        6  11/21/18  0  1    20      12     0.625  -0.133
8        3        7  11/22/18  1  0    26       8     0.765   0.000
4        2        8  11/23/18  1  0    17      14     0.548  -0.216
5        2        9  11/24/18  0  1    17      15     0.531  -0.233
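
The ranking part of the question is not covered above. As a rough sketch rather than a definitive solution, one could pivot the data into one column per team, forward-fill it, and read both the deficit and the standings from that table. This assumes each team plays at most one game per date; the names `latest`, `leader`, `standings` and `order_on_1124` are illustrative and do not appear in the original post.

```python
import pandas as pd

# Sample data copied from the question (only the columns needed here).
game_df = pd.DataFrame({
    "team_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "game_id": [1, 3, 6, 4, 8, 9, 2, 5, 7],
    "game_date": ["11/16/18", "11/18/18", "11/21/18", "11/19/18", "11/23/18",
                  "11/24/18", "11/17/18", "11/20/18", "11/22/18"],
    "winning%": [0.666667, 0.645161, 0.625000, 0.533333, 0.548387,
                 0.531250, 0.750000, 0.757576, 0.764706],
})

# Parse the dates so ordering is chronological rather than lexical.
dates = pd.to_datetime(game_df["game_date"], format="%m/%d/%y")

# One column per team, one row per date; forward-fill so every date carries
# each team's most recent winning% (NaN until a team's first game).
latest = (
    game_df.assign(date=dates)
    .pivot(index="date", columns="team_id", values="winning%")
    .sort_index()
    .ffill()
)

# Leader's winning% on each date (NaN is skipped), mapped back onto the rows.
leader = dates.map(latest.max(axis=1))
game_df["w%_bac"] = game_df["winning%"] - leader

# Standings per date: rank 1 is the highest winning% on that date.
standings = latest.rank(axis=1, ascending=False, method="min")
order_on_1124 = (
    standings.loc[pd.Timestamp("2018-11-24")].sort_values().index.tolist()
)
print(order_on_1124)
```

Because `latest` is indexed by date, the standings for any other date can be read with `standings.loc[...]` in the same way. Teams that have not played yet stay NaN and are left unranked.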