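I am building a dataset of hockey games and need to determine whether a team won or lost based on the 'Game_id' and 'Goals' columns. Each game has its own ID and spans two rows, so 1000 games are stored in 2000 rows.
My dataframe looks like this: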
Team Home/Away Goals Game_id
CAL Home 7 2017020001
PHY Away 4 2017020001
CAP Home 7 2017020002
WILD Away 4 2017020002
I need a 'Won/Lost' column based on the goals scored within each 'Game_id'. I am struggling to write a loop that does this for me.
The result I am looking for is:
Team Home/Away Goals Game_id Won/Lost
CAL Home 7 2017020001 Won
PHY Away 4 2017020001 Lost
CAP Home 7 2017020002 Won
WILD Away 4 2017020002 Lost
Answer 0 (score: 3)
Given
>>> df
Team Home/Away Goals Game_id
0 CAL Home 7 2017020001
1 PHY Away 4 2017020001
2 CAP Home 7 2017020002
3 WILD Away 4 2017020002
4 WILD Away 1 2017020003
5 CAP Home 1 2017020003
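I would write the following function:
def win_loss_draw(group):
    # True for the row(s) holding the highest goal count in this game
    is_max = group == group.max()
    if is_max.all():
        # Every row ties for the maximum, so the game is a draw
        return pd.Series('Draw', index=group.index)
    return is_max.map({True: 'Won', False: 'Lost'})
...and apply it like this (group_keys=False keeps the original row index, so the result aligns with df):
>>> df['Won/Lost'] = df.groupby('Game_id', group_keys=False)['Goals'].apply(win_loss_draw)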
>>> df
Team Home/Away Goals Game_id Won/Lost
0 CAL Home 7 2017020001 Won
1 PHY Away 4 2017020001 Lost
2 CAP Home 7 2017020002 Won
3 WILD Away 4 2017020002 Lost
4 WILD Away 1 2017020003 Draw
5 CAP Home 1 2017020003 Draw
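For readers who want to run the approach above end to end, here is a minimal self-contained sketch. It rebuilds the six-row sample frame by hand and assumes a reasonably recent pandas; the parameter name `goals` and the variable `labels` are chosen here for illustration only.

```python
import pandas as pd

# Sample data copied from the answer above (three games, the last one tied).
df = pd.DataFrame({
    "Team": ["CAL", "PHY", "CAP", "WILD", "WILD", "CAP"],
    "Home/Away": ["Home", "Away", "Home", "Away", "Away", "Home"],
    "Goals": [7, 4, 7, 4, 1, 1],
    "Game_id": [2017020001, 2017020001, 2017020002,
                2017020002, 2017020003, 2017020003],
})


def win_loss_draw(goals):
    """Label each row of one game's Goals as 'Won', 'Lost' or 'Draw'."""
    is_max = goals == goals.max()
    if is_max.all():
        # All rows share the maximum, so nobody won.
        return pd.Series("Draw", index=goals.index)
    return is_max.map({True: "Won", False: "Lost"})


# group_keys=False keeps the original row index on the result,
# so it can be assigned straight back to the frame.
labels = df.groupby("Game_id", group_keys=False)["Goals"].apply(win_loss_draw)
df["Won/Lost"] = labels
print(df)
```

Returning a new Series for the draw case avoids writing strings into a boolean Series, which newer pandas versions warn about.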
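Considering that a hockey game can only end in a draw in regulation time, I am not accounting for draws; my data includes overtime, so there are only wins and losses.
In that case, it is enough to run:
df['Won/Lost'] = df.groupby('Game_id', group_keys=False)['Goals'].apply(lambda g: (g == g.max()).map({True: 'Won', False: 'Lost'}))
(This is version 1.)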
~Edit~
Performance improvements!
Version 2:
is_winner = df.groupby('Game_id')['Goals'].transform('max') == df['Goals']
df['Won/Lost'] = is_winner.map({True: 'Won', False: 'Lost'})
Version 3:
is_winner = df.groupby('Game_id')['Goals'].transform('max') == df['Goals']
df['Won/Lost'] = np.where(is_winner.values, 'Won', 'Lost')
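Versions 2 and 3 label every row of a tied game as 'Won'. If draws can occur in the data, one possible vectorized variant compares each game's maximum and minimum and uses `np.select`. This is a sketch that extends the idea above rather than part of the original answer; the names `game_max`, `game_min` and `is_draw` are illustrative.

```python
import numpy as np
import pandas as pd

# Same sample data as above; the third game is tied 1-1.
df = pd.DataFrame({
    "Team": ["CAL", "PHY", "CAP", "WILD", "WILD", "CAP"],
    "Home/Away": ["Home", "Away", "Home", "Away", "Away", "Home"],
    "Goals": [7, 4, 7, 4, 1, 1],
    "Game_id": [2017020001, 2017020001, 2017020002,
                2017020002, 2017020003, 2017020003],
})

goals_by_game = df.groupby("Game_id")["Goals"]
game_max = goals_by_game.transform("max")  # per-row maximum of its game
game_min = goals_by_game.transform("min")  # per-row minimum of its game

is_draw = (game_max == game_min).to_numpy()      # both teams scored the same
is_winner = (df["Goals"] == game_max).to_numpy()

# Conditions are checked in order, so 'Draw' takes precedence over 'Won'.
df["Won/Lost"] = np.select([is_draw, is_winner], ["Draw", "Won"], default="Lost")
print(df)
```

Because `transform` returns a result aligned with the original rows, no per-group Python function is called.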
Timings:
# Setup
>>> df = pd.concat([df]*1000, ignore_index=True)
>>> df['Game_id'] = np.arange(len(df)//2).repeat(2)
>>>
>>> df
Team Home/Away Goals Game_id
0 CAL Home 7 0
1 PHY Away 4 0
2 CAP Home 7 1
3 WILD Away 4 1
4 CAL Home 7 2
... ... ... ... ...
3995 WILD Away 4 1997
3996 CAL Home 7 1998
3997 PHY Away 4 1998
3998 CAP Home 7 1999
3999 WILD Away 4 1999
# Timings (i5-6200U CPU @ 2.30GHz, only relative times are important though)
>>> %timeit df.groupby('Game_id')['Goals'].apply(lambda g: (g == g.max()).map({True: 'Won', False: 'Lost'})) # Version 1
1.73 s ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit (df.groupby('Game_id')['Goals'].transform('max') == df['Goals']).map({True: 'Won', False: 'Lost'}) # Version 2
2.38 ms ± 37.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit np.where((df.groupby('Game_id')['Goals'].transform('max') == df['Goals']).values, 'Won', 'Lost') # Version 3
1.53 ms ± 6.19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
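The `%timeit` magic above only works in IPython or Jupyter. Outside those environments, a rough comparison can be sketched with the standard-library `timeit` module. The helper names `version_2` and `version_3` and the repeat count of 20 are assumptions made for this example, and absolute numbers will differ between machines.

```python
import timeit

import numpy as np
import pandas as pd

# Rebuild a 4000-row frame shaped like the one in the setup above.
base = pd.DataFrame({
    "Team": ["CAL", "PHY", "CAP", "WILD"],
    "Home/Away": ["Home", "Away", "Home", "Away"],
    "Goals": [7, 4, 7, 4],
})
df = pd.concat([base] * 1000, ignore_index=True)
df["Game_id"] = np.arange(len(df) // 2).repeat(2)


def version_2(frame):
    is_winner = frame.groupby("Game_id")["Goals"].transform("max") == frame["Goals"]
    return is_winner.map({True: "Won", False: "Lost"})


def version_3(frame):
    is_winner = frame.groupby("Game_id")["Goals"].transform("max") == frame["Goals"]
    return np.where(is_winner.to_numpy(), "Won", "Lost")


# Sanity check: both versions must produce the same labels.
assert list(version_2(df)) == list(version_3(df))

runs = 20  # hypothetical repeat count; raise it for steadier numbers
for name, func in [("Version 2", version_2), ("Version 3", version_3)]:
    seconds = timeit.timeit(lambda: func(df), number=runs) / runs
    print(f"{name}: {seconds * 1e3:.2f} ms per call")
```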