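Below is the data frame I am working with.
These are chess games. I am trying to group by game and then apply a function to each game based on the number of moves played in that game.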
game_id move_number colour avg_centi
0 03gDhPWr 1 white NaN
1 03gDhPWr 2 black 37.0
2 03gDhPWr 3 white 61.0
3 03gDhPWr 4 black -5.0
4 03gDhPWr 5 white 26.0
5 03gDhPWr 6 black 31.0
6 03gDhPWr 7 white -2.0
... ... ... ... ...
110091 zzaiRa7s 34 black NaN
110092 zzaiRa7s 35 white NaN
110093 zzaiRa7s 36 black NaN
110094 zzaiRa7s 37 white NaN
110095 zzaiRa7s 38 black NaN
110096 zzaiRa7s 39 white NaN
110097 zzaiRa7s 40 black NaN
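Specifically, I am using pd.cut to create a new column, game_phase (shown as phase in the output below), which records whether a given move was played in the opening, middlegame, or endgame.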
game_id move_number colour avg_centi phase
0 03gDhPWr 1 white NaN opening
1 03gDhPWr 2 black 37.0 opening
2 03gDhPWr 3 white 61.0 opening
3 03gDhPWr 4 black -5.0 opening
4 03gDhPWr 5 white 26.0 opening
5 03gDhPWr 6 black 31.0 opening
6 03gDhPWr 7 white -2.0 opening
.. ... ... ... ... ...
54 03gDhPWr 55 white 58.0 endgame
55 03gDhPWr 56 black 26.0 endgame
56 03gDhPWr 57 white 116.0 endgame
57 03gDhPWr 58 black 2000.0 endgame
58 03gDhPWr 59 white 0.0 endgame
59 03gDhPWr 60 black 0.0 endgame
60 03gDhPWr 61 white NaN endgame
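I am using the following code to achieve this. Note that each game has to be split into opening, middlegame, and endgame bins based on the total number of moves in that particular game.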
for game_id, group in df.groupby('game_id'):
    bins = (0, round(group['move_number'].max() * 1/3), round(group['move_number'].max() * 2/3),
            group['move_number'].max())
    phases = ["opening", "middlegame", "endgame"]
    try:
        group.loc[:, 'phase'] = pd.cut(group['move_number'], bins, labels=phases)
    except:
        group.loc[:, 'phase'] = None
    print(group)
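The problem is that looping over every one of several thousand games takes forever.
I think there should be a faster way to compute this than using a for loop to walk through the groups and run the calculation on them one at a time.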
Answer 0 (score: 1)
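Here is an approach I came up with, shown on a simple example.
In summary, it has 3 steps:
1. Compute the max move number for each game.
2. Merge the max move number back onto every row.
3. Assign the phase from move number / max move number.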
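The three steps above can be sketched directly on the question's column names (game_id, move_number). This is only an illustration on made-up toy data, and it uses groupby(...).transform("max") in place of the merge used in test1(); the thresholds of 1/3 and 2/3 are taken from that function.

```python
import pandas as pd

# Hypothetical toy data: one 6-move game and one 3-move game.
df = pd.DataFrame({
    "game_id": ["a"] * 6 + ["b"] * 3,
    "move_number": [1, 2, 3, 4, 5, 6, 1, 2, 3],
})

# Steps 1 and 2: per-game maximum, broadcast back onto every row.
max_move = df.groupby("game_id")["move_number"].transform("max")

# Step 3: fraction of the game completed at each move.
progress = df["move_number"] / max_move

# Label rows by comparing the fraction with the 1/3 and 2/3 thresholds.
df["phase"] = "opening"
df.loc[progress > 1 / 3, "phase"] = "middlegame"
df.loc[progress > 2 / 3, "phase"] = "endgame"

print(df)
```

No Python-level loop over games is involved, which is where the speed-up should come from.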
My approach is in test1(), and yours is in test2():
import pandas
import random
import time

# Build sample data: 25 groups ("A" to "Y"), each with 900-1000 rows.
a = []
for group in range(25):
    for count in range(random.randint(900, 1000)):
        a.append({'group': chr(65 + group), 'count': count})


def test1(x):
    b = pandas.DataFrame(x)
    # Step 1: max count per group.
    max_df = b.groupby(by='group', as_index=False)['count'].max().rename(columns={'count': 'max'})
    # Step 2: merge the max back onto every row.
    b = pandas.merge(b, max_df, on='group', how='left')
    # Step 3: compare each count with 1/3 and 2/3 of its group's max.
    b['phase'] = 'opening'
    b.loc[b['count'] > b['max'] / 3.0, 'phase'] = 'middlegame'
    b.loc[b['count'] > b['max'] / 1.5, 'phase'] = 'endgame'
    b.drop('max', axis=1, inplace=True)
    return b


def test2(x):
    df = pandas.DataFrame(x)
    df['phase'] = ''
    for game_id, group in df.groupby('group'):
        bins = (0, round(group['count'].max() * 1 / 3), round(group['count'].max() * 2 / 3),
                group['count'].max())
        phases = ["opening", "middlegame", "endgame"]
        try:
            group.loc[:, 'phase'] = pandas.cut(group['count'], bins, labels=phases)
        except:
            group.loc[:, 'phase'] = None
    return df


start_time = time.time()
out1 = test1(a)
print(time.time() - start_time)

start_time = time.time()
out2 = test2(a)
print(time.time() - start_time)

# Note: this raises AssertionError, because test2() leaves 'phase' empty.
assert out1.to_dict() == out2.to_dict()
test1 is much faster than test2, although these timings (in seconds) come from a single run only:
test1: 0.09799647331237793
test2: 0.769993782043457
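Because a single run can be noisy, repeated timing gives a steadier comparison. The sketch below uses the standard-library timeit module; time_call and workload are hypothetical helper names introduced here for illustration, and workload merely stands in for test1 or test2.

```python
import timeit


def time_call(func, *args, repeat=5):
    """Return the best wall-clock time in seconds over `repeat` single calls."""
    return min(timeit.repeat(lambda: func(*args), number=1, repeat=repeat))


# Hypothetical stand-in workload; swap in test1 / test2 and their input list.
def workload(n):
    return sum(range(n))


best = time_call(workload, 100_000)
print(f"best of 5: {best:.6f} s")
```

Taking the minimum rather than the mean is a common choice, since slower runs mostly reflect interference from other processes.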
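test2() seems to have a problem: each group produced by groupby is a separate object, so assigning to it does not modify the original data frame, and the phase column stays empty. I am not sure whether it worked for you.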
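A small sketch of that behaviour, on made-up data: assigning to the group object leaves df untouched, while writing through df.loc with the group's index does change it. This only illustrates the point; the loop is still slow for thousands of groups.

```python
import pandas as pd

# Hypothetical toy frame with two groups.
df = pd.DataFrame({"group": ["A", "A", "B"], "count": [1, 2, 1]})
df["phase"] = ""

# Assigning to the group object: df itself is not modified.
for _, group in df.groupby("group"):
    group["phase"] = "opening"

unchanged = bool((df["phase"] == "").all())

# Writing through df.loc with the group's row labels does modify df.
for _, group in df.groupby("group"):
    df.loc[group.index, "phase"] = "opening"

changed = bool((df["phase"] == "opening").all())
print(unchanged, changed)
```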
Answer 1 (score: 1)
Here is an attempt using apply:
def split_by_third(game):
    # Fraction of the game completed at each move (move_number / number of rows).
    game_length = len(game)
    game = game.assign(phase_num=game['move_number'] / game_length)
    return game


def assign_phase(row):
    # if/elif/else leaves no gap at exactly 0.34 or 0.66.
    if row['phase_num'] < 0.34:
        return 'Beginning'
    elif row['phase_num'] < 0.66:
        return 'Middle'
    else:
        return 'End'


df_grouped = df.groupby('game_id').apply(split_by_third)
df_grouped['phase'] = df_grouped.apply(assign_phase, axis=1)
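The row-wise apply above calls a Python function once per row. As a rough alternative sketch, the same idea can be expressed with one transform and one pd.cut call over the whole column. The toy data is made up, and note one difference: pd.cut bins are right-closed by default, so a value of exactly 0.34 or 0.66 falls into the lower phase here.

```python
import pandas as pd

# Hypothetical toy data: one 6-move game and one 3-move game.
df = pd.DataFrame({
    "game_id": ["a"] * 6 + ["b"] * 3,
    "move_number": [1, 2, 3, 4, 5, 6, 1, 2, 3],
})

# Number of rows per game, broadcast to every row (mirrors len(game)).
game_length = df.groupby("game_id")["move_number"].transform("size")
phase_num = df["move_number"] / game_length

# A single pd.cut call using the 0.34 / 0.66 thresholds from the answer.
df["phase"] = pd.cut(
    phase_num,
    bins=[0, 0.34, 0.66, 1.0],
    labels=["Beginning", "Middle", "End"],
)

print(df)
```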
Answer 2 (score: 1)
Following @AlexanderReynolds's suggestion, I was able to get this working with groupby.apply, which gives cleaner and faster code:
def define_move_phase(x):
    x = x.copy()  # work on a copy so the group passed in is not mutated
    max_move = x['move_number'].max()
    bins = (0, round(max_move * 1/3), round(max_move * 2/3), max_move)
    phases = ["opening", "middlegame", "endgame"]
    try:
        x['phase'] = pd.cut(x['move_number'], bins, labels=phases)
    except ValueError:
        # Very short games produce duplicate bin edges, which pd.cut rejects.
        x['phase'] = None
    return x


# apply returns a new DataFrame, so keep the result.
df = df.groupby('game_id').apply(define_move_phase)
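For comparison, the same rounded bin edges can probably be applied without calling a function per game at all. The sketch below is one possible vectorised equivalent, not the code from this answer: it computes the edges per row with transform, and leaves phase as None wherever the edges would not be strictly increasing (the case in which pd.cut raises ValueError). The toy data and the names first_cut, second_cut, and valid are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: a 6-move game and a 1-move game.
df = pd.DataFrame({
    "game_id": ["a"] * 6 + ["b"],
    "move_number": [1, 2, 3, 4, 5, 6, 1],
})

move = df["move_number"]
max_move = df.groupby("game_id")["move_number"].transform("max")

# Same edges as the bins tuple: 0, round(max/3), round(2*max/3), max.
first_cut = (max_move / 3).round()
second_cut = (max_move * 2 / 3).round()

# pd.cut needs strictly increasing edges; otherwise the answer falls back to None.
valid = (first_cut > 0) & (first_cut < second_cut) & (second_cut < max_move)

# Right-closed intervals, matching pd.cut's default.
df["phase"] = None
df.loc[valid & (move > 0) & (move <= first_cut), "phase"] = "opening"
df.loc[valid & (move > first_cut) & (move <= second_cut), "phase"] = "middlegame"
df.loc[valid & (move > second_cut) & (move <= max_move), "phase"] = "endgame"

print(df)
```

Unlike pd.cut, this leaves phase as a plain object column rather than a categorical; converting with astype("category") afterwards is an option if that matters.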