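Is there a faster way than a for loop to modify pandas groups?

Asked: 2019-02-27 20:14:29

Tags: python pandas

The dataframe I am working with is below:

These are chess games. I am trying to group by game and then apply a function to each game based on the number of moves played in that game...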

        game_id     move_number colour  avg_centi
0       03gDhPWr    1           white   NaN
1       03gDhPWr    2           black   37.0
2       03gDhPWr    3           white   61.0
3       03gDhPWr    4           black   -5.0
4       03gDhPWr    5           white   26.0
5       03gDhPWr    6           black   31.0
6       03gDhPWr    7           white   -2.0
... ... ... ... ...
110091  zzaiRa7s    34          black   NaN
110092  zzaiRa7s    35          white   NaN
110093  zzaiRa7s    36          black   NaN
110094  zzaiRa7s    37          white   NaN
110095  zzaiRa7s    38          black   NaN
110096  zzaiRa7s    39          white   NaN
110097  zzaiRa7s    40          black   NaN

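Specifically, I am using pd.cut to create a new column, game_phase (shown as phase in the output below), which records whether a given move was played in the opening, middlegame, or endgame.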
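As a minimal sketch of what pd.cut does here, assuming a single made-up game of 9 moves (the data is illustrative and not taken from the dataset above), the bin edges sit at 0, one third, two thirds, and the last move:

```python
import pandas as pd

# Hypothetical single game with 9 moves (illustrative data only).
game = pd.DataFrame({"move_number": range(1, 10)})

last_move = int(game["move_number"].max())
# Edges at 0, 1/3, 2/3 and the end of the game; pd.cut intervals are right-inclusive.
bins = [0, round(last_move / 3), round(last_move * 2 / 3), last_move]
phases = ["opening", "middlegame", "endgame"]

game["phase"] = pd.cut(game["move_number"], bins=bins, labels=phases)
print(game)
```

With these edges, moves 1-3 fall in the opening, 4-6 in the middlegame, and 7-9 in the endgame.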

     game_id  move_number colour  avg_centi    phase
0   03gDhPWr            1  white        NaN  opening
1   03gDhPWr            2  black       37.0  opening
2   03gDhPWr            3  white       61.0  opening
3   03gDhPWr            4  black       -5.0  opening
4   03gDhPWr            5  white       26.0  opening
5   03gDhPWr            6  black       31.0  opening
6   03gDhPWr            7  white       -2.0  opening
..       ...          ...    ...        ...      ...
54  03gDhPWr           55  white       58.0  endgame
55  03gDhPWr           56  black       26.0  endgame
56  03gDhPWr           57  white      116.0  endgame
57  03gDhPWr           58  black     2000.0  endgame
58  03gDhPWr           59  white        0.0  endgame
59  03gDhPWr           60  black        0.0  endgame
60  03gDhPWr           61  white        NaN  endgame

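I am using the following code to achieve this. Note that each game has to be split into opening, middlegame, and endgame bins based on the total number of moves in that particular game.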

for game_id, group in df.groupby('game_id'):
    bins = (0, round(group['move_number'].max() * 1/3), round(group['move_number'].max() * 2/3), 
            group['move_number'].max())
    phases = ["opening", "middlegame", "endgame"]
    try:
        group.loc[:, 'phase'] = pd.cut(group['move_number'], bins, labels=phases)
    except:
        group.loc[:, 'phase'] = None
    print(group)

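The problem is that looping over each of the thousands of games takes forever.

I think there should be a faster way to compute this than using a for loop to go over the groups and perform the calculation one game at a time.

3 Answers:

Answer 0: (score: 1)

Here is an approach I came up with, using a simple example.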

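In summary, there are 4 steps:

  1. Use groupby to find the max move number of each game
  2. Merge the new df into the old df, so that every row carries its game's max move number
  3. Determine how far into the game each move is by calculating move number / max move number
  4. Add the phase for all games at once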
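The steps above can be sketched in a few lines. This is a minimal sketch rather than the code from this answer: it swaps the merge in step 2 for groupby(...).transform("max"), which broadcasts each game's maximum back onto its rows, and it assumes a small made-up frame that uses the question's column names.

```python
import pandas as pd

# Made-up data with the question's column names: a 6-move and a 3-move game.
df = pd.DataFrame({
    "game_id": ["a"] * 6 + ["b"] * 3,
    "move_number": [1, 2, 3, 4, 5, 6, 1, 2, 3],
})

# Steps 1-2: per-game maximum, aligned to every row of that game.
max_move = df.groupby("game_id")["move_number"].transform("max")

# Step 3: how far into its own game each move is.
progress = df["move_number"] / max_move

# Step 4: label all games at once; later phases overwrite earlier ones.
df["phase"] = "opening"
df.loc[progress > 1 / 3, "phase"] = "middlegame"
df.loc[progress > 2 / 3, "phase"] = "endgame"
print(df)
```

No Python-level loop over games is involved, which is where the speed-up over the original for loop would be expected to come from.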

My approach is in test1(), while yours is in test2():

import pandas
import random
import time

a = []

for group in range(25):
    for count in range(random.randint(900, 1000)):
        a.append({'group': chr(65 + group), 'count': count})


def test1(x):
    b = pandas.DataFrame(x)

    max_df = b.groupby(by='group', as_index=False)['count'].max().rename(columns={'count': 'max'})

    b = pandas.merge(b, max_df, on='group', how='left')

    b['phase'] = 'opening'
    b.loc[b['count'] > b['max'] / 3.0, 'phase'] = 'middlegame'
    b.loc[b['count'] > b['max'] / 1.5, 'phase'] = 'endgame'
    b.drop('max', axis=1, inplace=True)
    return b


def test2(x):
    df = pandas.DataFrame(x)
    df['phase'] = ''
    for game_id, group in df.groupby('group'):
        bins = (0, round(group['count'].max() * 1 / 3), round(group['count'].max() * 2 / 3),
                group['count'].max())
        phases = ["opening", "middlegame", "endgame"]
        try:
            group.loc[:, 'phase'] = pandas.cut(group['count'], bins, labels=phases)
        except:
            group.loc[:, 'phase'] = None
    return df


start_time = time.time()
out1 = test1(a)
print(time.time() - start_time)

start_time = time.time()
out2 = test2(a)
print(time.time() - start_time)

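# Note: this comparison fails as written, because test2() never writes the
# phases back to df, so its 'phase' column stays empty.
# assert out1.to_dict() == out2.to_dict()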

test1 is much faster than test2, although this is from a single run (times in seconds):

test1: 0.09799647331237793
test2: 0.769993782043457

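There seems to be a problem with test2(): it does not actually modify the dataframe, so the phase column stays empty. I am not sure whether it works for you.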
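If the loop is kept, one way around the problem noted above is to collect each labelled group and concatenate the pieces at the end. This is a sketch under assumptions (a small made-up frame, and assign() instead of writing into the group), not a change to the benchmark above.

```python
import pandas as pd

# Made-up data: two games of 3 moves each.
df = pd.DataFrame({
    "game_id": ["a", "a", "a", "b", "b", "b"],
    "move_number": [1, 2, 3, 1, 2, 3],
})
phases = ["opening", "middlegame", "endgame"]

pieces = []
for game_id, group in df.groupby("game_id"):
    last_move = int(group["move_number"].max())
    bins = [0, round(last_move / 3), round(last_move * 2 / 3), last_move]
    # assign() returns a new frame, so the labelled copy must be kept explicitly.
    pieces.append(group.assign(phase=pd.cut(group["move_number"], bins, labels=phases)))

# The original frame was never modified by the loop.
print("phase" in df.columns)

# Stitch the labelled groups back together in the original row order.
result = pd.concat(pieces).sort_index()
print(result)
```

This keeps the per-game pd.cut logic intact, but it is still a Python loop, so it does not address the speed problem.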

Answer 1: (score: 1)

Here is an attempt using apply:

def split_by_third(game):
    game_length = len(game)
    game = game.assign(phase_num=game['move_number']/game_length)

    return game

def assign_phase(row):
    # <= closes the gaps at exactly 0.34 and 0.66, which previously returned None
    if row['phase_num'] <= 0.34:
        return 'Beginning'
    if row['phase_num'] <= 0.66:
        return 'Middle'
    if row['phase_num'] > 0.66:
        return 'End'

df_grouped = df.groupby('game_id').apply(split_by_third)

df_grouped['phase'] = df_grouped.apply(assign_phase, axis=1)
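The row-wise apply above calls a Python function once per row. A hedged alternative, which is not part of this answer, is to compute the per-game length with transform and label every row with a single pd.cut call on the ratio. The data below is made up, and the edges 0.34 and 0.66 are taken from the code above.

```python
import pandas as pd

# Made-up data: a 10-move game and a 4-move game.
df = pd.DataFrame({
    "game_id": ["a"] * 10 + ["b"] * 4,
    "move_number": list(range(1, 11)) + list(range(1, 5)),
})

# Number of rows in each game, broadcast to every row of that game.
game_length = df.groupby("game_id")["move_number"].transform("size")
phase_num = df["move_number"] / game_length

# One pd.cut call over all games: (0, 0.34], (0.34, 0.66], (0.66, 1].
df["phase"] = pd.cut(phase_num, bins=[0, 0.34, 0.66, 1.0],
                     labels=["Beginning", "Middle", "End"])
print(df)
```

Like split_by_third, this uses the number of rows in a game as its length, so it assumes move numbers run from 1 to that count without gaps.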

Answer 2: (score: 1)

Following @AlexanderReynolds' suggestion, I was able to get this working with cleaner and faster code by using groupby.apply:

def define_move_phase(x):
    bins = (0, round(x['move_number'].max() * 1/3), round(x['move_number'].max() * 2/3), x['move_number'].max())    
    phases = ["opening", "middlegame", "endgame"]
    try:
        x.loc[:, 'phase'] = pd.cut(x['move_number'], bins, labels=phases)
    except ValueError:
        x.loc[:, 'phase'] = None
    return x

# apply returns a new DataFrame, so keep the result
df = df.groupby('game_id').apply(define_move_phase)
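For comparison, the same rounded bin edges can be computed for every row at once, which avoids calling a Python function per game. This is a sketch under assumptions and not part of this answer: the data is made up, numpy.select stands in for pd.cut, and games whose edges are not strictly increasing get None, mirroring the except ValueError branch above.

```python
import numpy as np
import pandas as pd

# Made-up data: a 7-move game and a 2-move game (too short for three bins).
df = pd.DataFrame({
    "game_id": ["a"] * 7 + ["b"] * 2,
    "move_number": [1, 2, 3, 4, 5, 6, 7, 1, 2],
})

last = df.groupby("game_id")["move_number"].transform("max")
lo = (last * 1 / 3).round()   # last move of the opening
hi = (last * 2 / 3).round()   # last move of the middlegame

move = df["move_number"]
# pd.cut raises ValueError unless the edges 0 < lo < hi < last are all distinct.
valid = (lo > 0) & (hi > lo) & (last > hi)

conditions = [
    valid & (move > 0) & (move <= lo),
    valid & (move > lo) & (move <= hi),
    valid & (move > hi) & (move <= last),
]
# Index into the label array; -1 (no condition matched) selects the trailing None.
codes = np.select(conditions, [0, 1, 2], default=-1)
phases = np.array(["opening", "middlegame", "endgame", None], dtype=object)
df["phase"] = phases[codes]
print(df)
```

The result is a plain object column rather than the categorical that pd.cut produces; converting it is left out to keep the sketch short.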