Efficiently computing temporal features with pandas

Time: 2020-10-15 14:40:36

Tags: python pandas performance

I have the following .csv file:

Match_idx,Date,Player_1,Player_2,Player_1_wins
0,2020-01-01,p1,p2,1
1,2020-01-02,p2,p3,0
2,2020-01-03,p3,p1,1
3,2020-01-04,p4,p1,1

I want to compute additional columns to obtain the following output .csv file:

Match_idx,Date,Player_1,Player_2,Player_1_wins,Player_1_winrate,Player_2_winrate,Player_1_matches,Player_2_matches,Head_to_head
0,2020-01-01,p1,p2,1,0,0,0,0,0,''
1,2020-01-02,p2,p3,0,0,0,1,0,0,''
2,2020-01-03,p3,p1,1,1,1,1,1,0,''
3,2020-01-04,p4,p1,1,0,1/2,0,2,0,''
4,2020-01-05,p1,p3,0,1/2,2/2,3,2,'0'
5,2020-01-06,p3,p1,1,1/3,3/3,4,3,'11'

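Semantics of each column:

  • Match_idx, Date, Player_1, Player_2: self-explanatory
  • Player_1_wins: did Player_1 win the match? 1 if yes, 0 otherwise

These columns will be kept, and I want to add the following ones:

  • Player_1_winrate: number_of_wins_for_player_1_before_this_one / number_of_matches_played_by_player_1_before_this_one

  • Player_2_winrate: same as above, for player_2

  • Player_1_matches: number_of_matches_played_by_player_1_before_this_one

  • Player_2_matches: same as above, for player_2

  • Head_to_head: results of the previous matches between Player_1 and Player_2, encoded as a string over {'0', '1'}: append '1' if Player_1 won the match, '0' otherwise.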
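The Head_to_head definition is the least obvious of the five, so here is a minimal sketch of one way to read it. It assumes the string is written from the point of view of the current Player_1, which is what the '0' and '11' values in the expected output suggest. The first four matches come from the input file and the last two from rows 4 and 5 of the expected output; the function name head_to_head_column is hypothetical.

```python
from collections import defaultdict

# (Player_1, Player_2, Player_1_wins), already sorted by date.
matches = [
    ("p1", "p2", 1),
    ("p2", "p3", 0),
    ("p3", "p1", 1),
    ("p4", "p1", 1),
    ("p1", "p3", 0),
    ("p3", "p1", 1),
]

def head_to_head_column(matches):
    """For each match, encode the earlier results between the same two
    players from the point of view of the current Player_1 (assumption)."""
    past_winners = defaultdict(list)  # unordered pair -> winners, oldest first
    column = []
    for player_1, player_2, player_1_wins in matches:
        pair = frozenset((player_1, player_2))
        # Read the history before recording the current result.
        column.append(
            "".join("1" if w == player_1 else "0" for w in past_winners[pair])
        )
        past_winners[pair].append(player_1 if player_1_wins == 1 else player_2)
    return column

h2h = head_to_head_column(matches)
```

Storing the winner's name per unordered pair, rather than a ready-made string, lets the same history be rendered correctly whichever of the two players is listed as Player_1 in a later match.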

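What I have done

I am using the pandas library to manipulate the file. The naive approach I have been considering is the following: for each match, select all the matches (won or lost) that a player took part in, sorted by date. Then, for the win rate feature, apply the following two functions to the match.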

import pandas as pd

def get_matches_won_before_by_player(df: pd.DataFrame, player: str, before: str) -> pd.DataFrame:
    # Matches won by `player`, whether listed as Player_1 or Player_2.
    mask_player_won = (
        ((df['Player_1_wins'] == 1) & (df['Player_1'] == player)) |
        ((df['Player_1_wins'] == 0) & (df['Player_2'] == player))
    )

    # Keep only matches strictly before `before`. sort_values returns a new
    # frame, so no slice of df is modified in place.
    return df[(df['Date'] < before) & mask_player_won].sort_values(by='Date')

def get_matches_played_before_by_player(df: pd.DataFrame, player: str, before: str) -> pd.DataFrame:
    # Matches in which `player` took part, on either side.
    mask_player_played = (
        (df['Player_1'] == player) |
        (df['Player_2'] == player)
    )

    return df[(df['Date'] < before) & mask_player_played].sort_values(by='Date')

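I could apply this logic to every match, but that means running these functions once per match, which is very inefficient.

What I want to do

How can I compute my features efficiently, using only each player's last match before the given match? For example, the win rate of each player could be updated with the following logic:

  1. Initialize every column to 0.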
  2. Update the win rate as (M * N / (N + 1)) + (W / (N + 1)), where M is the current win rate, N is the number of matches played so far, and W = 1 if the player won, 0 otherwise.
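The two steps above amount to a running mean, which can be sketched as follows. This is only an illustration under the assumption that the intended update is new_rate = (M * N + W) / (N + 1); the function name update_winrate is hypothetical. The loop replays p1's three results from the sample file (a win, then two losses) and records the values as they stand before each match.

```python
def update_winrate(winrate: float, n_played: int, won: int) -> float:
    """Running mean: fold one more result (won is 0 or 1) into a win rate
    that was computed over n_played matches."""
    return (winrate * n_played + won) / (n_played + 1)

# p1's results in the sample file, in date order: win, loss, loss.
winrate, n_played = 0.0, 0
history = []
for won in (1, 0, 0):
    # Record the state *before* the match, which is what the features need.
    history.append((winrate, n_played))
    winrate = update_winrate(winrate, n_played, won)
    n_played += 1
```

Keeping one (winrate, n_played) pair per player, for instance in a dict keyed by player name, would let every match be processed in constant time after a single sort by date.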

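Any help or ideas on how to organize such a process would be appreciated.

1 answer:

Answer 0 (score: 0)

I tried to operate on Series so that the solution runs fast. I explain it through the comments in the code.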

import numpy as np

# df is assumed to be the DataFrame loaded from the .csv file above.

# to return head to head: a module-level string that grows with every row
# (it accumulates over all matches, not per pair of players)
strp1gw = ""
def get_head_to_head(s):
    global strp1gw
    strp1gw += str(s)
    return strp1gw

df = (
    df
    .assign(
        # cumulative Player_1 wins, stored under Player_1_winrate to avoid an
        # extra column; it is replaced by the rate below
        Player_1_winrate=lambda x: x['Player_1_wins'].cumsum(),
        # 1 if 'p1' played in this match, else 0
        Player_1_matches=lambda x: np.where((x['Player_1'] == 'p1') | (x['Player_2'] == 'p1'), 1, 0)
    )
    # cumulative number of matches involving 'p1', current match included
    .assign(Player_1_matches=lambda x: x['Player_1_matches'].cumsum())
    # the player 1 winrate
    .assign(Player_1_winrate=lambda x: x['Player_1_winrate'] / x['Player_1_matches'])
    # same for player 2, but you didn't mention how to compute Player_2_wins;
    # assumption: it is the complement of Player_1_wins
    .assign(Player_2_wins=lambda x: 1 - x['Player_1_wins'])
    .assign(
        Player_2_winrate=lambda x: x['Player_2_wins'].cumsum(),
        Player_2_matches=lambda x: np.where((x['Player_1'] == 'p2') | (x['Player_2'] == 'p2'), 1, 0)
    )
    .assign(Player_2_matches=lambda x: x['Player_2_matches'].cumsum())
    .assign(Player_2_winrate=lambda x: x['Player_2_winrate'] / x['Player_2_matches'])
    # apply the function to get the head to head value
    .assign(Head_to_head=lambda x: x['Player_1_wins'].apply(get_head_to_head))
)
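
As an alternative sketch, not part of the answer above, the four count and rate columns can be computed per player rather than for the hardcoded names 'p1' and 'p2'. The idea is to reshape the matches into one row per (match, player), then use groupby with cumcount and cumsum so that each row only sees earlier matches. The helper player_rows and the intermediate column names (player, won, slot, matches_before, wins_before) are hypothetical, and a win rate of 0 for a player with no previous match is an assumption taken from the expected output.

```python
import io
import pandas as pd

csv_text = """Match_idx,Date,Player_1,Player_2,Player_1_wins
0,2020-01-01,p1,p2,1
1,2020-01-02,p2,p3,0
2,2020-01-03,p3,p1,1
3,2020-01-04,p4,p1,1
"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])
df = df.sort_values(["Date", "Match_idx"]).reset_index(drop=True)

def player_rows(df, slot):
    """Rows for one player column, with `won` seen from that player's side."""
    out = df[["Match_idx", "Date"]].copy()
    out["player"] = df[f"Player_{slot}"]
    out["won"] = df["Player_1_wins"] if slot == 1 else 1 - df["Player_1_wins"]
    out["slot"] = slot
    return out

# Long table: one row per (match, player), in chronological order.
long = pd.concat([player_rows(df, 1), player_rows(df, 2)], ignore_index=True)
long = long.sort_values(["Date", "Match_idx"]).reset_index(drop=True)

by_player = long.groupby("player")
# cumcount numbers each player's matches from 0, i.e. matches played before.
matches_before = by_player.cumcount()
# Cumulative wins minus the current result gives wins strictly before.
wins_before = by_player["won"].cumsum() - long["won"]
long = long.assign(matches_before=matches_before, wins_before=wins_before)

# Bring the per-player values back onto the original one-row-per-match frame.
for slot in (1, 2):
    part = long[long["slot"] == slot].set_index("Match_idx")
    matches = df["Match_idx"].map(part["matches_before"])
    wins = df["Match_idx"].map(part["wins_before"])
    df[f"Player_{slot}_matches"] = matches
    # 0/0 yields NaN for a first match; treat it as 0 (assumption).
    df[f"Player_{slot}_winrate"] = (wins / matches).fillna(0.0)
```

If two matches of the same player share a date, "before" becomes ambiguous; this sketch breaks ties by Match_idx.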