按唯一ID分组,应用功能并为下一组更新特定列

时间:2019-08-06 23:55:05

标签: python pandas dataframe pandas-groupby updating

我有一个看起来像这样的数据框:


In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'match_id': ['m1', 'm1', 'm1', 'm1', 'm2', 'm2', 'm2', 'm2', 'm3', 'm3', 'm3', 'm3'],
   ...:                     'name':['peter', 'mike', 'jeff', 'john', 'alex', 'joe', 'jeff', 'peter', 'alex', 'peter', '
   ...: joe', 'tom' ],
   ...:                     'rank': [2, 3, 1, 4, 3, 1, 2, 4, 4, 3, 1, 2],
   ...:                     'rating': [100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100]})

In [3]: df
Out[3]:
    match_id    name    rank  rating
0          m1  peter     2     100
1          m1   mike     3     100
2          m1   jeff     1     100
3          m1   john     4     100
4          m2   alex     3     100
5          m2    joe     1     100
6          m2   jeff     2     100
7          m2  peter     4     100
8          m3   alex     4     100
9          m3  peter     3     100
10         m3    joe     1     100
11         m3    tom     2     100

大约有三场比赛,唯一的“ match_id”,参与者的姓名,他们在比赛结束时的排名以及整个数据帧的默认评分得分手动设置为100。

我想基于“ match_id”对数据进行分组,并分别为每个匹配项运行一个函数,但是该函数的输出应用于更新下一个匹配项的列。

我想使用一个函数,该函数在每次比赛后计算球员的更新评分,并将其放在名为“ updated_rating”的新列中。对于第一场比赛,我很累的功能如下:

df = df.loc[df['match_id'] == 'm1']
N = len(df)
df['win_prob'] = 0.0
for i in range(N):
    for j in range(N):
        if i != j:
            df['S'] = (N - df['rank']) / ((N*(N-1))/2)
            df['win_prob'][i] += (1 / (1 + (10 ** ((df['rating'][i] - df['rating'][j])/400))))
            df['normalized_win_prob'] = df['win_prob']/(N*(N-1)/2)
            df['updated_rating'] = round(df['rating'] + (20 * (df['S'] - df['normalized_win_prob'])), 1)

这将为第一场比赛做好准备,并根据其原始评分计算更新的评分以及每个玩家获胜的概率。但是,我无法将其扩展为考虑以下匹配项。

由于一些球员再次出现在下一场比赛中,所以我想更新他们的评分(基于上一阶段计算的“ updated_rating”列),然后让该函数完成第二场比赛以及此后第三场比赛的工作。

例如,在第一个匹配项计算之后的输出 如下所示:


match_id name rank rating  win_prob    S    normalized_win_prob  updated_rating
0   m1  peter   2   100     1.5     0.333333          0.25            101.7
1   m1  mike    3   100     1.5     0.166667          0.25             98.3
2   m1  jeff    1   100     1.5     0.500000          0.25            105.0
3   m1  john    4   100     1.5     0.000000          0.25             95.0

关于如何有效地执行此操作的任何想法? 我的原始数据帧比该示例数据帧大得多,因此我的解决方案需要高效。

谢谢

1 个答案:

答案 0 :(得分:1)

这是我的解决方案。由于您的算法必须逐个循环match_ids,因此我们首先需要对分组数据使用for-loop。然后,要计算win_prob,您必须遍历每一行并计算其相关概率,以在同一匹配中赢得其他行。不好看没想到更好的方法:(

df = pd.DataFrame({'match_id': ['m1', 'm1', 'm1', 'm1', 'm2', 'm2', 'm2', 'm2', 'm3', 'm3', 'm3', 'm3', 'm4', 'm4', 'm4', 'm4'],
                   'name':['peter', 'mike', 'jeff', 'john', 'alex', 'joe', 'jeff', 'peter', 'alex', 'peter', 'joe', 'tom', 'mike', 'john', 'tom', 'peter'],
                   'rank': [2, 3, 1, 4, 3, 1, 2, 4, 4, 3, 1, 2, 1, 3, 4, 2],
                   'rating': [100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100]})

# Pre-compute variables that don't depend on ratings
df['N'] = df.groupby('match_id')['name'].transform('count')
df['total_comb'] = ((df['N']*(df['N']-1))/2)
df['S'] = (df['N'] - df['rank']) / df['total_comb']

# Initialize win_prob and updated_rating
df['win_prob'] = np.zeros(len(df))
df['updated_rating'] = df['rating']
df['prev_rating'] = df['rating']

grouped = df.groupby('match_id', sort=True)

dfa = pd.DataFrame() #Final results will be stored here
last_names = []
#Loop through the match_ids from m1 to m2, m3. Note you can sort them when use 'groupby'
for name, dfg in grouped:
    dfm = dfg.copy()
    # Update the 'updated_rating' coming from last match_id
    if len(last_names) > 0:
        dfm.drop(columns=['updated_rating'], inplace=True)
        df_last = dfa.loc[dfa['match_id'].isin(last_names),['name', 'updated_rating']]
        df_last.drop_duplicates(subset=['name'], keep='last', inplace=True)
        dfm = dfm.merge(df_last, left_on='name', right_on='name', how='left')
        dfm['prev_rating'] = np.where(np.isnan(dfm['updated_rating']), dfm['rating'], dfm['updated_rating'])

    # Compute current 'updated_rating'
    win_prob = []
    for index, row in dfm.iterrows():
        prob = np.sum(1.0/(1+10**((row['prev_rating'] - dfm['prev_rating'])/400)))-0.5 #subtract 0.5 to account for self
        win_prob.append(prob)

    dfm['win_prob'] = win_prob
    dfm['normalized_win_prob'] = dfm['win_prob']/dfm['total_comb']
    dfm['updated_rating'] = round(dfm['prev_rating'] + (20 * (dfm['S'] - dfm['normalized_win_prob'])), 1)
    last_names.append(name)
    dfa = pd.concat([dfa, dfm], sort=True)

dfa   

输出:

N       S         match_id  name    normalized_win_prob prev_rating   rank  rating  total_comb  updated_rating  win_prob
4   0.333333333     m1      peter   0.25                    100         2   100     6               101.7       1.5
4   0.166666667     m1      mike    0.25                    100         3   100     6               98.3        1.5
4   0.5             m1      jeff    0.25                    100         1   100     6               105         1.5
4   0               m1      john    0.25                    100         4   100     6               95          1.5
4   0.166666667     m2      alex    0.251606926             100         3   100     6               98.3        1.509641559
4   0.5             m2      joe     0.251606926             100         1   100     6               105         1.509641559
4   0.333333333     m2      jeff    0.24681015              105         2   100     6               106.7       1.480860898
4   0               m2      peter   0.249975997             101.7       4   100     6               96.7        1.499855985
4   0               m3      alex    0.251630798             98.3        4   100     6               93.3        1.509784788
4   0.166666667     m3      peter   0.253165649             96.7        3   100     6               95          1.518993896
4   0.5             m3      joe     0.245203608             105         1   100     6               110.1       1.47122165
4   0.333333333     m3      tom     0.249999944             100         2   100     6               101.7       1.499999666
4   0.5             m4      mike    0.249232493             98.3        1   100     6               103.3       1.495394959
4   0.166666667     m4      john    0.252398303             95          3   100     6               93.3        1.514389819
4   0               m4      tom     0.2459709               101.7       4   100     6               96.8        1.475825403
4   0.333333333     m4      peter   0.252398303             95          2   100     6               96.6        1.514389819