我有一个看起来像这样的数据框:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'match_id': ['m1', 'm1', 'm1', 'm1', 'm2', 'm2', 'm2', 'm2', 'm3', 'm3', 'm3', 'm3'],
...: 'name':['peter', 'mike', 'jeff', 'john', 'alex', 'joe', 'jeff', 'peter', 'alex', 'peter', '
...: joe', 'tom' ],
...: 'rank': [2, 3, 1, 4, 3, 1, 2, 4, 4, 3, 1, 2],
...: 'rating': [100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100]})
In [3]: df
Out[3]:
match_id name rank rating
0 m1 peter 2 100
1 m1 mike 3 100
2 m1 jeff 1 100
3 m1 john 4 100
4 m2 alex 3 100
5 m2 joe 1 100
6 m2 jeff 2 100
7 m2 peter 4 100
8 m3 alex 4 100
9 m3 peter 3 100
10 m3 joe 1 100
11 m3 tom 2 100
大约有三场比赛,唯一的“ match_id”,参与者的姓名,他们在比赛结束时的排名以及整个数据帧的默认评分得分手动设置为100。
我想基于“ match_id”对数据进行分组,并分别为每个匹配项运行一个函数,但是该函数的输出应用于更新下一个匹配项的列。
我想使用一个函数,该函数在每次比赛后计算球员的更新评分,并将其放在名为“ updated_rating”的新列中。对于第一场比赛,我很累的功能如下:
df = df.loc[df['match_id'] == 'm1']
N = len(df)
df['win_prob'] = 0.0
for i in range(N):
for j in range(N):
if i != j:
df['S'] = (N - df['rank']) / ((N*(N-1))/2)
df['win_prob'][i] += (1 / (1 + (10 ** ((df['rating'][i] - df['rating'][j])/400))))
df['normalized_win_prob'] = df['win_prob']/(N*(N-1)/2)
df['updated_rating'] = round(df['rating'] + (20 * (df['S'] - df['normalized_win_prob'])), 1)
这将为第一场比赛做好准备,并根据其原始评分计算更新的评分以及每个玩家获胜的概率。但是,我无法将其扩展为考虑以下匹配项。
由于一些球员再次出现在下一场比赛中,所以我想更新他们的评分(基于上一阶段计算的“ updated_rating”列),然后让该函数完成第二场比赛以及此后第三场比赛的工作。
例如,在第一个匹配项计算之后的输出 如下所示:
match_id name rank rating win_prob S normalized_win_prob updated_rating
0 m1 peter 2 100 1.5 0.333333 0.25 101.7
1 m1 mike 3 100 1.5 0.166667 0.25 98.3
2 m1 jeff 1 100 1.5 0.500000 0.25 105.0
3 m1 john 4 100 1.5 0.000000 0.25 95.0
关于如何有效地执行此操作的任何想法? 我的原始数据帧比该示例数据帧大得多,因此我的解决方案需要高效。
谢谢
答案 0 :(得分:1)
这是我的解决方案。由于您的算法必须逐个循环match_ids,因此我们首先需要对分组数据使用for-loop
。然后,要计算win_prob
,您必须遍历每一行并计算其相关概率,以在同一匹配中赢得其他行。不好看没想到更好的方法:(
df = pd.DataFrame({'match_id': ['m1', 'm1', 'm1', 'm1', 'm2', 'm2', 'm2', 'm2', 'm3', 'm3', 'm3', 'm3', 'm4', 'm4', 'm4', 'm4'],
'name':['peter', 'mike', 'jeff', 'john', 'alex', 'joe', 'jeff', 'peter', 'alex', 'peter', 'joe', 'tom', 'mike', 'john', 'tom', 'peter'],
'rank': [2, 3, 1, 4, 3, 1, 2, 4, 4, 3, 1, 2, 1, 3, 4, 2],
'rating': [100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100]})
# Pre-compute variables that don't depend on ratings
df['N'] = df.groupby('match_id')['name'].transform('count')
df['total_comb'] = ((df['N']*(df['N']-1))/2)
df['S'] = (df['N'] - df['rank']) / df['total_comb']
# Initialize win_prob and updated_rating
df['win_prob'] = np.zeros(len(df))
df['updated_rating'] = df['rating']
df['prev_rating'] = df['rating']
grouped = df.groupby('match_id', sort=True)
dfa = pd.DataFrame() #Final results will be stored here
last_names = []
#Loop through the match_ids from m1 to m2, m3. Note you can sort them when use 'groupby'
for name, dfg in grouped:
dfm = dfg.copy()
# Update the 'updated_rating' coming from last match_id
if len(last_names) > 0:
dfm.drop(columns=['updated_rating'], inplace=True)
df_last = dfa.loc[dfa['match_id'].isin(last_names),['name', 'updated_rating']]
df_last.drop_duplicates(subset=['name'], keep='last', inplace=True)
dfm = dfm.merge(df_last, left_on='name', right_on='name', how='left')
dfm['prev_rating'] = np.where(np.isnan(dfm['updated_rating']), dfm['rating'], dfm['updated_rating'])
# Compute current 'updated_rating'
win_prob = []
for index, row in dfm.iterrows():
prob = np.sum(1.0/(1+10**((row['prev_rating'] - dfm['prev_rating'])/400)))-0.5 #subtract 0.5 to account for self
win_prob.append(prob)
dfm['win_prob'] = win_prob
dfm['normalized_win_prob'] = dfm['win_prob']/dfm['total_comb']
dfm['updated_rating'] = round(dfm['prev_rating'] + (20 * (dfm['S'] - dfm['normalized_win_prob'])), 1)
last_names.append(name)
dfa = pd.concat([dfa, dfm], sort=True)
dfa
输出:
N S match_id name normalized_win_prob prev_rating rank rating total_comb updated_rating win_prob
4 0.333333333 m1 peter 0.25 100 2 100 6 101.7 1.5
4 0.166666667 m1 mike 0.25 100 3 100 6 98.3 1.5
4 0.5 m1 jeff 0.25 100 1 100 6 105 1.5
4 0 m1 john 0.25 100 4 100 6 95 1.5
4 0.166666667 m2 alex 0.251606926 100 3 100 6 98.3 1.509641559
4 0.5 m2 joe 0.251606926 100 1 100 6 105 1.509641559
4 0.333333333 m2 jeff 0.24681015 105 2 100 6 106.7 1.480860898
4 0 m2 peter 0.249975997 101.7 4 100 6 96.7 1.499855985
4 0 m3 alex 0.251630798 98.3 4 100 6 93.3 1.509784788
4 0.166666667 m3 peter 0.253165649 96.7 3 100 6 95 1.518993896
4 0.5 m3 joe 0.245203608 105 1 100 6 110.1 1.47122165
4 0.333333333 m3 tom 0.249999944 100 2 100 6 101.7 1.499999666
4 0.5 m4 mike 0.249232493 98.3 1 100 6 103.3 1.495394959
4 0.166666667 m4 john 0.252398303 95 3 100 6 93.3 1.514389819
4 0 m4 tom 0.2459709 101.7 4 100 6 96.8 1.475825403
4 0.333333333 m4 peter 0.252398303 95 2 100 6 96.6 1.514389819