需要计算每个参赛队伍每场比赛的平均进球数(rolling_avg
),直至某个时间点。由于我的数据库是高度规范化的,这需要相当多的时间(+ 8k行)
备注:
我在计算比赛中的总进球数(第0栏),而不是每支球队的进球数。
此处省略了其他列但可能相关:其中一列是date_time
列。
示例:在第1行(第2行)中,我们可以看到团队1249
的第一个游戏(他们正在玩耍)。在这场比赛中,记录了3个进球。团队1249
的下一个游戏发生在第10行,并且由于该团队此次在家玩(因为id
出现在home_team_id
下),我想要{{1它应该是等于3.它不应该考虑当前行。
问题:
如何根据home_rolling_avg
中的先前值并排除当前行,计算每个团队扩展平均值/平均值的目标,并将此平均值分配给相关列(取决于球队是在主场还是在客场比赛?
total_goals
可以找到更大的样本(n = 100)here。已根据piRSquared(answer here)类型的请求添加。
答案 0 :(得分:2)
# easy tracking of long column names
hw = ['home_team_id', 'away_team_id']
# I found it easier to melt myself with some numpy help
hw_vals = df[hw].values.ravel() # flatten 2 columns
idx_rep = df.index.values.repeat(2) # repeat index [0, 0, 1, 1, ...
tot_rep = df.total_goals.values.repeat(2) # repeat totals [2, 2, 3, 3, ...
# This is the reshaped series of team ids with total_goals
s = pd.Series(tot_rep, [idx_rep, hw_vals])
# groupby with a combination of expanding().mean() and shift()
e = s.groupby(level=1).apply(lambda x: x.expanding().mean().shift()).dropna()
# style preference of mine to do assignments using index values
# and to get it done in one line
df.set_index(hw[0], append=1).assign(home_rolling_avg=e).reset_index(hw[0]) \
.set_index(hw[1], append=1).assign(away_rolling_avg=e).reset_index(hw[1])
更深入的解释
pd.melt
,我做了。但我发现语法比我最终做的更难。而且我知道numpy无论如何都会更快。
level=1
的第二级进行分组,然后进行expanding().mean()
分组。
assign
将新列添加到副本中。我喜欢这种方法,因为它允许我链接命令。所以我这样做,通过重置索引并为客队重复进程。分配是有效的,因为指数自然排列。 替代方法
使用defaultdict
中的Counter
+ collections
与Steven Rouch非常相似
from collections import defaultdict, Counter
c, d = Counter(), defaultdict(int)
home_avgpg = pd.Series(index=df.index)
away_avgpg = pd.Series(index=df.index)
for row in df.itertuples():
h = row.home_team_id
a = row.away_team_id
t = row.total_goals
if h in c:
home_avgpg.set_value(i, d[h] / c[h])
if a in c:
away_avgpg.set_value(i, d[a] / c[a])
d[h] += t
d[a] += t
c.update([h, a])
@ StevenRauch的回答非常快。
答案 1 :(得分:1)
我认为这符合问题陈述。它使用itertuples
迭代每一行并计算一个运行平均值:
teams_rolling_goals = {}
home_rolling_avg = []
away_rolling_avg = []
for row in df.itertuples():
# get previous goal and game counts
home_goals = teams_rolling_goals.get(row.home_team_id, (0, 0.))
away_goals = teams_rolling_goals.get(row.away_team_id, (0, 0.))
# calc a rolling average
home_avg = np.nan if home_goals[1] == 0 \
else home_goals[0] / home_goals[1]
away_avg = np.nan if away_goals[1] == 0 \
else away_goals[0] / away_goals[1]
# save the averages for the row
home_rolling_avg.append(home_avg)
away_rolling_avg.append(away_avg)
# accum rolling goals/games count
teams_rolling_goals[row.home_team_id] = (
home_goals[0] + int(row.total_goals), home_goals[1] + 1)
teams_rolling_goals[row.away_team_id] = (
away_goals[0] + int(row.total_goals), away_goals[1] + 1)
print(row.home_team_id, home_rolling_avg[-1],
row.away_team_id, away_rolling_avg[-1])
# save the results
df['home_rolling_avg'] = home_rolling_avg
df['away_rolling_avg'] = away_rolling_avg
使用的样本数据:
data = np.array([
('week', 'total_goals', 'home_team_id', 'away_team_id', 'home_goals',
'away_goals', 'home_rolling_avg', 'away_rolling_avg'),
(0, 2, 1277, 1241, 1, 1, np.nan, np.nan),
(1, 3, 1245, 1249, 2, 1, np.nan, np.nan),
(2, 1, 1242, 1246, 0, 1, np.nan, np.nan),
(3, 4, 1261, 1248, 1, 3, np.nan, np.nan),
(4, 2, 1259, 1240, 2, 0, np.nan, np.nan),
(5, 3, 2981, 1268, 1, 2, np.nan, np.nan),
(6, 1, 1244, 1255, 1, 0, np.nan, np.nan),
(7, 1, 1254, 1276, 1, 0, np.nan, np.nan),
(8, 7, 1247, 12140, 5, 2, np.nan, np.nan),
(9, 3, 5681, 1270, 2, 1, np.nan, np.nan),
(10, 1, 1249, 5681, 0, 1, np.nan, np.nan),
])
index = data[1:, 0]
df = pd.DataFrame(data=data[1:, 1:],
index=index,
columns=data[0, 1:])