Say I have a dataset with home_team and away_team columns, plus home_win and away_win columns that indicate which team won the game. It looks like this:
Home_team Away_Team Home_Win Away_Win gameID
TB CLB 1 0 1
NY ARZ 0 1 2
EDM CAN 1 0 3
NY TB 0 1 4
NY CLB 1 0 5
TB NY 1 0 6
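How would you write a sequential counter that computes each team's total number of wins in its previous games, regardless of whether the team played at home or away? For gameID 1, each team has 0 total wins. Since TB won its first game, it has 1 total win going into its second game, against NY (gameID 4), while NY has 0 previous wins.
So the data would look like this (AT = Away_Team, HT = Home_Team):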
Home_team Away_Team Home_Win Away_Win gameID HT'sTotWins AT'sTotWins
TB CLB 1 0 1 0 0
NY ARZ 0 1 2 0 0
EDM CAN 1 0 3 0 0
NY TB 0 1 4 0 1
NY CLB 1 0 5 0 0
TB NY 1 0 6 2 1
I have read a bit about GroupBy.cumcount(), but I don't know how to write the condition.
I hope it is clear what I am trying to do; if not, please let me know.
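Answer 0 (score: 1)
To make the example more instructive, I extended your source data to 10 games and shortened the column names so that the printout is less wide.
The first part of the script creates the source DataFrame as follows: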
import pandas as pd
# Source data
df = pd.DataFrame(data=[
[ 1, 'TB', 'CLB', 1], [ 2, 'NY', 'ARZ', 0],
[ 3, 'EDM', 'CAN', 1], [ 4, 'NY', 'TB', 0],
[ 5, 'NY', 'CLB', 1], [ 6, 'TB', 'NY', 1],
[ 7, 'ARZ', 'CAN', 1], [ 8, 'ARZ', 'TB', 0],
[ 9, 'NY', 'EDM', 1], [10, 'TB', 'CAN', 1]],
columns=['gameID', 'HomeTeam', 'AwayTeam', 'HomeWin']).set_index('gameID')
df['AwayWin'] = 1 - df['HomeWin']
Since the winning team can be in either HomeTeam or AwayTeam, there is no simple way to do this with a single groupby. You have to use it twice, once for each result column.
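As a small illustration of the building block that both steps rely on, here is a sketch of what cumcount() returns when a Series of game winners is grouped by itself. The winners Series is typed in by hand from the first six games of the sample data, and the variable names are only for illustration.

```python
import pandas as pd

# Winner of each of the first six games, in game order (entered by hand)
winners = pd.Series(['TB', 'ARZ', 'EDM', 'TB', 'NY', 'TB'],
                    index=pd.Index([1, 2, 3, 4, 5, 6], name='gameID'))

# cumcount() numbers the rows of each group 0, 1, 2, ... in order of appearance,
# so each value is the number of earlier games won by that same team.
prior_wins_of_winner = winners.groupby(winners).cumcount()
print(prior_wins_of_winner.tolist())  # [0, 0, 0, 1, 0, 2]
```

Note that this count only says something about the team that won the game; the losing team's history is not visible in it.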
To generate HTWins (the home team's total wins), use:
hWin = df.HomeTeam.where(df.HomeWin == 1, df.AwayTeam)
hCnt = hWin.groupby(hWin).cumcount()
df['HTWins'] = hCnt.where(df.HomeWin == 1, 0)
To generate ATWins (the away team's total wins), use:
aWin = df.AwayTeam.where(df.AwayWin == 1, df.HomeTeam)
aCnt = aWin.groupby(aWin).cumcount()
df['ATWins'] = aCnt.where(df.AwayWin == 1, 0)
When you print(df), you get:
HomeTeam AwayTeam HomeWin AwayWin HTWins ATWins
gameID
1 TB CLB 1 0 0 0
2 NY ARZ 0 1 0 0
3 EDM CAN 1 0 0 0
4 NY TB 0 1 0 1
5 NY CLB 1 0 0 0
6 TB NY 1 0 2 0
7 ARZ CAN 1 0 1 0
8 ARZ TB 0 1 0 3
9 NY EDM 1 0 1 0
10 TB CAN 1 0 4 0
To understand how this script works, run each instruction separately and print its result.
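Note that in the printout above, the previous-win count appears only on the side that won the game, and the losing side is always 0. In game 6, for example, ATWins is 0 even though NY won game 5, whereas the table in the question expects 1. If previous wins are wanted for both sides, one possible approach is to reshape the data to one row per team per game and take a running total per team. The following is only a sketch under that assumption: the names games_long and prior are mine, and only the first six games are used so that the result can be compared with the question's table.

```python
import pandas as pd

df = pd.DataFrame(
    data=[[1, 'TB', 'CLB', 1], [2, 'NY', 'ARZ', 0],
          [3, 'EDM', 'CAN', 1], [4, 'NY', 'TB', 0],
          [5, 'NY', 'CLB', 1], [6, 'TB', 'NY', 1]],
    columns=['gameID', 'HomeTeam', 'AwayTeam', 'HomeWin']).set_index('gameID')
df['AwayWin'] = 1 - df['HomeWin']

# Long format: one row per (game, team), holding that team's result in the game.
home = pd.DataFrame({'gameID': df.index.to_numpy(),
                     'team': df['HomeTeam'].to_numpy(),
                     'won': df['HomeWin'].to_numpy(),
                     'side': 'HT'})
away = pd.DataFrame({'gameID': df.index.to_numpy(),
                     'team': df['AwayTeam'].to_numpy(),
                     'won': df['AwayWin'].to_numpy(),
                     'side': 'AT'})
games_long = pd.concat([home, away], ignore_index=True)
games_long = games_long.sort_values('gameID', kind='mergesort')

# Wins before the current game = running total of wins minus the current result.
games_long['prior'] = (games_long.groupby('team')['won'].cumsum()
                       - games_long['won'])

# Back to one row per game, with one column per side.
prior = games_long.pivot(index='gameID', columns='side', values='prior')
df['HTWins'] = prior['HT']
df['ATWins'] = prior['AT']
print(df)
```

On these six games, HTWins comes out as 0, 0, 0, 0, 0, 2 and ATWins as 0, 0, 0, 1, 0, 1, which agrees with the expected columns in the question.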
Answer 1 (score: 1)
There may be a more "elegant" pandas way to do this, but I just broke things down into a for loop and did it that way.
import copy
import pandas as pd
# sep=r'\s+' splits on any run of whitespace (delim_whitespace=True is deprecated in recent pandas)
df = pd.read_csv('sports_data.csv', header=0, sep=r'\s+')
df["HT'sTotWins"] = 0
df["AT'sTotWins"] = 0
# team -> {'home': wins as home team, 'away': wins as away team}
homeWinsAwayWins = {}
homeAwayCount = {'home':0, 'away':0}
for index, row in df.iterrows():
    homeTeam = row['Home_team']
    awayTeam = row['Away_Team']
    # Start a fresh counter the first time a team appears
    if homeTeam not in homeWinsAwayWins:
        homeWinsAwayWins[homeTeam] = copy.deepcopy(homeAwayCount)
    if awayTeam not in homeWinsAwayWins:
        homeWinsAwayWins[awayTeam] = copy.deepcopy(homeAwayCount)
    # Record the totals before this game's result is counted
    df.loc[index,"HT'sTotWins"] = homeWinsAwayWins[homeTeam]['home'] + homeWinsAwayWins[homeTeam]['away']
    df.loc[index,"AT'sTotWins"] = homeWinsAwayWins[awayTeam]['home'] + homeWinsAwayWins[awayTeam]['away']
    homeWin = row['Home_Win']
    awayWin = row['Away_Win']
    # Now add this game's result to the winner's counter
    if homeWin:
        homeWinsAwayWins[homeTeam]['home'] += 1
    elif awayWin:
        homeWinsAwayWins[awayTeam]['away'] += 1
print(df)
It prints what you want.
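Since sports_data.csv is not included in the post, here is a self-contained variant of the same loop idea that can be run directly. It is only a sketch: the inline text stands in for the file (an assumption on my part), and a collections.Counter replaces the nested dictionary because only the combined home and away total is needed for the output.

```python
import io
from collections import Counter

import pandas as pd

# Inline copy of the question's sample data, standing in for sports_data.csv
raw = """Home_team Away_Team Home_Win Away_Win gameID
TB CLB 1 0 1
NY ARZ 0 1 2
EDM CAN 1 0 3
NY TB 0 1 4
NY CLB 1 0 5
TB NY 1 0 6
"""
df = pd.read_csv(io.StringIO(raw), sep=r'\s+')

wins = Counter()  # team -> wins so far; teams not seen yet read as 0
ht_prior, at_prior = [], []
for home, away, home_win, away_win in zip(
        df['Home_team'], df['Away_Team'], df['Home_Win'], df['Away_Win']):
    # Record the totals before this game's result is counted
    ht_prior.append(wins[home])
    at_prior.append(wins[away])
    if home_win:
        wins[home] += 1
    elif away_win:
        wins[away] += 1

df["HT'sTotWins"] = ht_prior
df["AT'sTotWins"] = at_prior
print(df)
```

Collecting the values in lists and assigning them once avoids writing to the DataFrame row by row with df.loc inside the loop.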