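Suppose you have several Pandas data frames containing a season's worth of game data for sports teams. I happen to have data for every NHL game of one season, stored separately for each team. For a single team, the data frame looks like this: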
# An example of a NHL team's data frame (the data are made up):
           Goals for  Goals against  Opponent  O/S  Place  Points  Games played
Date
2015-12-1          3              2   ANAHEIM   OT   Home      15            12
2015-12-3          1              5   CHICAGO  NaN  Visit      15            13
2015-12-5          3              4  MONTREAL   SO   Home      16            14
2015-12-8          1              0    DALLAS  NaN   Home      18            15
...
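What is the most efficient way to determine the teams' daily standings? For simplicity, let's consider only league-level standings rather than conference/division standings. I would like to concatenate those standings into a single data frame that looks like this: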
# Concatenated league level standings by date (the data are made up):
Team       BOSTON  BUFFALO  CALGARY  CAROLINA  ...
Date
2015-12-1       1       32       10        15
2015-12-2       3       28        9         9
2015-12-3       2       26       10         4
2015-12-4       6       27       13         1
2015-12-5       2       25       15         3
2015-12-6       5       28       16         2
...
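I have actually managed to determine the standings myself, but my implementation is quick and dirty. Basically, I (1) loop through every game team A has played (i.e. the rows of that team's data frame), (2) retrieve the latest game another team B has played (i.e. one played before or on the same day), and (3) if team B is higher (i.e. better) in the standings than team A based on the NHL rules (http://sports.espn.go.com/nhl/news/story?page=nhl/tiebreakers), increase team A's current standing by one (i.e. I have added a standing column to each team's data frame, and the default standing is one). After going through all the games of all the teams, I concatenate the standings into one data frame.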
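I strongly feel there is a more efficient way to solve my problem, one that makes better use of what Pandas offers. Because I couldn't figure out how to align rows (i.e. games) from different data frames based on the date index, I had to resort to looping. Moreover, even if I knew how to align the rows, I would not know how to sort across the columns (i.e. rank the teams).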
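To make the alignment problem concrete, here is a minimal sketch of one way it might be approached. It assumes each team's frame is indexed by game date and holds a cumulative `Points` column, as in the example above. The `teams` dictionary and all values are made up for illustration, and every tiebreaker is ignored: teams with equal points simply share the better place.

```python
import pandas as pd

# Hypothetical per-team frames (made-up data), indexed by game date,
# each holding the team's cumulative points after that game.
teams = {
    'BOSTON': pd.DataFrame(
        {'Points': [2, 4, 4]},
        index=pd.to_datetime(['2015-12-01', '2015-12-03', '2015-12-05'])),
    'BUFFALO': pd.DataFrame(
        {'Points': [0, 1, 3]},
        index=pd.to_datetime(['2015-12-01', '2015-12-02', '2015-12-04'])),
    'CALGARY': pd.DataFrame(
        {'Points': [2, 2, 4]},
        index=pd.to_datetime(['2015-12-02', '2015-12-03', '2015-12-05'])),
}

# Align all teams on the union of their dates: one column per team.
points = pd.concat({name: frame['Points'] for name, frame in teams.items()},
                   axis=1)

# Carry each team's latest total forward to days without a game;
# a team has 0 points before its first game.
points = points.sort_index().ffill().fillna(0)

# Rank across the columns of each row: most points gets standing 1.
standings = points.rank(axis=1, ascending=False, method='min').astype(int)
print(standings)
```

`pd.concat(..., axis=1)` takes care of the alignment on dates, and `rank(axis=1)` does the per-row ranking, so no explicit loop over games is needed for this simplified case.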
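I think the technique needed for this rather specific problem could be applied to many similar situations, such as ranking stocks. For instance, if you wanted to rank stocks' daily returns according to some criterion (e.g. industry-level rankings), I imagine that would require an approach very similar to the one needed here.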
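As an illustration of the stock analogy, the following is a rough sketch of ranking daily returns within each industry. The tickers, industries and returns are all invented, and the long-format layout (one row per ticker per day) is an assumption.

```python
import pandas as pd

# Made-up daily returns in long format; tickers and industries are hypothetical.
returns = pd.DataFrame({
    'date': pd.to_datetime(['2015-12-01'] * 4 + ['2015-12-02'] * 4),
    'ticker': ['AAA', 'BBB', 'CCC', 'DDD'] * 2,
    'industry': ['tech', 'tech', 'energy', 'energy'] * 2,
    'ret': [0.010, 0.030, -0.020, 0.005, -0.010, -0.030, 0.020, 0.040],
})

# Rank within each (date, industry) group: the best return gets rank 1.
returns['rank'] = (returns.groupby(['date', 'industry'])['ret']
                          .rank(ascending=False, method='min')
                          .astype(int))

# Wide view: one row per date, one column per ticker.
daily_ranks = returns.pivot(index='date', columns='ticker', values='rank')
print(daily_ranks)
```

Grouping by date and industry plays the same role here that grouping by date alone plays for league-level standings.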
Thanks in advance!
Answer 0 (score: 2)
Based on @JohnE's answer, I managed to come up with this:
import pandas as pd
import numpy as np
# Generating some non-random data
rng = [ '2015-10-01', '2015-10-02', '2015-10-03', '2015-10-04',
        '2015-10-01', '2015-10-03', '2015-10-04', '2015-10-06',
        '2015-10-01', '2015-10-04', '2015-10-05', '2015-10-06' ]
df = pd.DataFrame( { 'Team': [ 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C' ],
                     'Opponent': [ 'B', 'E', 'F', 'G', 'A', 'H', 'I', 'C', 'J', 'K', 'L', 'B' ],
                     'Goals for': [ 4, 2, 6, 1, 5, 5, 7, 1, 1, 2, 1, 2 ],
                     'Goals against': [ 5, 1, 5, 3, 4, 4, 6, 2, 2, 0, 2, 1 ],
                     'OT/SO': [ 'o', np.nan, 's', np.nan, 'o', 'o', 's', 's', np.nan, np.nan, 'o', 's' ] },
                   index = pd.to_datetime( rng ) )
# Calculating basic data
df[ 'Points' ] = 0
df.loc[ ( df[ 'Goals for' ] > df[ 'Goals against' ] ), 'Points' ] = 2
df.loc[ ( df[ 'Goals for' ] < df[ 'Goals against' ] ) & ( df[ 'OT/SO' ].isnull() == False ), 'Points' ] = 1
df[ 'Non-SO Win' ] = df[ 'Points' ] == 2
df.loc[ df[ 'OT/SO' ] == 's', 'Non-SO Win' ] = False
df[ 'Goal differential' ] = df[ 'Goals for' ] - df[ 'Goals against' ]
# Determining the standings
dates = df.index.unique().sort_values()  # game dates in chronological order
standings = []
for date in dates:
    # Aggregating all games played on or before this date
    data = df[ df.index <= date ]
    aggr_data = data.groupby( 'Team' ).agg( { 'Points': [ 'sum', 'count' ],
                                              'Non-SO Win': [ 'sum' ],
                                              'Goal differential': [ 'sum' ] } )
    # Sorting the aggregated df based on (simplified) NHL rules
    aggr_data = aggr_data.sort_values( [ ( 'Points', 'sum' ),              # Points
                                         ( 'Points', 'count' ),            # Games played
                                         ( 'Non-SO Win', 'sum' ),          # Non-SO wins
                                         ( 'Goal differential', 'sum' ) ], # Goal differential
                                       ascending = [ False, True, False, False ] )
    # Adding standings = row numbers after sorting
    standings.append( pd.Series( range( 1, len( aggr_data ) + 1 ),
                                 index = aggr_data.index, name = date ) )
# One row per date, one column per team
results = pd.concat( standings, axis = 1 ).T
results.sort_index( inplace = True )
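My answer is not complete, because it does not take the head-to-head rule into account, which is the most troublesome of the rules, IMO. Apart from that, I think this approach illustrates how using "sort" rather than "rank" is useful when there are multiple ranking criteria.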
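For what it's worth, the raw ingredient of a head-to-head tiebreaker could be sketched roughly as below. The helper `head_to_head_points` is a hypothetical name, the game log is made up, and the actual NHL rule has additional conditions that this sketch ignores; it only sums the points one team earned against another up to a given date.

```python
import pandas as pd

# Made-up game log in the same long format as above (one row per team per game).
games = pd.DataFrame({
    'Team':     ['A', 'B', 'A', 'B', 'A', 'C'],
    'Opponent': ['B', 'A', 'B', 'A', 'C', 'A'],
    'Points':   [1, 2, 2, 0, 2, 0],
}, index=pd.to_datetime(['2015-10-01', '2015-10-01', '2015-10-05',
                         '2015-10-05', '2015-10-07', '2015-10-07']))

def head_to_head_points(games, team, opponent, date):
    """Points `team` earned against `opponent` in games on or before `date`.

    Hypothetical helper for illustration only.
    """
    mask = ((games['Team'] == team)
            & (games['Opponent'] == opponent)
            & (games.index <= pd.Timestamp(date)))
    return int(games.loc[mask, 'Points'].sum())

a_vs_b = head_to_head_points(games, 'A', 'B', '2015-10-06')
b_vs_a = head_to_head_points(games, 'B', 'A', '2015-10-06')
print(a_vs_b, b_vs_a)
```

Plugging something like this into the sort would still require comparing tied teams pairwise, which is why it does not fit neatly into a single `sort_values` call.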
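Answer 1 (score: 1)
Sample data below. I'll use 'team' rather than opponent because it seems more natural, but it doesn't really matter. Either the team or its opponent (but not both) is all you need for a basic ranking, although you would need both to compute head-to-head tiebreakers. But let's just get started.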
import numpy as np
import pandas as pd
np.random.seed(123)
rng = pd.date_range('2015-12-1',periods=4)
df = pd.DataFrame({ 'team': ['stlouis']*4 + ['chicago']*4 + ['carolina']*4,
                    'goals_for': np.random.randint(0, 5, 12),
                    'goals_against': np.random.randint(0, 5, 12) },
                  index=np.tile(rng, 3))
df['points'] = np.select( [ df.goals_for > df.goals_against,
                            df.goals_for == df.goals_against ],
                          [ 2, 1 ] )  # 2 for win, 1 for tie
df = df[['team','goals_for','goals_against','points']]
                team  goals_for  goals_against  points
2015-12-01   stlouis          2              0       2
2015-12-02   stlouis          4              0       2
2015-12-03   stlouis          2              1       2
2015-12-04   stlouis          1              3       0
2015-12-01   chicago          3              4       0
2015-12-02   chicago          2              0       2
2015-12-03   chicago          3              0       2
2015-12-04   chicago          1              4       0
2015-12-01  carolina          1              1       1
2015-12-02  carolina          0              3       0
2015-12-03  carolina          1              2       0
2015-12-04  carolina          1              4       0
Now a small loop over each date:
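frames = []
for r in rng:
    # Total points from all games played on or before date r
    points = df[df.index <= r].groupby('team')['points'].sum()
    # Most points -> rank 1 (ties get the average rank by default)
    standings = points.rank(ascending=False)
    frames.append(standings)
results = pd.concat(frames, axis=1)
results.columns = rng.strftime('%Y-%m-%d')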
I don't think anything in there is too complicated. Here is the intermediate output (for the final date):
points
team
carolina 1
chicago 4
stlouis 6
standings
team
carolina 3
chicago 2
stlouis 1
And the final table, which is just the concatenation of the standings for every date (transposed):
results.T
            carolina  chicago  stlouis
2015-12-01         2        3        1
2015-12-02         3        2        1
2015-12-03         3        2        1
2015-12-04         3        2        1
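If points are the only criterion, the per-date loop could probably be replaced by a pivot plus a cumulative sum. The following is only a sketch under that assumption, with made-up data in the same long format; `method='min'` is a choice made here so that tied teams share the better place instead of getting an average rank.

```python
import pandas as pd

# Made-up long-format results: one row per team per game day.
dates = pd.date_range('2015-12-01', periods=3)
games = pd.DataFrame({
    'team': ['stlouis'] * 3 + ['chicago'] * 3 + ['carolina'] * 3,
    'points': [2, 2, 0, 0, 2, 2, 1, 0, 0],
}, index=dates.tolist() * 3)

# Move the date index into a column so it can be used as the pivot index.
long_games = games.rename_axis('date').reset_index()

# Points earned per day: rows are dates, columns are teams, 0 on days off.
daily = long_games.pivot_table(index='date', columns='team', values='points',
                               aggfunc='sum', fill_value=0)

# Running totals, then rank across teams within each date.
totals = daily.cumsum()
standings = totals.rank(axis=1, ascending=False, method='min').astype(int)
print(standings)
```

This produces the transposed layout directly (dates as rows, teams as columns), but it cannot express multi-criteria tiebreakers the way a sort can.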