我正在处理有关大联盟棒球比赛出勤率的数据。
我正在尝试在数据框中创建一个新列,以返回指定对手球队比赛的最接近日期(但不能在给定日期之后)。
例如,对于包含有关洛杉矶天使游戏数据的行:
Game_Num Date Team Win Attendance Net Wins
23 2010-04-05 LAA 1 43504 12
我想查找洛杉矶道奇('LAD')游戏的最近日期并将其附加在新列中。
我的最终目标是创建另一个列,该列显示比赛中对手球队的净胜利是什么,以便我可以查看是否另一支球队在影响门票销售的情况下表现出色。
到目前为止,这是我尝试过的:
for index, row in bbattend.iterrows():
if row['Team'] == 'LAA':
basedate = row['Date']
tempdf = bbattend.loc[(bbattend['Team'] == 'LAD') & (bbattend['Date'] < basedate)]
tempdf['Datediff'] = abs(basedate-tempdf['Date']).days
mindiff = tempdf['Datediff'].min()
bbattend['CloseRivalDate'] = tempdf[tempdf['Date']==mindiff]['Date']
bbattend['RivalNetWins'] = tempdf[tempdf['Date']==mindiff]['Net_Wins']
bbattend['RivalWinPer'] = tempdf[tempdf['Date']==mindiff]['Win_Per']
这是我从中得到的错误:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-128-f2be88528772> in <module>
3 basedate = row['Date']
4 tempdf = bbattend.loc[(bbattend['Team'] == 'LAD') & (bbattend['Date'] < basedate)]
----> 5 tempdf['Datediff'] = abs(basedate-tempdf['Date']).days
6 mindiff = tempdf['Datediff'].min()
7 bbattend['CloseRivalDate'] = tempdf[tempdf['Date']==mindiff]['Date']
~/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in __getattr__(self, name)
5065 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5066 return self[name]
-> 5067 return object.__getattribute__(self, name)
5068
5069 def __setattr__(self, name, value):
AttributeError: 'Series' object has no attribute 'days'
这是我的数据框代码,以防万一:
import requests
import pandas as pd
import numpy as np
from datetime import datetime
import re
Teams = ['LAA', 'LAD', 'NYY', 'NYM', 'CHC', 'CHW', 'OAK', 'SFG']
Years = []
for year in range(2010,2020):
Years.append(str(year))
list_of_df = list()
for team in Teams:
for year in Years:
url = 'https://www.baseball-reference.com/teams/' + team + '/' + year +'-schedule-scores.shtml'
dfname = team + '_' + year
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
#Formatting data table
df.rename(columns={"Gm#": "GM_Num", "Unnamed: 4": "Home", "Tm": "Team", "D/N": "Night"}, inplace = True)
df['Home'] = df['Home'].apply(lambda x: 0 if x == '@' else 1)
df['Game_Win'] = df['W/L'].astype(str).str[0]
df['Game_Win'] = df['Game_Win'].apply(lambda x: 0 if x == 'L' else 1)
df['Night'] = df['Night'].apply(lambda x: 1 if x == 'N' else 0)
df['Streak'] = df['Streak'].apply(lambda x: -1*len(x) if '-' in x else len(x))
df.drop('Unnamed: 2', axis=1, inplace = True)
df.drop('Orig. Scheduled', axis=1, inplace = True)
df.drop('Win', axis=1, inplace = True)
df.drop('Loss', axis=1, inplace = True)
df.drop('Save', axis=1, inplace = True)
#Drop rows that do not have data
df = df[df['GM_Num'].str.isdigit()]
WL = df["W-L"].str.split("-", n = 1, expand = True)
df["Wins"] = WL[0].astype(dtype=np.int64)
df["Losses"] = WL[1].astype(dtype=np.int64)
df['Net_Wins'] = df['Wins'] - df['Losses']
df['Win_Per'] = df['Wins']/(df['Wins']+df['Losses'])
DayDate = df['Date'].str.split(", ", n = 1, expand = True)
df['DayOfWeek'] = DayDate[0]
df['Date'] = DayDate[1] + ', ' + year
df['Date'] = [re.sub("\s\(\d+\)", "", str(x)) for x in df['Date']]
df['Date'] = pd.to_datetime(df['Date'], format='%b %d, %Y')
list_of_df.append(df)
bbattend = pd.concat(list_of_df)
bbattend
我知道这绝对不是最有效的方法,但是可以得到我想要的结果。
答案 0 :(得分:0)
您所看到的特定错误可以通过替换行来解决
tempdf['Datediff'] = abs(basedate-tempdf['Date']).days
作者
tempdf['Datediff'] = abs(basedate-tempdf['Date']).dt.days
但是,您的代码仍然会产生不正确的结果。
我认为,执行所需操作的正确方法是在表上执行自联接,合并Team
和Opp
,然后计算团队之间的日期差比赛日期和赛季中每场Opp的比赛,对在Team比赛之前发生的Opp比赛进行过滤,最后,仅保留与Team比赛最接近的Opp比赛。
这是执行此操作的代码:
# Create column Year to help on self-join
bbattend["Year"] = bbattend.Date.dt.year
# Create merged table
merged = bbattend.merge(
bbattend[["Date", "Year", "Team", "Net_Wins", "Win_Per"]],
how="inner",
left_on=["Year", "Opp"],
right_on=["Year", "Team"],
suffixes=('', '_opp')
)
merged["date_diff"] = (merged.Date - merged.Date_opp).dt.days
# Keep only closest game from Opp
def get_closest_date(g):
row_to_keep = g.date_diff.idxmin()
return g.loc[row_to_keep, ["Date_opp", "Team_opp", "Net_Wins_opp", "Win_Per_opp", "date_diff"]]
merged.groupby(bbattend.columns.to_list()).apply(get_closest_date).reset_index()
结果表将在上一场比赛之后添加Opp团队的Net_wins
和Win_per
列。
注意:
Team
列中的游戏。Team
和Opp
处于切换状态。如果您想摆脱这些,只需过滤Home == 1
最后,这确实推动了我在大熊猫工作中的极限。如果表变大,则自我联接将需要更多的内存。我建议您将此表加载到关系数据库中(sqlite是Python附带的最简单的用法),并使用SQL进行此计算。
答案 1 :(得分:0)
这是我最终使用的最终代码: 它基于@foglerit的答案
init
我添加了date_diff必须为正的条件,因为我想要在比赛之前发生的同一市场团队的比赛日期。
然后我按date_diff对数据框进行排序,并删除game_id的重复项,以便最终数据框仅具有最小的date_diff。