Here is the code for the loop I am using to try to build a master DataFrame.
import requests
import pandas as pd
"""
from: http://stats.nba.com/league/player/#!/advanced/
"""
u_a = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.82 Safari/537.36"
advanced = "http://stats.nba.com/stats/leaguedashplayerstats?College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&LastNGames=0&LeagueID=00&Location=&MeasureType=Advanced&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Totals&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight="
passing = "http://stats.nba.com/stats/leaguedashptstats?College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&Height=&LastNGames=0&LeagueID=00&Location=&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PerMode=PerGame&PlayerExperience=&PlayerOrTeam=Player&PlayerPosition=&PtMeasureType=Possessions&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight="
scoring = "http://stats.nba.com/stats/leaguedashplayerstats?College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&LastNGames=0&LeagueID=00&Location=&MeasureType=Scoring&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight="
url_list = [advanced,passing,scoring]
master_df = []
for i in url_list:
    r = requests.get(i, headers={"USER-AGENT":u_a})
    r.raise_for_status()
    headers = []
    for item in r.json()['resultSets']:
        for val in item['headers']:
            headers.append(val)
    df = []
    for item in r.json()['resultSets']:
        for row in item['rowSet']:
            row_df = []
            for val in row:
                row_df.append(val)
            df.append(row_df)
    master_df.append(df)
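The loop works, but it stacks each set of data on top of the previous one. I want to merge the data so that identical columns are not duplicated and, if that makes sense, the new data from each JSON file is added as additional columns. I also want a column name added to the headers only if it is new.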
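Answer 0 (score: 0)
You are not using the headers, and you are not creating a DataFrame.
This is probably close to what you want, although I think you may want to end up with a list for each URL (pd.concat those into a single DataFrame, then add that to master_df_list), since the URLs seem to return the same data.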
# Keeping your import statements etc as per your question
[...]
master_df_list = []
for i in url_list:
    # Option: you may want to create a list here
    # to concat before adding to master_df
    # url_df_list = []
    r = requests.get(i, headers={"USER-AGENT":u_a})
    r.raise_for_status()
    data = r.json()
    # Get the headers
    headers = data['resultSets'][0]['headers']
    # And the rowSet (whatever that is...)
    shot_data = data['resultSets'][0]['rowSet']
    # Create a beautiful df from that
    df = pd.DataFrame(shot_data,columns=headers)
    master_df_list.append(df)
    # Option:
    # url_df_list.append(df)
    # In which case you would concat too
    # df_concat = pd.concat(url_df_list)
    # master_df_list.append(df_concat)
# Concat
master_df = pd.concat(master_df_list)
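To see why a plain pd.concat still stacks the frames rather than widening them, here is a minimal sketch on toy data. The two small frames and their values are made up for illustration and only borrow column names in the style of the API. It contrasts the default row-wise concat with a column-wise concat on a shared index.

```python
import pandas as pd

# Toy stand-ins for two per-URL frames; the values are invented for illustration.
advanced_df = pd.DataFrame(
    {"PLAYER_ID": [1, 2], "AGE": [25, 30], "PACE": [98.5, 101.2]}
)
scoring_df = pd.DataFrame(
    {"PLAYER_ID": [1, 2], "AGE": [25, 30], "PCT_PTS_3PT": [0.31, 0.42]}
)

# Default axis=0 stacks rows: each player appears twice, and a column that is
# missing from one frame is filled with NaN for that frame's rows.
stacked = pd.concat([advanced_df, scoring_df], ignore_index=True)

# Indexing by the shared key and concatenating along axis=1 places the frames
# side by side. The repeated AGE column is dropped from the second frame first.
side_by_side = pd.concat(
    [
        advanced_df.set_index("PLAYER_ID"),
        scoring_df.set_index("PLAYER_ID").drop(columns="AGE"),
    ],
    axis=1,
).reset_index()

print(stacked)
print(side_by_side)
```

The column-wise variant only lines up correctly when the key uniquely identifies a row in every frame, which is an assumption about the real data that would need checking.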
Answer 1 (score: 0)
Consider a chained merge across the list of DataFrames with reduce(lambda ..., pd.merge):
from functools import reduce
...
url_list = [advanced,passing,scoring]
dfList = []
for i in url_list:
    r = requests.get(i, headers={"USER-AGENT":u_a})
    r.raise_for_status()
    data = r.json()
    df = pd.DataFrame(data["resultSets"][0]["rowSet"],
                      columns=data["resultSets"][0]["headers"])
    dfList.append(df)
finaldf = reduce(lambda left,right: pd.merge(left, right,
                 on=['PLAYER_ID', 'PLAYER_NAME', 'TEAM_ID', 'TEAM_ABBREVIATION']), dfList)
Note that any repeated fields such as Age, W, and L (which do not appear in every DataFrame) will be suffixed with _x, _y.
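If the suffixed duplicates are unwanted, one possible variation is to carry over only the columns that the accumulated frame does not have yet. This is a sketch on toy data, not part of the answer above: the helper name merge_new_columns, the shortened KEYS list, and all values are hypothetical. It also assumes that same-named columns hold the same values in every frame, which may not be true when the URLs use different PerMode settings.

```python
from functools import reduce

import pandas as pd

# Shortened key list for the toy example; the answer above uses four key columns.
KEYS = ["PLAYER_ID", "TEAM_ID"]


def merge_new_columns(left, right):
    """Merge `right` into `left`, keeping only columns `left` does not have yet."""
    new_cols = [c for c in right.columns if c not in left.columns]
    return left.merge(right[KEYS + new_cols], on=KEYS)


# Toy stand-ins for the three per-URL frames; the values are invented.
frames = [
    pd.DataFrame({"PLAYER_ID": [1, 2], "TEAM_ID": [10, 20],
                  "AGE": [25, 30], "PACE": [98.5, 101.2]}),
    pd.DataFrame({"PLAYER_ID": [1, 2], "TEAM_ID": [10, 20],
                  "AGE": [25, 30], "TOUCHES": [55.0, 61.0]}),
    pd.DataFrame({"PLAYER_ID": [1, 2], "TEAM_ID": [10, 20],
                  "AGE": [25, 30], "PCT_PTS_3PT": [0.31, 0.42]}),
]

final_df = reduce(merge_new_columns, frames)
print(final_df.columns.tolist())
```

Because each repeated column is taken from the first frame that contains it, the order of the frames decides which copy survives.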