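I'm very new to web scraping and am having some trouble scraping NBA player data from nba.com. I first tried scraping the page with bs4, but ran into a problem that, after some research and based on the articles I've read, I believe is caused by "XHR". I was able to find the URL for the JSON-formatted data, but my Python program seems to hang and never loads the data. Again, I'm very new to web scraping, so I figured I'd ask whether I'm way off base here... Any suggestions? Thanks! (Code below)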
import requests
import json
url = "http://stats.nba.com/stats/leaguedashplayerstats?College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2017-18&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight="
resp = requests.get(url=url)
data = json.loads(resp.text)
print(data)
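Answer 0 (score: 1)
Give this a try. It prints the categories from that page that correspond to the fields I defined. By the way, your first attempt got no response because the site expects a User-Agent header in the request, to make sure the request comes from a real browser rather than a bot. I faked one and got the response.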
import requests
url = "http://stats.nba.com/stats/leaguedashplayerstats?College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2017-18&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight="
resp = requests.get(url,headers={'User-Agent':'Mozilla/5.0'})
data = resp.json()
storage = data['resultSets']
for elem in storage:
    all_list = elem['rowSet']
    for item in all_list:
        # The first four columns of each row hold the player and team identifiers.
        Player_Id = item[0]
        Player_name = item[1]
        Team_Id = item[2]
        Team_abbr = item[3]
        print("Player_Id: {} Player_name: {} Team_Id: {} Team_abbr: {}".format(
            Player_Id, Player_name, Team_Id, Team_abbr))
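If you would rather not hard-code the column positions, here is a rough sketch that pairs each row with its column names. It assumes every entry in resultSets carries a 'headers' list next to 'rowSet'; that key, the helper name rows_to_dicts, and the sample payload are illustrative and not taken from the code above, so check the real response before relying on them.

```python
def rows_to_dicts(payload):
    """Turn each row of every result set into a {column_name: value} dict."""
    records = []
    for result_set in payload["resultSets"]:
        # Assumption: each result set lists its column names under "headers".
        headers = result_set["headers"]
        for row in result_set["rowSet"]:
            records.append(dict(zip(headers, row)))
    return records


# Made-up payload that only mimics the assumed shape of the response.
sample = {
    "resultSets": [
        {
            "headers": ["PLAYER_ID", "PLAYER_NAME", "TEAM_ID", "TEAM_ABBREVIATION"],
            "rowSet": [[1, "Player A", 10, "AAA"], [2, "Player B", 20, "BBB"]],
        }
    ]
}

for record in rows_to_dicts(sample):
    print(record["PLAYER_NAME"], record["TEAM_ABBREVIATION"])
```

With the real response you would call rows_to_dicts(resp.json()) and look values up by name instead of by index.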
Answer 1 (score: 0)
I just realized it was because the User-Agent header was different... once I added it, it worked.
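Since the original symptom was a script that hangs forever, it may also help to set a timeout so a missing header shows up as an error instead of a stall. Below is a minimal standard-library sketch using urllib.request rather than requests. The helper names build_request and fetch_json and the 10-second default are my own choices, not anything the site requires.

```python
import json
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    """Return a Request object carrying a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_json(url, timeout=10):
    """Fetch url and decode its JSON body.

    Raises an exception (for example on timeout or an HTTP error status)
    instead of blocking indefinitely.
    """
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With requests the equivalent is passing timeout=10 alongside headers in requests.get, and calling resp.raise_for_status() before resp.json().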