Python Scrape NBA Tracking Drives Data

Time: 2019-11-19 16:20:49

Tags: python web-scraping get google-chrome-devtools

I am fairly new to Python. I am trying to scrape the NBA Drives data from https://stats.nba.com/players/drives/

I used Chrome DevTools to find the API URL, then used the requests package to get the JSON string.

Original code:

import requests
headers = {"User-Agent": "Mozilla/5.0..."}
url = "https://stats.nba.com/stats/leaguedashptstats?College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&Height=&LastNGames=0&LeagueID=00&Location=&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PerMode=PerGame&PlayerExperience=&PlayerOrTeam=Player&PlayerPosition=&PtMeasureType=Drives&Season=2019-20&SeasonSegment=&SeasonType=Regular+Season&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight="
r = requests.get(url, headers=headers)
d = r.json()

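However, this no longer works. For some reason, the request to the URL below times out on the NBA server, so I need to find a new way to get this information.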

<https://stats.nba.com/stats/leaguedashptstats?College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&Height=&LastNGames=0&LeagueID=00&Location=&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PerMode=PerGame&PlayerExperience=&PlayerOrTeam=Player&PlayerPosition=&PtMeasureType=Drives&Season=2019-20&SeasonSegment=&SeasonType=Regular+Season&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight=>

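Looking through Chrome DevTools, I found that the JSON string I need appears in the Response tab of the XHR request under Network. Is there any way to scrape that into Python? See the image below.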

[Image: Chrome DevTools: XHR Response JSON string]

1 answer:

Answer 0 (score: 2)

I tested the URL with additional headers (the ones I saw for this request in DevTools), and it seems to need the Referer header to work.

import requests

headers = {
    'User-Agent': 'Mozilla/5.0',
    #'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0',
    'Referer': 'https://stats.nba.com/players/drives/',
    #'Accept': 'application/json, text/plain, */*',
}

url = 'https://stats.nba.com/stats/leaguedashptstats?College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&Height=&LastNGames=0&LeagueID=00&Location=&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PerMode=PerGame&PlayerExperience=&PlayerOrTeam=Player&PlayerPosition=&PtMeasureType=Drives&Season=2019-20&SeasonSegment=&SeasonType=Regular+Season&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight='
r = requests.get(url, headers=headers)
data = r.json()

print(data)
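
Since the question reports the request hanging until it times out, it may help to wrap the call so that it fails fast. The following is only a minimal sketch and not part of the tested code above: `fetch_json` is a hypothetical helper name, the 10-second timeout is an arbitrary choice, and `session` is assumed to be a `requests.Session()` or any object with a compatible `get` method.

```python
def fetch_json(session, url, headers=None, params=None, timeout=10):
    """Return the decoded JSON body, failing fast instead of hanging.

    `session` is assumed to be a requests.Session() or any object whose
    get() accepts the same keyword arguments (hypothetical helper).
    """
    # By default requests sets no timeout, so pass one explicitly.
    response = session.get(url, headers=headers, params=params, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.json()
```

With the `url` and `headers` from the code above, this would be called as `data = fetch_json(requests.Session(), url, headers=headers)`. If the server does not answer in time, the call should raise an exception from `requests.exceptions` after roughly ten seconds rather than waiting indefinitely.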

BTW: the same, but with the parameters as a dictionary, which makes it easier to set different values

import requests

headers = {
    'User-Agent': 'Mozilla/5.0',
    #'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0',
    'Referer': 'https://stats.nba.com/players/drives/',
    #'Accept': 'application/json, text/plain, */*',
}

url = 'https://stats.nba.com/stats/leaguedashptstats'

params = {
    'College': '',
    'Conference': '',
    'Country': '',
    'DateFrom': '',
    'DateTo': '',
    'Division': '',
    'DraftPick': '',
    'DraftYear': '',
    'GameScope': '',
    'Height': '',
    'LastNGames': '0',
    'LeagueID': '00',
    'Location': '',
    'Month': '0',
    'OpponentTeamID': '0',
    'Outcome': '',
    'PORound': '0',
    'PerMode': 'PerGame',
    'PlayerExperience': '',
    'PlayerOrTeam': 'Player',
    'PlayerPosition': '',
    'PtMeasureType': 'Drives',
    'Season': '2019-20',
    'SeasonSegment': '',
    'SeasonType': 'Regular Season',
    'StarterBench': '',
    'TeamID': '0',
    'VsConference': '',
    'VsDivision': '',
    'Weight': '',
}

r = requests.get(url, headers=headers, params=params)
#print(r.request.url)
data = r.json()

print(data)
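
To illustrate why the dictionary form is easier to vary, here is a small sketch that copies the base parameters and overrides one of them. `with_overrides` is a hypothetical helper, `BASE_PARAMS` is abbreviated to three of the keys above, and the season `2018-19` is only an illustrative value. It uses the standard library's `urlencode` to preview the query string without sending anything; as far as I know requests encodes `params` in much the same way, and the commented-out `print(r.request.url)` line above shows the real result.

```python
from urllib.parse import urlencode

# Abbreviated copy of the params dictionary above (illustration only).
BASE_PARAMS = {
    'PtMeasureType': 'Drives',
    'Season': '2019-20',
    'SeasonType': 'Regular Season',
}


def with_overrides(base, **overrides):
    """Return a new params dict; the base dict is left unchanged."""
    params = dict(base)
    params.update(overrides)
    return params


params = with_overrides(BASE_PARAMS, Season='2018-19')
query = urlencode(params)  # spaces become '+', as in the hand-written URL
print(query)  # PtMeasureType=Drives&Season=2018-19&SeasonType=Regular+Season
```

The resulting `params` dictionary can be passed to `requests.get(url, headers=headers, params=params)` exactly as in the code above.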