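I am scraping player names from the NBA website. The player-name page is built as a single-page application, and the players are spread across several pages in alphabetical order. I am unable to extract the names of all the players. Here is the link: https://in.global.nba.com/playerindex/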
from selenium import webdriver
from bs4 import BeautifulSoup

class make():
    def __init__(self):
        self.first = ""
        self.last = ""

driver = webdriver.PhantomJS(executable_path=r'E:\Downloads\Compressed\phantomjs-2.1.1-windows\bin\phantomjs.exe')
driver.get('https://in.global.nba.com/playerindex/')
html_doc = driver.page_source
soup = BeautifulSoup(html_doc, 'lxml')
names = []
layer = soup.find_all("a", class_="player-name ng-isolate-scope")
for a in layer:
    span = a.find("span", class_="ng-binding")
    thing = make()
    thing.first = span.text
    spans = a.find("span", class_="ng-binding").find_next_sibling()
    thing.last = spans.text
    names.append(thing)
Answer 0 (score: 2)
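When dealing with a SPA, you should not try to extract information from the DOM, because the DOM is incomplete unless a JS-capable browser runs to populate it.
However, most SPAs load their data through AJAX requests, so you need to monitor the network requests in the developer console (F12).
Here, https://in.global.nba.com/playerindex/
loads its player data from
https://in.global.nba.com/stats2/league/playerlist.json?locale=en
Simulate that request yourself, then pick out whatever you need.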
import requests

if __name__ == '__main__':
    page_url = 'https://in.global.nba.com/playerindex/'
    s = requests.Session()
    s.headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0'}

    # visit the homepage to populate session with necessary cookies
    res = s.get(page_url)
    res.raise_for_status()

    json_url = 'https://in.global.nba.com/stats2/league/playerlist.json?locale=en'
    res = s.get(json_url)
    res.raise_for_status()
    data = res.json()

    player_names = [p['playerProfile']['displayName'] for p in data['payload']['players']]
    print(player_names)
Output:
['Steven Adams', 'Bam Adebayo', 'Deng Adel', 'LaMarcus Aldridge', 'Kyle Alexander', 'Nickeil Alexander-Walker', ...
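Since the page lists players alphabetically, it may help to post-process the JSON once it is downloaded. The following is only a minimal sketch: it assumes the response keeps the `payload` → `players` → `playerProfile` → `displayName` shape used above, and the helper names `extract_names` and `group_by_initial` as well as the `sample` dictionary are made up for illustration (the sample is hand-written, not real API output). It skips entries without a display name rather than raising `KeyError`, and it treats the last word of the display name as the surname, which is an approximation.

```python
from collections import defaultdict


def extract_names(data):
    """Return display names from a dict shaped like the playerlist.json response."""
    players = data.get('payload', {}).get('players', [])
    names = []
    for p in players:
        # Skip entries that lack a profile or a display name instead of raising KeyError.
        name = p.get('playerProfile', {}).get('displayName')
        if name:
            names.append(name)
    return names


def group_by_initial(names):
    """Group names by the first letter of their last word (assumed to be the surname)."""
    groups = defaultdict(list)
    for name in names:
        parts = name.split()
        if not parts:
            continue  # ignore names that are only whitespace
        groups[parts[-1][0].upper()].append(name)
    return dict(groups)


# Hand-written sample in the assumed response shape (hypothetical, not real API output).
sample = {
    'payload': {
        'players': [
            {'playerProfile': {'displayName': 'Steven Adams'}},
            {'playerProfile': {'displayName': 'Bam Adebayo'}},
            {'playerProfile': {}},  # no displayName, so this entry is skipped
            {'playerProfile': {'displayName': 'Kyle Alexander'}},
        ]
    }
}

names = extract_names(sample)
print(names)
print(group_by_initial(names))
```

To use it with the live data, pass the dictionary returned by `res.json()` to `extract_names` in place of `sample`.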