How to scrape a single-page application website in Python with bs4

Date: 2019-07-16 17:52:57

Tags: web-scraping beautifulsoup

I'm scraping player names from the NBA website. The player index page is built as a single-page application, with players spread alphabetically across several pages. I can't extract all of the player names. Here is the link: https://in.global.nba.com/playerindex/

from selenium import webdriver
from bs4 import BeautifulSoup

class make():
    def __init__(self):
        self.first = ""
        self.last = ""

driver = webdriver.PhantomJS(executable_path=r'E:\Downloads\Compressed\phantomjs-2.1.1-windows\bin\phantomjs.exe')

driver.get('https://in.global.nba.com/playerindex/')

html_doc = driver.page_source


soup = BeautifulSoup(html_doc,'lxml')

names = []

layer = soup.find_all("a", class_="player-name ng-isolate-scope")
for a in layer:
    span = a.find("span", class_="ng-binding")
    thing = make()
    thing.first = span.text
    thing.last = span.find_next_sibling().text
    names.append(thing)

1 Answer:

Answer 0 (score: 2)

When dealing with an SPA, you shouldn't try to extract information from the DOM, because the DOM is incomplete unless you run a JS-capable browser to populate it with data.

But most SPAs load their data via AJAX requests, so you need to monitor the network requests in the developer console (F12).

Here, https://in.global.nba.com/playerindex/ loads its player data from https://in.global.nba.com/stats2/league/playerlist.json?locale=en

Simulate that request yourself, then pick out whatever you need.

import requests

if __name__ == '__main__':
    page_url = 'https://in.global.nba.com/playerindex/'
    s = requests.Session()
    s.headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0'}

    # visit the homepage to populate session with necessary cookies
    res = s.get(page_url)
    res.raise_for_status()

    json_url = 'https://in.global.nba.com/stats2/league/playerlist.json?locale=en'
    res = s.get(json_url)
    res.raise_for_status()
    data = res.json()

    player_names = [p['playerProfile']['displayName'] for p in data['payload']['players']]
    print(player_names)

Output:

['Steven Adams', 'Bam Adebayo', 'Deng Adel', 'LaMarcus Aldridge', 'Kyle Alexander', 'Nickeil Alexander-Walker', ...
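Since the question's original code collected first and last names separately (via the make objects), the displayName values from this JSON can be split the same way after the fact. A minimal sketch; split_names and the hand-built sample payload are illustrative only, the sample simply mirrors the payload/players/playerProfile structure used in the code above:

```python
def split_names(data):
    """Return (first, last) pairs from a playerlist-style payload."""
    pairs = []
    for p in data['payload']['players']:
        # Split on the first space: everything after it is treated as the last name
        first, _, last = p['playerProfile']['displayName'].partition(' ')
        pairs.append((first, last))
    return pairs

# Hypothetical sample mimicking the real response shape
sample = {'payload': {'players': [
    {'playerProfile': {'displayName': 'Steven Adams'}},
    {'playerProfile': {'displayName': 'Bam Adebayo'}},
]}}

print(split_names(sample))  # [('Steven', 'Adams'), ('Bam', 'Adebayo')]
```

Note that a first-space split is a heuristic: hyphenated surnames like "Alexander-Walker" stay intact, but multi-word first names would be split incorrectly.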