Unable to find <div ng-view> on the NBA Stats website with BeautifulSoup (Python)

Asked: 2019-06-06 03:33:09

Tags: python selenium web-scraping beautifulsoup python-requests

I'm trying to scrape this NBA site, https://stats.nba.com/team/1610612738/. What I want to do is extract each player's name, NO, POS and all the other information for every player. The problem is that I can't find (or rather, my code can't find) the <nba-stat-table> inside the <div ng-view> parent where the table is located.

My code so far:

from selenium import webdriver
from bs4 import BeautifulSoup

def get_Player():
    driver = webdriver.PhantomJS(executable_path=r'D:\Documents\Python\Web Scraping\phantomjs.exe')

    url = 'https://stats.nba.com/team/1610612738/'

    driver.get(url)

    data = driver.page_source.encode('utf-8')

    soup = BeautifulSoup(data, 'lxml')

    div1 = soup.find('div', class_="columns / small-12 / section-view-overlay")
    print(div1.find_all('div'))

get_Player()

3 Answers:

Answer 0 (score: 2)

Use the JSON endpoint the page itself calls to fetch that content. It's easier and lighter to work with, and it doesn't require Selenium. You can find it in the Network tab of your browser's developer tools.

import requests
import pandas as pd

# A browser-like User-Agent is needed; stats.nba.com may not respond to the default one.
r = requests.get('https://stats.nba.com/stats/commonteamroster?LeagueID=00&Season=2018-19&TeamID=1610612738',
                 headers={'User-Agent': 'Mozilla/5.0'}).json()

# The first result set holds the roster rows; the column names come with it.
players_info = r['resultSets'][0]
df = pd.DataFrame(players_info['rowSet'], columns=players_info['headers'])
print(df.head())
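
If only the name, number and position are needed, the DataFrame can be sliced down to those columns. A sketch; the exact header names such as PLAYER, NUM and POSITION are assumed from the 2018-19 response and may differ:

# Assumed column names -- check players_info['headers'] for the real ones.
roster = df[['PLAYER', 'NUM', 'POSITION']]
print(roster.to_string(index=False))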


Answer 1 (score: 1)

The find_all function always returns a list; findChildren() returns all children of a Tag object (more details).
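
A minimal standalone sketch of that difference, using a toy table rather than the NBA page:

from bs4 import BeautifulSoup

html = "<table><tbody><tr><td>Jayson Tatum</td><td>#0</td></tr></tbody></table>"
soup = BeautifulSoup(html, 'lxml')

print(soup.find_all('td'))              # always a list of Tag objects
print(soup.find('tr').findChildren())   # every child tag of the <tr>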

Replace your code:

div1 = soup.find('div', class_="columns / small-12 / section-view-overlay")
print(div1.find_all('div')) 

With:

div = soup.find('div', {'class':"nba-stat-table__overflow"})
for tr in div.find("tbody").find_all("tr"):
    for td in tr.findChildren():
        print(td.text)

Update:

from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def get_Player():
    driver = webdriver.PhantomJS(executable_path=r'D:\Documents\Python\Web Scraping\phantomjs.exe')

    url = 'https://stats.nba.com/team/1610612738/'
    driver.get(url)

    # Wait until the Angular view has rendered the roster table
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "nba-stat-table__overflow")))

    data = driver.page_source.encode('utf-8')
    soup = BeautifulSoup(data, 'lxml')

    # The roster rows live inside the div with class nba-stat-table__overflow
    div = soup.find('div', {'class': "nba-stat-table__overflow"})
    for tr in div.find("tbody").find_all("tr"):
        for td in tr.findChildren():
            print(td.text)

get_Player()

Output:

Jayson Tatum
Jayson Tatum
#0
F
6-8
208 lbs
MAR 03, 1998
21
1
Duke
Jonathan Gibson
Jonathan Gibson
#3
G
6-2
185 lbs
NOV 08, 1987
31
2
New Mexico State
....
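
If one dict per player is easier to work with than a flat stream of cell values, the header row can be zipped against each body row. A sketch, assuming soup has been built from the rendered page as in get_Player above and that the table has a <thead> with <th> labels:

def rows_as_dicts(soup):
    # Pair each row's <td> values with the <th> header texts.
    table = soup.find('div', {'class': "nba-stat-table__overflow"})
    headers = [th.text.strip() for th in table.find("thead").find_all("th")]
    players = []
    for tr in table.find("tbody").find_all("tr"):
        cells = [td.text.strip() for td in tr.find_all("td")]
        players.append(dict(zip(headers, cells)))
    return players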

Answer 2 (score: 0)

Why find all divs? If it is only the Player names you want to extract, you can use this css selector:

td.player a
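
A minimal sketch of how that selector could be applied to the Selenium-rendered page source from the answer above (td.player is taken from this answer's selector, not verified here):

# Print only the player names matched by the css selector above.
# Assumes `soup` was built from the rendered page, as in get_Player.
for a in soup.select('td.player a'):
    print(a.text.strip())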