I'm fairly new to Python and can't for the life of me figure out why the code below isn't extracting the elements I'm after.
My current code:
from urllib.request import urlopen
import bs4 as bs

player_pbp_data = []

for player in all_players:  # all_players: list of "First Last" name strings, defined elsewhere
    player_first, player_last = player.split()
    player_first = player_first.lower()
    player_last = player_last.lower()
    first_name_letters = player_first[:2]
    last_name_letters = player_last[:5]
    # basketball-reference slug: /<first letter of last name>/<last 5 of last name><first 2 of first name>01
    player_url_code = '/{}/{}{}01'.format(last_name_letters[0], last_name_letters, first_name_letters)
    player_url = 'https://www.basketball-reference.com/players' + player_url_code + '.html'
    print(player_url)  # test
    req = urlopen(player_url)
    soup = bs.BeautifulSoup(req, 'lxml')
    wrapper = soup.find('div', id='all_advanced_pbp')
    table = wrapper.find('div', class_='table_outer_container')
    for td in table.find_all('td'):
        player_pbp_data.append(td.get_text())
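For a name like "Kareem Abdul-Jabbar", for example, the slug logic above gives first_name_letters = 'ka' and last_name_letters = 'abdul', so:

>>> '/{}/{}{}01'.format('abdul'[0], 'abdul', 'ka')
'/a/abdulka01'

and the final URL becomes https://www.basketball-reference.com/players/a/abdulka01.html.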
Currently returns:

--> for td in table.find_all('td'):
        player_pbp_data.append(td.get_text()) #if this works, would like to
AttributeError: 'NoneType' object has no attribute 'find_all'
Note: iterating over the children of the wrapper object returns:

<div class="table_outer_container">

as part of the tree.
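Since find('div', ...) only matches Tag nodes, it can help to print what those children actually are; a minimal diagnostic sketch (assuming wrapper was found as above):

for child in wrapper.children:
    # a child that prints like markup may really be a Comment or
    # NavigableString node, which find()/find_all() never match
    print(type(child), str(child)[:60])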
Thanks!
Answer 0 (score: 0)
Try explicitly passing the html:

bs.BeautifulSoup(the_html, 'html.parser')
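In context, that means pulling the html out of the response before parsing; a minimal sketch of this suggestion (reading and decoding the urlopen response is my assumption about where the_html comes from):

from urllib.request import urlopen
import bs4 as bs

req = urlopen(player_url)
the_html = req.read().decode('utf-8')  # assumption: read and decode the raw response bytes first
soup = bs.BeautifulSoup(the_html, 'html.parser')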
Answer 1 (score: 0)
Make sure table actually contains the data you expect. E.g. https://www.basketball-reference.com/players/a/abdulka01.html does not appear to contain a div with the id/class you are searching for.
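One way to check, before the inner loop ever runs, is to guard each find() result (a sketch along the lines of this answer, not the answerer's code):

wrapper = soup.find('div', id='all_advanced_pbp')
if wrapper is None:
    print('no div with id all_advanced_pbp on', player_url)
else:
    table = wrapper.find('div', class_='table_outer_container')
    if table is None:
        print('wrapper exists, but holds no div with class table_outer_container')
    else:
        for td in table.find_all('td'):
            player_pbp_data.append(td.get_text())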
Answer 2 (score: 0)
I tried to extract data from the URL you provided but did not get the full DOM. I then opened the page in a browser with JavaScript enabled and disabled, and found that the site needs JavaScript to load some of its data, although index pages like /players/ do not. A simple way to get that dynamic data is to use Selenium.
Here is my test code:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

player_pbp_data = []

def get_list(t="a"):
    # scrape the index page for one letter, mapping player name -> profile URL
    with requests.Session() as se:
        url = "https://www.basketball-reference.com/players/{}/".format(t)
        req = se.get(url)
        soup = BeautifulSoup(req.text, "lxml")
        with open("a.html", "wb") as f:
            f.write(req.text.encode())
        table = soup.find("div", class_="table_wrapper setup_long long")
        players = {player.a.text: "https://www.basketball-reference.com" + player.a["href"]
                   for player in table.find_all("th", class_="left ")}
        return players

def get_each_player(player_url="https://www.basketball-reference.com/players/a/abdulta01.html"):
    # let the browser execute the page's JavaScript, then parse the rendered source
    with webdriver.Chrome() as ph:
        ph.get(player_url)
        text = ph.page_source
    '''
    with requests.Session() as se:
        text = se.get(player_url).text
    '''
    soup = BeautifulSoup(text, 'lxml')
    try:
        wrapper = soup.find('div', id='all_advanced_pbp')
        table = wrapper.find('div', class_='table_outer_container')
        for td in table.find_all('td'):
            player_pbp_data.append(td.get_text())
    except Exception as e:
        print("This page does not contain pbp")

get_each_player()
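A possible way to wire the two functions together (my addition, not part of the answer; note get_list() only covers a single index letter):

# hypothetical usage: build the name -> URL map for the "a" index,
# then render each player page through the Chrome driver
for name, url in get_list("a").items():
    get_each_player(player_url=url)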