I am trying to scrape the links of all EPL players from whoscored.com (the URL is in the variable root below). Here is the code:
from bs4 import BeautifulSoup
from selenium import webdriver
root = "https://www.whoscored.com/Regions/252/Tournaments/2/Seasons/6335/Stages/13796/PlayerStatistics/England-Premier-League-2016-2017"
driver = webdriver.PhantomJS()
driver.get(root)
page = driver.page_source
soup = BeautifulSoup(page, "html.parser")
players = soup.find("div", {'id':'statistics-table-summary'})
print(players)
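If you open the page, you will see a list of players and a Next button that shows the next 10 players (284 players across 29 pages). The output I want: save the profile links of the ten players on each page, then move on to the next page and do the same for the next ten players, until all pages are done.
To do this I thought I would use soup.find_all('a', {'class': 'player-link'})
because the players' links and names are in containers like that, but it returns nothing. So I thought I would first find all the tables on the page, but that returns nothing either. Any thoughts on this?
Thanks in advance.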
Answer 0: (score: 2)
You need to wait for the table to load before reading .page_source:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# ...
driver.get(root)
# wait for at least one player to be present in the statistics table
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#statistics-table-summary .player-link")))
page = driver.page_source
driver.close()
# ...
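Once the wait has succeeded and page holds the rendered HTML, the player links can be pulled out of it. The following is only a minimal sketch of that step: it uses the standard library's html.parser in place of BeautifulSoup so that it runs without extra packages, it collects every a tag with the player-link class anywhere in the page rather than only inside #statistics-table-summary, and the sample markup, the href format, and the names PlayerLinkParser and extract_player_links are assumptions made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PlayerLinkParser(HTMLParser):
    """Collects the href of every <a> tag whose class list contains 'player-link'."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href")
        if "player-link" in classes and href:
            # Turn relative hrefs into absolute URLs
            self.links.append(urljoin(self.base_url, href))


def extract_player_links(page_source, base_url="https://www.whoscored.com"):
    """Return absolute URLs of all player-link anchors found in page_source."""
    parser = PlayerLinkParser(base_url)
    parser.feed(page_source)
    return parser.links


# Hypothetical markup standing in for driver.page_source (not copied from the real site)
sample = """
<div id="statistics-table-summary">
  <a class="player-link" href="/Players/1/Show/Example-One">Example One</a>
  <a class="team-link" href="/Teams/1/Show/Example-Team">Example Team</a>
  <a class="player-link iconize" href="/Players/2/Show/Example-Two">Example Two</a>
</div>
"""
links = extract_player_links(sample)
print(links)
```

With BeautifulSoup, soup.select("#statistics-table-summary a.player-link") should give the same anchors. To cover all pages, the same extraction would be repeated after clicking the Next button and waiting again for the table to refresh.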