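I'm trying to scrape multiple pages of a football website. All the links are in the list teamLinks. An example of one of the links is: "http://www.premierleague.com//clubs/1/Arsenal/squad?se=79". I'd like to know whether the requests call can be made to wait until the page has fully updated before it returns the content. If you click the link, the page first shows the 2018/2019 squad and then refreshes to the 2017/2018 squad, which is the one I want.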
import requests
from lxml import html

playerLink1 = []
playerLink2 = []
for teamLink in teamLinks:
    # Request the squad page.
    squadPage = requests.get(teamLink)
    squadTree = html.fromstring(squadPage.content)
    # Extract the player links.
    playerLocation = squadTree.cssselect('.playerOverviewCard')
    # For each player link within the team page.
    for player in playerLocation:
        # Save the link, complete with domain.
        overviewLink = "http://www.premierleague.com/" + player.attrib['href'] + '?se=79'
        playerLink1.append(overviewLink)
        # For the second link, change the page from player overview to stats.
        playerLink2.append(overviewLink.replace("overview", "stats"))
Answer 0: (score: 1)
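I found a solution. You have to use Selenium's webdriver in headless mode, read page_source from the driver, and give the page some time to load with time.sleep(). I checked the data and it shows up as expected.
I don't know your list of URLs, so create your own list and try it with the code below. Let me know if you need further help.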
import re
import time

from bs4 import BeautifulSoup
from selenium import webdriver

teamlinks = ['http://www.premierleague.com//clubs/1/Arsenal/squad?se=79',
             'http://www.premierleague.com//clubs/1/Arsenal/squad?se=54']
playerLink1 = []
playerLink2 = []
for teamlink in teamlinks:
    # Start a headless Chrome session for this team page.
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--window-size=1920,1080')
    driver = webdriver.Chrome(options=chrome_options)
    driver.get(teamlink)
    # Give the page's JavaScript time to load the requested season.
    time.sleep(10)
    squadPage = driver.page_source
    soup = BeautifulSoup(squadPage, 'html.parser')
    playerLocation = soup.find_all('a', class_=re.compile("playerOverviewCard"))
    for player in playerLocation:
        # Save the link, complete with domain.
        overviewLink = "http://www.premierleague.com/" + player['href'] + '?se=79'
        playerLink1.append(overviewLink)
        # For the second link, change the page from player overview to stats.
        playerLink2.append(overviewLink.replace("overview", "stats"))
    driver.quit()
print(playerLink2)
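A fixed time.sleep(10) either waits longer than needed or not long enough. As a rough sketch, the sleep could be replaced by a small polling loop that re-checks a condition until it holds or a timeout expires. The helper name wait_until and the check page_shows_wanted_season in the usage comment are hypothetical and not part of Selenium; Selenium's own WebDriverWait class provides the same idea ready-made.

```python
import time


def wait_until(predicate, timeout=10.0, poll_interval=0.5):
    """Call predicate() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(poll_interval)


# Hypothetical usage with the driver from the answer above.
# page_shows_wanted_season is a placeholder you would write yourself,
# for example by checking which season the squad page currently displays:
#
#     driver.get(teamlink)
#     wait_until(lambda: page_shows_wanted_season(driver), timeout=15)
#     squadPage = driver.page_source
```

With this approach the script moves on as soon as the condition is met, and a page that never updates raises an error instead of silently yielding the wrong season.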
Answer 1: (score: 1)
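The page you are trying to scrape uses JavaScript to load the player list you want.
Option 1: You could use a new module called requests-html (I have never tried it), which claims to support JavaScript.
Option 2: Using Chrome's devtools, I was able to find the actual XHR request the page makes to fetch the player list. The following code gets the required output with the requests module.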
import json

import requests

playerLink1 = []
playerLink2 = []
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36',
           'Origin': 'https://www.premierleague.com',
           'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
           'Referer': 'https://www.premierleague.com//clubs/1/Arsenal/squad?se=79'}
res = requests.get('https://footballapi.pulselive.com/football/teams/1/compseasons/79/staff?altIds=true&compCodeForActivePlayer=EN_PR', headers=headers)
player_data = json.loads(res.content.decode('utf-8'))
for player in player_data['players']:
    href = 'https://www.premierleague.com/players/{}/{}/'.format(player['id'], player['name']['display'].replace(' ', '-'))
    # href already contains the domain, so only the page name is appended.
    playerLink1.append(href + "overview" + '?se=79')
    playerLink2.append(href + "stats")
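To cover every club in the question's list, the same request could be parameterized. The sketch below assumes that the 1 in the API path is the club id from the club URL and that 79 is the season id from se=79; both are inferred from the single URL above, not from any API documentation. The function names are hypothetical, and the sample dictionary is made up to mimic only the two fields the code above reads (id and name.display). Unlike the code above, it appends ?se= to the stats link as well, matching the question's own code.

```python
API_TEMPLATE = (
    "https://footballapi.pulselive.com/football/teams/{team_id}"
    "/compseasons/{season_id}/staff?altIds=true&compCodeForActivePlayer=EN_PR"
)


def staff_api_url(team_id, season_id):
    """Build the XHR URL for one club and season (URL pattern is an assumption)."""
    return API_TEMPLATE.format(team_id=team_id, season_id=season_id)


def player_links(player_data, season_id):
    """Turn the decoded JSON into (overview, stats) link pairs."""
    links = []
    for player in player_data["players"]:
        slug = player["name"]["display"].replace(" ", "-")
        base = "https://www.premierleague.com/players/{}/{}/".format(player["id"], slug)
        suffix = "?se={}".format(season_id)
        links.append((base + "overview" + suffix, base + "stats" + suffix))
    return links


# Made-up sample shaped like the response fields used above.
sample = {"players": [{"id": 1234, "name": {"display": "Example Player"}}]}
print(staff_api_url(1, 79))
print(player_links(sample, 79))
```

Keeping the link building separate from the HTTP call means it can be checked against a small sample before any request is sent.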