I'm trying to use requests to get to the "next" (right-arrow) page of this URL:
https://www.sportstats.ca/display-results.xhtml?raceid=43572
When I page through manually in a browser, I inspected the request with the Chrome developer tools, and I've tried to assemble the same form data and POST it with requests, but the response I get back still shows the content of page 1. Any tips? I've also tried Selenium with mixed results, and I'd rather stick with lightweight requests if possible. Here is my attempt:
#!/usr/bin/env python
import requests
from bs4 import BeautifulSoup
url = 'https://www.sportstats.ca/display-results.xhtml?raceid=43572'
with requests.Session() as s:
    r1 = s.get(url)

    # Print the "Page X / Y" label from the initial page
    pagenum = [x for x in r1.text.splitlines() if '<p>Page' in x][0].strip()
    print(pagenum)

    # Collect the hidden JSF inputs (form id, ViewState, etc.)
    soup = BeautifulSoup(r1.text, 'html.parser')
    hidden_inputs = soup.find_all('input', {'type': 'hidden'})
    prepayload = {x['name']: x['value'] for x in hidden_inputs}

    # Reproduce the AJAX post the browser sends for the "next" arrow
    payload = {}
    payload['javax.faces.partial.ajax'] = 'true'
    payload['javax.faces.source'] = 'mainForm:j_idt386'
    payload['javax.faces.partial.execute'] = 'mainForm'
    payload['javax.faces.partial.render'] = 'mainForm:result_table mainForm:pageNav mainForm:eventAthleteDetailsDialog'
    payload['mainForm:j_idt386'] = 'mainForm:j_idt386'
    payload['mainForm'] = prepayload['mainForm']
    payload['mainForm:raceid'] = prepayload['mainForm:raceid']
    payload['mainForm:status'] = prepayload['mainForm:status']
    payload['mainForm:iframe'] = prepayload['mainForm:iframe']
    payload['mainForm:bib'] = ''
    payload['mainForm:lastname'] = ''
    payload['mainForm:city'] = ''
    payload['mainForm:firstname'] = ''
    payload['mainForm:province'] = ''
    payload['mainForm:categoryFilter'] = 'All Categories'
    payload['javax.faces.ViewState'] = prepayload['javax.faces.ViewState']

    # Post it and check which page came back
    r2 = s.post(url, data=payload)
    pagenum = [x for x in r2.text.splitlines() if '<p>Page' in x][0].strip()
    print(pagenum)
This returns:
[myname@myserver] $ ./sstest.py
<p>Page 1 / 19
<p>Page 1 / 19
Answer (score: 1):
The website you want to scrape is better handled with Selenium.
You just need to read the total number of pages when you load the site, then loop over that total, clicking the next button on each pass. In each iteration you can do whatever parsing the page calls for (a sketch of that step follows the code below). This way every page is parsed dynamically, however many pages the site reports.
Code:
#!/usr/bin/env python
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

# Initializations
driver = webdriver.Chrome()
url = 'https://www.sportstats.ca/display-results.xhtml?raceid=43572'
driver.get(url)
driver.maximize_window()

# Retrieve the total number of pages from the "Page X / Y" label
pages_label = driver.find_element(By.XPATH, '//*[@id="mainForm:pageNav"]/div/p')
pages = int(pages_label.text.split('/')[1].strip())
print(pages)

# Loop over every page
for i in range(1, pages + 1):
    print('page: ' + str(i))
    # Do your parsing here for every page, e.g. by handing
    # driver.page_source to BeautifulSoup
    time.sleep(5)  # crude wait for the AJAX update to settle
    if i < pages:  # no next page to click after the last one
        driver.find_element(By.XPATH, '//*[@id="mainForm:j_idt386"]').click()  # click the next button
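
As promised above, here is a sketch of the parsing step that would go inside the loop: it pulls the text of every row out of the results table. The mainForm:result_table id comes from the question's payload; the row and cell layout is an assumption and may need adjusting.

# Possible body for the "Do your parsing here" step (runs inside the loop).
soup = BeautifulSoup(driver.page_source, 'html.parser')
table = soup.find(id='mainForm:result_table')  # table id taken from the question's payload
if table is not None:
    for row in table.find_all('tr'):
        cells = [td.get_text(strip=True) for td in row.find_all('td')]
        if cells:
            print(cells)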
Output:
19
page: 1
page: 2
page: 3
page: 4
page: 5
page: 6
page: 7
page: 8
page: 9
page: 10
page: 11
page: 12
page: 13
page: 14
page: 15
page: 16
page: 17
page: 18
page: 19
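
A possible refinement to the loop above: replace the fixed time.sleep(5) with an explicit wait, so each click fires as soon as the next button is clickable again. Here is a sketch using Selenium's standard WebDriverWait (same assumption about the mainForm:j_idt386 button id as above):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)  # wait at most 10 seconds per step
for i in range(1, pages + 1):
    print('page: ' + str(i))
    # ... parse the current page here ...
    if i < pages:  # nothing to click after the last page
        next_btn = wait.until(EC.element_to_be_clickable(
            (By.XPATH, '//*[@id="mainForm:j_idt386"]')))
        next_btn.click()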