Python requests: POST to a complex form

Asked: 2017-09-25 01:30:36

Tags: python beautifulsoup python-requests

I'm trying to use requests to go to the "next" (right arrow) page of this URL:

https://www.sportstats.ca/display-results.xhtml?raceid=43572

When I page forward manually in the browser, I used the Chrome developer tools to inspect the request, and I tried to put the same form data together and POST it with requests, but the response I get back still shows the content of page 1. Any tips? I've also tried Selenium with mixed results; if possible I'd rather stick with lightweight requests. Here's my attempt:

#!/usr/bin/env python
import requests
from bs4 import BeautifulSoup

url = 'https://www.sportstats.ca/display-results.xhtml?raceid=43572'
with requests.Session() as s:
    r1 = s.get(url)
    pagenum = [x for x in r1.text.splitlines() if '<p>Page' in x][0].strip()
    print(pagenum)
    soup = BeautifulSoup(r1.text, 'html.parser')
    # Collect the hidden form fields (ViewState etc.) required for the JSF post
    hidden_inputs = soup.find_all('input', {'type': 'hidden'})
    prepayload = {x['name']: x['value'] for x in hidden_inputs}
    payload = {}
    payload['javax.faces.partial.ajax'] = 'true'
    payload['javax.faces.source'] = 'mainForm:j_idt386'
    payload['javax.faces.partial.execute'] = 'mainForm'
    payload['javax.faces.partial.render'] = 'mainForm:result_table mainForm:pageNav mainForm:eventAthleteDetailsDialog'
    payload['mainForm:j_idt386'] = 'mainForm:j_idt386'
    payload['mainForm'] = prepayload['mainForm']
    payload['mainForm:raceid'] = prepayload['mainForm:raceid']
    payload['mainForm:status'] = prepayload['mainForm:status']
    payload['mainForm:iframe'] = prepayload['mainForm:iframe']
    payload['mainForm:bib'] = ''
    payload['mainForm:lastname'] = ''
    payload['mainForm:city'] = ''
    payload['mainForm:firstname'] = ''
    payload['mainForm:province'] = ''
    payload['mainForm:categoryFilter'] = 'All Categories'
    payload['javax.faces.ViewState'] = prepayload['javax.faces.ViewState']
    r2 = s.post(url, data=payload)
    pagenum = [x for x in r2.text.splitlines() if '<p>Page' in x][0].strip()
    print(pagenum)

This returns:

[myname@myserver] $ ./sstest.py
<p>Page 1 / 19  
<p>Page 1 / 19
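One likely missing piece (an assumption, not verified against this particular site): JSF/PrimeFaces endpoints usually decide between a full page render and a partial AJAX render based on request headers, not only on the javax.faces.* form fields, which would explain why the POST above falls back to rendering page 1. A sketch of the headers such a request typically carries; note that a successful partial request comes back as an XML `<partial-response>` document rather than a full HTML page, so searching the reply for `<p>Page` may also need to change:

```python
# Hedged sketch: headers that JSF/PrimeFaces AJAX requests typically carry.
# Whether this site checks them is an assumption, not confirmed.
ajax_headers = {
    'Faces-Request': 'partial/ajax',       # marks the request as a JSF partial request
    'X-Requested-With': 'XMLHttpRequest',  # generic AJAX marker many servers check
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
}

# Usage inside the session above would be:
#   r2 = s.post(url, data=payload, headers=ajax_headers)
```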

1 Answer:

Answer 0 (score: 1):

The site you want to scrape is better suited to Selenium.

You only need to read the total number of pages when you first load the site, then loop over that total, clicking the "next" button on each iteration.

Inside each iteration you can do whatever parsing you need for that page, just as you normally would.

This way you parse every page dynamically, driven by the page count the site itself reports.

Code:

#!/usr/bin/env python
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

# Initializations
driver = webdriver.Chrome()
url = 'https://www.sportstats.ca/display-results.xhtml?raceid=43572'
driver.get(url)
driver.maximize_window()
bs = BeautifulSoup(driver.page_source, 'html.parser')

# Retrieve the total number of pages from the 'Page 1 / 19' pagination text
pages_parser = driver.find_element(By.XPATH, '//*[@id="mainForm:pageNav"]/div/p')
pages = int(pages_parser.text.split('/')[1].strip())
print(pages)

# Loop over every page
for i in range(1, pages + 1):
    print('page: ' + str(i))
    # Do your parsing here for every page
    time.sleep(5)
    # Click the "next" button
    driver.find_element(By.XPATH, '//*[@id="mainForm:j_idt386"]').click()
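The page-count extraction above can also be isolated into a small stdlib-only helper that works on any HTML snapshot, such as `driver.page_source`. The "Page X / N" text format is taken from the question's output; everything else here is an illustrative sketch:

```python
import re

def total_pages(html):
    """Extract the total page count N from 'Page X / N' pagination text."""
    m = re.search(r'Page\s+(\d+)\s*/\s*(\d+)', html)
    if m is None:
        raise ValueError('pagination text not found')
    return int(m.group(2))

# Works on a full page source or a small fragment:
print(total_pages('<p>Page 1 / 19'))  # -> 19
```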

Output:

19
page: 1
page: 2
page: 3
page: 4
page: 5
page: 6
page: 7
page: 8
page: 9
page: 10
page: 11
page: 12
page: 13
page: 14
page: 15
page: 16
page: 17
page: 18
page: 19