Can't fetch results from the next pages using post requests

Date: 2018-11-25 16:54:09

Tags: python python-3.x post web-scraping beautifulsoup

I've written a script in Python to populate tabular data after filling in the two input boxes (From and Through) located at the top right of a webpage. The dates I filled in to generate the results are 08/28/2017 and 11/25/2018.

When I run the following script, I can get the tabular results from the first page.

However, the data is spread across multiple pages via pagination, and the URL stays the same. How can I get the content of the next pages?

Url to the site

Here is my attempt:

import requests
from bs4 import BeautifulSoup

url = "https://www.myfloridalicense.com/FLABTBeerPricePosting/"

res = requests.get(url)
soup = BeautifulSoup(res.text, "lxml")

try:
    evtrgt = soup.select_one("#__EVENTTARGET").get('value')
except AttributeError:
    evtrgt = ""

viewstate = soup.select_one("#__VIEWSTATE").get('value')
viewgen = soup.select_one("#__VIEWSTATEGENERATOR").get('value')
eventval = soup.select_one("#__EVENTVALIDATION").get('value')

payload = {
    '__EVENTTARGET': evtrgt,
    '__EVENTARGUMENT': '',
    '__VIEWSTATE': viewstate,
    '__VIEWSTATEGENERATOR': viewgen,
    '__VIEWSTATEENCRYPTED': '',
    '__EVENTVALIDATION': eventval,
    'ctl00$MainContent$txtPermitNo': '',
    'ctl00$MainContent$txtPermitName': '',
    'ctl00$MainContent$txtBrandName': '',
    'ctl00$MainContent$txtPeriodBeginDt': '08/28/2017',
    'ctl00$MainContent$txtPeriodEndingDt': '11/25/2018',
    'ctl00$MainContent$btnSearch': 'Search'
}

with requests.Session() as s:
    s.headers["User-Agent"] = "Mozilla/5.0"
    req = s.post(url, data=payload, cookies=res.cookies.get_dict())
    sauce = BeautifulSoup(req.text, "lxml")
    for items in sauce.select("#MainContent_gvBRCSummary tr"):
        data = [item.get_text(strip=True) for item in items.select("th,td")]
        print(data)

Any help to solve this would be highly appreciated. To repeat: the data I wish to grab is the tabular content from the site's next pages, since my script can already parse the data from the first page.

P.S.: Browser simulator is not an option I would like to cope with.

1 Answer:

Answer 0 (score: 1):

You need to add a loop over the pages and assign the requested page number to the __EVENTARGUMENT parameter, like below:

import requests
from bs4 import BeautifulSoup

url = "https://www.myfloridalicense.com/FLABTBeerPricePosting/"

res = requests.get(url)
soup = BeautifulSoup(res.text,"lxml")

try:
    evtrgt = soup.select_one("#__EVENTTARGET").get('value')
except AttributeError: 
    evtrgt = ""

viewstate = soup.select_one("#__VIEWSTATE").get('value')
viewgen = soup.select_one("#__VIEWSTATEGENERATOR").get('value')
eventval = soup.select_one("#__EVENTVALIDATION").get('value')

payload = {
    '__EVENTTARGET' : evtrgt,
    '__EVENTARGUMENT' : '',
    '__VIEWSTATE' : viewstate, 
    '__VIEWSTATEGENERATOR' : viewgen,
    '__VIEWSTATEENCRYPTED' : '',
    '__EVENTVALIDATION' : eventval,
    'ctl00$MainContent$txtPermitNo' : '', 
    'ctl00$MainContent$txtPermitName' : '',
    'ctl00$MainContent$txtBrandName' : '', 
    'ctl00$MainContent$txtPeriodBeginDt' : '08/28/2017',
    'ctl00$MainContent$txtPeriodEndingDt' : '11/25/2018',
    'ctl00$MainContent$btnSearch': 'Search'
}

with requests.Session() as s:
    s.headers["User-Agent"] = "Mozilla/5.0"

    for page in range(1, 12):
        # Request the given page of the gridview via the WebForms postback argument
        payload['__EVENTARGUMENT'] = f'Page${page}'
        req = s.post(url, data=payload, cookies=res.cookies.get_dict())
        sauce = BeautifulSoup(req.text, "lxml")

        for items in sauce.select("#MainContent_gvBRCSummary tr"):
            data = [item.get_text(strip=True) for item in items.select("th,td")]
            print(data)
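One caveat: on many ASP.NET WebForms pages the hidden __VIEWSTATE and __EVENTVALIDATION values change with every response, so reusing the values captured from the initial GET can stop working after the first postback. A minimal sketch of a helper that re-reads the standard hidden fields from each response so they can be merged into the payload before the next POST (whether this particular site actually rotates these values is an assumption; the field names follow the usual WebForms convention):

```python
from bs4 import BeautifulSoup

# The standard ASP.NET WebForms hidden inputs carried between postbacks.
ASPNET_FIELDS = ("__EVENTTARGET", "__VIEWSTATE",
                 "__VIEWSTATEGENERATOR", "__EVENTVALIDATION")

def hidden_fields(html):
    """Extract the ASP.NET hidden form fields from a page's HTML.

    Returns a dict of name -> value; fields absent from the page are
    simply left out, so the result can be merged into an existing
    payload with dict.update().
    """
    # "html.parser" avoids the lxml dependency; the scripts above use "lxml"
    soup = BeautifulSoup(html, "html.parser")
    fields = {}
    for name in ASPNET_FIELDS:
        node = soup.select_one(f"input[name='{name}']")
        if node is not None:
            fields[name] = node.get("value", "")
    return fields
```

Inside the paging loop, `payload.update(hidden_fields(req.text))` after each POST would keep the hidden fields in sync with the server's latest response.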