Scraping results that span multiple pages on an .aspx website

Time: 2019-07-31 19:29:48

Tags: python python-3.x web-scraping python-requests

I am trying to scrape this website:

website address

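If I search for A manually, the results are spread across multiple pages, but when I try to fetch them with the script below, I keep getting the results from the first page:

What I tried: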

import requests
from bs4 import BeautifulSoup

url = 'http://www.occeweb.com/MOEAsearch/index.aspx'

session = requests.Session()
r = session.get(url)
soup = BeautifulSoup(r.text,'lxml')
for page in range(1,3):
    payload = {i['name']:i.get('value','') for i in soup.select('input[name]')}
    payload['txtSearch'] = 'A'
    payload['__EVENTTARGET'] = 'gvResults'
    payload['__EVENTARGUMENT'] = f'Page${page}'
    res = session.post(url,data=payload)
    soup = BeautifulSoup(res.text,"lxml")
    for items in soup.select("#gvResults tr")[1:2]:
        data = [item.get_text(strip=True) for item in items.select("td")]
        print(data)

How can I get the results from the other pages as well?

1 Answer:

Answer 0: (score: 2)

Your problem is in the line below:

payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}

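This comprehension copies every named input on the page, including the submit button btnSearch. When you request the second page, that extra btnSearch field is sent along with the paging fields, so the server treats the request as a new search rather than a next-page action.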
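To see the effect without hitting the site, here is a minimal offline sketch. The HTML below is a hypothetical, simplified stand-in for the real form (only the field names btnSearch, txtSearch, __EVENTTARGET and __EVENTARGUMENT come from the question; the rest is assumed), and it uses the built-in html.parser instead of lxml so it runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified form markup; the real page will contain more fields.
SAMPLE_FORM = """
<form method="post" action="index.aspx">
  <input type="hidden" name="__VIEWSTATE" value="abc123">
  <input type="hidden" name="__EVENTTARGET" value="">
  <input type="hidden" name="__EVENTARGUMENT" value="">
  <input type="text" name="txtSearch" value="">
  <input type="submit" name="btnSearch" value="Search">
</form>
"""

soup = BeautifulSoup(SAMPLE_FORM, "html.parser")

# Same comprehension as in the question: it collects every named <input>,
# so the submit button ends up in the payload too.
payload = {i["name"]: i.get("value", "") for i in soup.select("input[name]")}

has_button = "btnSearch" in payload
print(sorted(payload))
print("btnSearch included:", has_button)
```

A browser only submits the button that was actually clicked, which is why clicking a page link in the grid does not send btnSearch.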

The fix is simple; here is the updated code:

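import requests
from bs4 import BeautifulSoup

url = 'http://www.occeweb.com/MOEAsearch/index.aspx'

session = requests.Session()
r = session.get(url)
soup = BeautifulSoup(r.text, 'lxml')
for page in range(1, 3):
    # Rebuild the form payload from the most recent response so the
    # hidden ASP.NET fields stay current.
    payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    payload['txtSearch'] = 'A'
    payload['__EVENTTARGET'] = 'gvResults'
    payload['__EVENTARGUMENT'] = f'Page${page}'
    if page > 1:
        # Drop the submit button so the request is a paging action,
        # not a new search; the default avoids a KeyError if it is absent.
        payload.pop('btnSearch', None)
    res = session.post(url, data=payload)
    soup = BeautifulSoup(res.text, 'lxml')
    # [1:2] prints only the first row after the header on each page;
    # widen the slice to [1:] to print every row.
    for items in soup.select('#gvResults tr')[1:2]:
        data = [item.get_text(strip=True) for item in items.select('td')]
        print(data)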
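If the payload logic is needed in more than one place, it can be pulled into a small helper. This is only a sketch: build_payload and its parameter names are hypothetical, and it assumes, as the code above does, that the first POST should keep btnSearch and later ones should not. The parser is html.parser here only to keep the example dependency-free.

```python
from bs4 import BeautifulSoup


def build_payload(soup, search_term, page, submit_name="btnSearch"):
    """Build the POST data for one results page (hypothetical helper)."""
    # Copy every named <input> from the latest response.
    payload = {i["name"]: i.get("value", "") for i in soup.select("input[name]")}
    payload["txtSearch"] = search_term
    payload["__EVENTTARGET"] = "gvResults"
    payload["__EVENTARGUMENT"] = f"Page${page}"
    if page > 1:
        # Remove the submit button so the server sees a paging request.
        payload.pop(submit_name, None)
    return payload


# Offline check against a hypothetical, simplified form.
sample = BeautifulSoup(
    '<form><input type="hidden" name="__VIEWSTATE" value="x">'
    '<input type="text" name="txtSearch">'
    '<input type="submit" name="btnSearch" value="Search"></form>',
    "html.parser",
)
first = build_payload(sample, "A", 1)
second = build_payload(sample, "A", 2)
print("btnSearch" in first, "btnSearch" in second)
```

In the loop above, the call would be res = session.post(url, data=build_payload(soup, 'A', page)).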