我正在尝试抓取该网站:
如果我手动搜索A
,我会看到结果分散在多个页面上,但是当我尝试使用下面的脚本来获取结果时,我会从第一页重复获取结果:
我尝试过:
import requests
from bs4 import BeautifulSoup
url = 'http://www.occeweb.com/MOEAsearch/index.aspx'
session = requests.Session()
r = session.get(url)
soup = BeautifulSoup(r.text,'lxml')
for page in range(1,3):
payload = {i['name']:i.get('value','') for i in soup.select('input[name]')}
payload['txtSearch'] = 'A'
payload['__EVENTTARGET'] = 'gvResults'
payload['__EVENTARGUMENT'] = f'Page${page}'
res = session.post(url,data=payload)
soup = BeautifulSoup(res.text,"lxml")
for items in soup.select("#gvResults tr")[1:2]:
data = [item.get_text(strip=True) for item in items.select("td")]
print(data)
我如何也可以从其他页面获得结果?
答案 0 :(得分:2)
您的问题发生在下面的行
payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
当您进行第二页搜索时,它将发送额外的有效负载btnSearch
,这将使其变为搜索操作而不是下一页操作
修复很简单,下面是更新的代码
import requests
from bs4 import BeautifulSoup
url = 'http://www.occeweb.com/MOEAsearch/index.aspx'
session = requests.Session()
r = session.get(url)
soup = BeautifulSoup(r.text,'lxml')
for page in range(1,3):
payload = {i['name']:i.get('value','') for i in soup.select('input[name]')}
payload['txtSearch'] = 'A'
payload['__EVENTTARGET'] = 'gvResults'
payload['__EVENTARGUMENT'] = f'Page${page}'
if page > 1:
payload.pop('btnSearch')
res = session.post(url,data=payload)
soup = BeautifulSoup(res.text,"lxml")
for items in soup.select("#gvResults tr")[1:2]:
data = [item.get_text(strip=True) for item in items.select("td")]
print(data)