I've created a script in Python to scrape the links to different items spread across multiple pages of a website. A GET request is enough to parse the links from the landing page, so that's what I used for the first page. However, fetching the links from the following pages requires a POST request with the appropriate parameters, which I've done as well.
The script can currently parse the links up to page 11. The trouble starts from page 12 onward: the script no longer works. I've tried with different pages, such as 20, 50, 100 and 150, with no result.
This is what I've tried:
import time
import requests
from bs4 import BeautifulSoup

res_url = 'https://www.brcdirectory.com/InternalSite//Siteresults.aspx?'

params = {
    'CountryId': '0',
    'CategoryId': '49bd499b-bc70-4cac-9a29-0bd1f5422f6f',
    'StandardId': '972f3b26-5fbd-4f2c-9159-9a50a15a9dde'
}

with requests.Session() as s:
    page = 11
    while True:
        print("**"*5,"trying with page:",page)
        req = s.get(res_url,params=params)
        soup = BeautifulSoup(req.text,"lxml")
        if page==1:
            for item_link in soup.select("h4 > a.colorBlue[href]"):
                print(item_link.get("href"))
        else:
            payload = {i['name']:i.get('value') for i in soup.select('input[name]')}
            payload['__EVENTTARGET'] = 'ctl00$ContentPlaceHolder1$gv_Results'
            payload['__EVENTARGUMENT'] = f"{'Page$'}{page}"
            payload['ctl00$ContentPlaceHolder1$ddl_SortValue'] = 'SiteName'
            res = s.post(res_url,params=params,data=payload)
            sauce = BeautifulSoup(res.text,"lxml")
            if not sauce.select("h4 > a.colorBlue[href]"):break
            for elem_link in sauce.select("h4 > a.colorBlue[href]"):
                print(elem_link.get("href"))
        page+=1
        time.sleep(3)
How can I scrape the links beyond page 11 using requests?
Answer 0 (score: 2)
I think your scraping logic is correct, but in your loop you perform a GET followed by a POST on every iteration, whereas you should perform the GET only on the first iteration and then a POST for each subsequent one (assuming one iteration = one page).
An example:
import requests
from bs4 import BeautifulSoup

res_url = 'https://www.brcdirectory.com/InternalSite//Siteresults.aspx?'

params = {
    'CountryId': '0',
    'CategoryId': '49bd499b-bc70-4cac-9a29-0bd1f5422f6f',
    'StandardId': '972f3b26-5fbd-4f2c-9159-9a50a15a9dde'
}

max_page = 20

def extract(page, soup):
    for item_link in soup.select("h4 a.colorBlue"):
        print("for page {} - {}".format(page, item_link.get("href")))

def build_payload(page, soup):
    # Collect the hidden ASP.NET form fields (__VIEWSTATE etc.) from the
    # previous response, then set the postback target/argument for paging.
    payload = {}
    for input_item in soup.select("input[name]"):
        payload[input_item["name"]] = input_item.get("value", "")
    payload["__EVENTTARGET"] = "ctl00$ContentPlaceHolder1$gv_Results"
    payload["__EVENTARGUMENT"] = "Page${}".format(page)
    payload["ctl00$ContentPlaceHolder1$ddl_SortValue"] = "SiteName"
    return payload

with requests.Session() as s:
    for page in range(1, max_page + 1):
        if page > 1:
            # Subsequent pages: POST the postback payload built from the
            # form fields of the previous page.
            req = s.post(res_url, params=params, data=build_payload(page, soup))
        else:
            # First page: a plain GET is enough.
            req = s.get(res_url, params=params)
        soup = BeautifulSoup(req.text, "lxml")
        extract(page, soup)
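If the total number of pages isn't known up front, you can reuse the stop condition from the question (break as soon as a page yields no result links) instead of hard-coding max_page. This is a minimal sketch, assuming res_url, params, extract and build_payload are defined as above; the selector and the 3-second delay come from the question's own code.

import time

with requests.Session() as s:
    page = 1
    while True:
        if page > 1:
            req = s.post(res_url, params=params, data=build_payload(page, soup))
        else:
            req = s.get(res_url, params=params)
        soup = BeautifulSoup(req.text, "lxml")
        if not soup.select("h4 a.colorBlue"):
            break  # no result links on this page: we've run out of pages
        extract(page, soup)
        page += 1
        time.sleep(3)  # be polite between requests, as in the question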