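I have been struggling with this problem all day. I need to scrape a piece of data from a website that has a button you must click before the data is shown. The button itself calls the well-known __doPostBack() JavaScript function that ASP.NET sites use: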
<a id="ContentPlaceHolder1_lbCoach" class="btn btn-dark-blue" href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$lbCoach','')"><i class="fa fa-eye"></i> Display HS Coach Info</a>
As this answer suggests, I should mimic the POST request the button sends in order to get the data back, so I did the following:
VIEWSTATE = soup.find('input',{'id':'__VIEWSTATE'}).get('value')
EVENTVALIDATION = soup.find('input',{'id':'__EVENTVALIDATION'}).get('value')
headers = {'Cache-Control': 'no-cache',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Language': 'en-US,en;q=0.5',
           'X-Requested-With': 'XMLHttpRequest',
           'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
           'Referer': contact_url,
           'X-MicrosoftAjax': 'Delta=true'}
payload = {"ctl00$ToolkitScriptManager2": "ctl00$ContentPlaceHolder1$updCoach|ctl00$ContentPlaceHolder1$lbCoach",
           "ToolkitScriptManager2_HiddenField": "",
           "ctl00$Header1$Menu1$txtSearchBox": "",
           "ctl00$Header1$Menu1$txtSearchBox2": "",
           "__EVENTTARGET": "ctl00$ContentPlaceHolder1$lbDisplayContact",
           "__EVENTARGUMENT": "",
           "__VIEWSTATE": VIEWSTATE,
           "__SCROLLPOSITIONX": "0",
           "__SCROLLPOSITIONY": "0",
           "__EVENTVALIDATION": EVENTVALIDATION,
           "__ASYNCPOST": "true",
           }
r = s.post(contact_url,headers = headers, data=payload)
page_content = r.content.decode()
soup = BeautifulSoup(page_content, "html.parser")
The request seems to go through fine, but the response I get back contains nothing useful:
b'1|#||4|40|updatePanel|ContentPlaceHolder1_Bio1_udpAdminMenu|\r\n \r\n |0|hiddenField|__EVENTTARGET||0|hiddenField|__EVENTARGUMENT||16992|hiddenField|__VIEWSTATE||1|hiddenField|__SCROLLPOSITIONX|0|1|hiddenField|__SCROLLPOSITIONY|0|292|hiddenField|__EVENTVALIDATION|/wEdAAxsD18kXuyPL5ofgcnYES9y+7zziCikaDB50o6O1pxxXbDWcw39S27yDoDwzfIvSl/82S52cVbB2NeFUXKE4Mx+O+TegoiNwQAdWnT22jPmzI4v73G0IN877PxHm4GlN3cV9hFWoAb20O4Q+9Ls96AskeglIWLjtf4N+HDDRWBUXzFl5Dm8D+CLbHmC0vzJAV2dMNOfX5+XKgQp7nrLXr1R1UFtN09quhqZEMqLAngnkseO4VALrQwmvGPQfIrd43K9AvIrswshyn58y8V7WKC8hka6Yg==|0|asyncPostBackControlIDs|||0|postBackControlIDs|||285|updatePanelIDs||tctl00$ContentPlaceHolder1$Bio1$udpAdminMenu,ContentPlaceHolder1_Bio1_udpAdminMenu,tctl00$ContentPlaceHolder1$udpAddress,ContentPlaceHolder1_udpAddress,tctl00$ContentPlaceHolder1$updCoach,ContentPlaceHolder1_updCoach,tctl00$ContentPlaceHolder1$updDetails,ContentPlaceHolder1_updDetails|0|childUpdatePanelIDs|||81|panelsToRefreshIDs||ctl00$ContentPlaceHolder1$Bio1$udpAdminMenu,ContentPlaceHolder1_Bio1_udpAdminMenu|2|asyncPostBackTimeout||90|48|formAction||./PlayerProfile_ContactInfo.aspx?ID=J34665D097ED|'
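When I compare the two in Fiddler, the request and response produced by clicking the real button and the ones produced by my code look identical.
The most interesting part: the same request viewed through Chrome Dev Tools renders normally, and in place of the \r\n \r\n that appears in the response above, you can see the entire HTML with all the other data.
Is it possible that I am actually getting the data but just don't know how to render it?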
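For what it's worth, the body shown above looks like the pipe-delimited partial-rendering ("delta") reply that ASP.NET AJAX sends when `X-MicrosoftAjax: Delta=true` is set. Below is a rough sketch for splitting such a body into records, assuming each record is laid out as `length|type|id|content|` with the length counted in characters. The function name `parse_delta` is made up for illustration and is not part of any library.

```python
def parse_delta(text):
    """Split an ASP.NET AJAX delta body into (type, id, content) tuples.

    Assumes every record looks like: length|type|id|content|
    where `length` is the number of characters in `content`.
    """
    records = []
    pos = 0
    while pos < len(text):
        # Field 1: length of the content field
        sep = text.index("|", pos)
        length = int(text[pos:sep])
        pos = sep + 1
        # Field 2: record type (updatePanel, hiddenField, ...)
        sep = text.index("|", pos)
        kind = text[pos:sep]
        pos = sep + 1
        # Field 3: control or field id
        sep = text.index("|", pos)
        ident = text[pos:sep]
        pos = sep + 1
        # Field 4: content, read by length because it may itself contain '|'
        content = text[pos:pos + length]
        pos += length + 1  # also skip the '|' that closes the record
        records.append((kind, ident, content))
    return records


def panel_html(text):
    """Return {panel_id: html} for every updatePanel record in the body."""
    return {ident: content
            for kind, ident, content in parse_delta(text)
            if kind == "updatePanel"}
```

With the decoded response in `page_content`, `panel_html(page_content)` would give the HTML of each refreshed panel, which could then be handed to BeautifulSoup. In the response quoted above, `panelsToRefreshIDs` names only `udpAdminMenu`, and the single `updatePanel` record is that same panel, so the coach panel (`updCoach`) does not seem to be part of this particular reply. It may also be worth checking that `__EVENTTARGET` in the payload (`lbDisplayContact`) matches the control named in the link's `__doPostBack` call (`lbCoach`).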
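Answer 0 (score: 0)
To scrape this type of page, press F12, go to the Network tab, and then click one of the page links so that the page changes; you will see all the requests that are sent.
The first request is probably the one that changes the page.
Click it. You have to take all the available fields from the page and submit them along with your Python request in order to change the page.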
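As a minimal sketch of "take all the available fields", the snippet below gathers every hidden input on a page and then sets the postback target. It uses the standard-library `html.parser` rather than BeautifulSoup only so that it has no dependencies. The names `HiddenFieldCollector`, `collect_hidden_fields` and `build_postback` are made up for this illustration, and which fields a given site really requires is something you would still confirm in the Network tab.

```python
from html.parser import HTMLParser


class HiddenFieldCollector(HTMLParser):
    """Collect name -> value for every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        is_hidden = (attr.get("type") or "").lower() == "hidden"
        if is_hidden and attr.get("name"):
            # A missing value attribute is posted as an empty string
            self.fields[attr["name"]] = attr.get("value") or ""


def collect_hidden_fields(html):
    """Return all hidden form fields (e.g. __VIEWSTATE) found in `html`."""
    parser = HiddenFieldCollector()
    parser.feed(html)
    return parser.fields


def build_postback(html, event_target, event_argument=""):
    """Build a POST payload: every hidden field plus the postback target."""
    data = collect_hidden_fields(html)
    data["__EVENTTARGET"] = event_target
    data["__EVENTARGUMENT"] = event_argument
    return data
```

The resulting dictionary can be passed as `data=` to `requests.post`, so fields such as `__VIEWSTATE` or `__VIEWSTATEGENERATOR` do not have to be looked up one by one.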
I followed the steps shown in the screenshot below.
Python code:
import requests
from bs4 import BeautifulSoup

# Load the first page once to read the hidden ASP.NET state fields
firstPage = requests.get('https://Site/Your-Page').text
soup = BeautifulSoup(firstPage, 'html.parser')
VIEWSTATEGENERATOR = soup.find('input', {'id': '__VIEWSTATEGENERATOR'}).get('value')
VIEWSTATE = soup.find('input', {'id': '__VIEWSTATE'}).get('value')

for page in range(1, 5):  # read pages one to four
    data = {
        # Pager link to "click": rptPager$ctl01$lnkPage for page 1, and so on
        "__EVENTTARGET": "rptPager$ctl0{}$lnkPage".format(page),
        "__EVENTARGUMENT": "",
        "__LASTFOCUS": "",
        "__VIEWSTATE": VIEWSTATE,
        "ddlPageSize": 24,
        "__VIEWSTATEGENERATOR": VIEWSTATEGENERATOR
    }
    res = requests.post('https://Site/Your-Page', data=data).content.decode('utf8')
    print(res)
This code can only fetch the first few pages. To reach the other pages, you need to add another parameter to the Python request:
___________________________________________
if parameter = 1
page number:       01 02 03 04 05 06 07 08 09
                   ↑  ↑  ↑  ↑  ↑  ↑  ↑  ↑  ↑    "refers to"
pagination:   prev 1  2  3  4  5  6  7  8  9  next
-------------------------------------------
if parameter = 2
page number:       10 11 12 13 14 15 16 17 18
                   ↑  ↑  ↑  ↑  ↑  ↑  ↑  ↑  ↑
pagination:   prev 1  2  3  4  5  6  7  8  9  next
-------------------------------------------
if parameter = 3
page number:       19 20 21 22 23 24 25 26 27
                   ↑  ↑  ↑  ↑  ↑  ↑  ↑  ↑  ↑
pagination:   prev 1  2  3  4  5  6  7  8  9  next
___________________________________________
The parameter above may be inside "__VIEWSTATE" or ...
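The arithmetic behind the diagram can be sketched as follows, assuming the pager always shows nine links per block as drawn above. The function names `locate_page` and `event_target_for` are made up for illustration, and the sketch only computes which block and which pager link a page falls on; where the site actually stores the block "parameter" still has to be found in the real requests.

```python
def locate_page(page, pager_size=9):
    """Map an absolute page number to (parameter, slot).

    parameter: which block of `pager_size` pages is currently shown (1-based)
    slot:      which pager link inside that block refers to the page (1-based)
    """
    if page < 1:
        raise ValueError("page numbers start at 1")
    index = page - 1
    parameter = index // pager_size + 1
    slot = index % pager_size + 1
    return parameter, slot


def event_target_for(slot):
    """Build the __EVENTTARGET value for a pager slot, following the
    rptPager$ctl0N$lnkPage pattern used in the code above."""
    return "rptPager$ctl{:02d}$lnkPage".format(slot)
```

For example, `locate_page(10)` gives `(2, 1)`: page 10 is the first link once the pager has moved to its second block, so the `__EVENTTARGET` would again be `rptPager$ctl01$lnkPage`.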