Problem scraping a website that requires authentication

Time: 2019-02-26 15:57:48

Tags: python web-scraping beautifulsoup python-requests

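After doing all of this, I can't figure out why the POST through my session does not log me in.

I tried to check whether I was missing any information from the form, such as a hidden token, but it seems they don't even have a form.

Can someone point me in the right direction? Thanks in advance.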

import requests
from bs4 import BeautifulSoup

username = 'myUserName'  # placeholder: replace with your own login
password = 'myPassword'  # placeholder: replace with your own password

scrape_url = 'https://ags.aspengrove.net/Property/PropertySummary.aspx?PropertyID=1366919'


login_url = 'https://ags.aspengrove.net/Library/Security/Login.aspx?ReturnUrl=%2fIndex.aspx'
login_info = {'ctl01$MainContent$tbxPerson': username,'ctl01$MainContent$tbxPassword': password}

#Start session.
session = requests.session()

#Login
r = session.post(url=login_url, data=login_info)

#Request page you want to scrape.
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36"
}

url = session.get(url=scrape_url,headers=headers)

soup = BeautifulSoup(url.content, 'html.parser')

print(r.status_code)


for td in soup.findAll('td'):
  print('\n\n\n')
  print('text: ' + str(td.text))

session.close()

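1 answer:

Answer 0 (score: 2)

The best way to start with a page like this is to simply submit the login form in a browser and look at the POST request it sends. In Chrome, the POST data appears to contain the following fields: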

ctl01_ctlScriptManager_HiddenField: 
__EVENTTARGET: ctl01$MainContent$btnLogin
__EVENTARGUMENT: 
__VIEWSTATE__: RU7PS8MwGKVZQ91AcitebHPYYQOVzR842M2hFx2MKl5L2nztwtJEk9S5v17TMvHyvsd773u8n4CcBTjJ85VWzmhpM/hshYGNtu6BlbtnOOR5HOC0dHI2H6+ZUF0SlBuX210GDTQFmDUQPLqc3y9mi+ubu1sSh8noSRjrXnQtFMYVkxaS0wwqMEaoesNq4DGqiAc06DH0GPb8BAWkN2OURO/CikJCzb0VoxR5Ev2RYf9yDHcdAel+wjf4dvji0a809KBbQ6FhQlLGuQFrKVOcfjBr99pwWoDU+yvOjyuC/550AF7GvTAk3UkirUopyh0+N+Bao+ikcOqVfUG+6uSJ2wo7nS75Lw==
__VIEWSTATE: 
__EVENTVALIDATION: Z+yHsUlIzPcsXdpj1bBqQkEDPqzzZPfBKwo/SI3nW5r4vyVU240IulzvcQOvQ5FLpkCLPwPUhdDRs0dGzhW3VQyWQjAjktxQ6FbmHS6dY0bEhbG6hkPAIxF3rEfHyQpnmuCflUGUC0HWxtr8LNx1oiUzOSrdrMhLuCLvWi01mvoc7vnsES6K97wbg1AUfun/Z2062CHFXbUcQYyr1KBLwVs13Y6FWr+e3Ruyb5EaftqQOSbtSRg8ZP1zE1aj05qY4tWBlG7hCIfl00xq6n6Zv0q6p9WrbkPdUv6/Gw==
ctl01$TimeOffset: 
ctl01$MainContent$hidPassExpression: /^.*(?=.*\d)(?=.*[a-z])(?=.*[@#$%^*!_=?:|,()-]).*$/
ctl01$MainContent$hidPassLength: 8
ctl01$MainContent$hidPassCode: 
ctl01$MainContent$tbxPerson: abcdef@efghij.com
ctl01$MainContent$tbxPassword: a@@@@@@@@1

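This is an ASP.NET page, so there is a lot of cruft. The proper way to do this is to look at the whole login page and match up the elements. A quick (but dirty) way to see what the fields look like is to have bs4 grab all the input tags.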

import bs4
import requests

headers = {"user-agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36"}

r = requests.get("https://ags.aspengrove.net/Library/Security/Login.aspx?ReturnUrl=%2fProperty%2fPropertySummary.aspx%3fPropertyID%3d1366919&PropertyID=1366919", headers=headers)
soup = bs4.BeautifulSoup(r.text, "html.parser")  # name the parser explicitly to avoid the bs4 warning
itags = soup.find_all(name="input")
for tag in itags:
    print(tag)

The result looks like this:

<input id="ctl01_ctlScriptManager_HiddenField" name="ctl01_ctlScriptManager_HiddenField" type="hidden" value=""/>
<input id="__EVENTTARGET" name="__EVENTTARGET" type="hidden" value=""/>
<input id="__EVENTARGUMENT" name="__EVENTARGUMENT" type="hidden" value=""/>
<input id="__VIEWSTATE__" name="__VIEWSTATE__" type="hidden" value="RU7PS8MwGKVZQ91AcitebHPYYQOVzR842M2hFx2MKl5L2nztwtJEk9S5v17TMvHyvsd773u8n4CcBTjJ85VWzmhpM/hshYGNtu6BlbtnOOR5HOC0dHI2H6+ZUF0SlBuX210GDTQFmDUQPLqc3y9mi+ubu1sSh8noSRjrXnQtFMYVkxaS0wwqMEaoesNq4DGqiAc06DH0GPb8BAWkN2OURO/CikJCzb0VoxR5Ev2RYf9yDHcdAel+wjf4dvji0a809KBbQ6FhQlLGuQFrKVOcfjBr99pwWoDU+yvOjyuC/550AF7GvTAk3UkirUopyh0+N+Bao+ikcOqVfUG+6uSJ2wo7nS75Lw=="/>
<input id="__VIEWSTATE" name="__VIEWSTATE" type="hidden" value=""/>
<input id="__EVENTVALIDATION" name="__EVENTVALIDATION" type="hidden" value="Nw7wmof2VXeD0/HsHnbqEV3JYs/jUm1FUFYbO2NxwJVUOXSdi+ulpjvZ501wLkSCJVkUlTOMNkaCw9d+fr74I9lkObY9N2zwbqbcEcac6af8hP5vblYExcMszLJNqOrAuNPqRUjsV91y5/PPekrgOuvM1O1ep5kvpzMfljrCLngSTNYbU9iEruOYL29RwQPz4+521uAjowigFf7fCEYTaqfuJZrML5WYNKW7eu7KxyxeEXpjG1K+Ufxxs7X1PTU3XoYw+qkUYp1RexvoCgdFlCkbZstCiOpU8PI5TA=="/>
<input id="ctl01_TimeOffset" name="ctl01$TimeOffset" type="hidden"/>
<input id="ctl01_MainContent_hidPassExpression" name="ctl01$MainContent$hidPassExpression" type="hidden" value="/^.*(?=.*\d)(?=.*[a-z])(?=.*[@#$%^*!_=?:|,()-]).*$/"/>
<input id="ctl01_MainContent_hidPassLength" name="ctl01$MainContent$hidPassLength" type="hidden" value="8"/>
<input id="ctl01_MainContent_hidPassCode" name="ctl01$MainContent$hidPassCode" type="hidden"/>
<input class="TextBox" id="ctl01_MainContent_tbxPerson" name="ctl01$MainContent$tbxPerson" size="50" type="text"/>
<input class="TextBox" id="ctl01_MainContent_tbxPassword" name="ctl01$MainContent$tbxPassword" size="30" type="password"/>
<input class="button" id="ctl01_MainContent_btnLogin" name="ctl01$MainContent$btnLogin" onclick="this.disabled=true; this.value = 'Logging In';__doPostBack('ctl01$MainContent$btnLogin','')" type="button" value="Login"/>
<input id="ctl01_MainContent_chkRememberMe" name="ctl01$MainContent$chkRememberMe" type="checkbox"/>

You can also get all the name fields by iterating over the list:

for tag in itags:
    print(tag["name"])
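To see at a glance which inputs the server pre-fills and which ones the user supplies, the names can be grouped by input type. This is only a minimal sketch: `names_by_type` is a hypothetical helper, not part of bs4, and it assumes that an input without a `type` attribute should be treated as a text field (the HTML default). It uses `tag.get` rather than `tag["name"]` so that an input without a name does not raise a `KeyError`.

```python
from collections import defaultdict

from bs4 import BeautifulSoup


def names_by_type(html):
    """Group the name of every <input> tag by its type attribute."""
    soup = BeautifulSoup(html, "html.parser")
    groups = defaultdict(list)
    for tag in soup.find_all("input"):
        name = tag.get("name")
        if name:  # inputs without a name are never submitted
            groups[tag.get("type", "text")].append(name)
    return dict(groups)
```

Calling `names_by_type(r.text)` on the login page fetched above would return a dict keyed by type (`hidden`, `text`, `password`, and so on), each mapping to the list of field names of that type.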

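The fields you most likely need to fill in yourself are the email/password and __EVENTTARGET, which is just the name of the submit button input.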

From there, you should be able to submit the correct POST data for the login.
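Putting the pieces together, a login could be sketched roughly as follows. The field names are copied from the input tags listed above; `build_login_payload` and `log_in` are hypothetical helpers, and `session` is assumed to be a `requests.Session` (anything with compatible `get` and `post` methods works). The __EVENTVALIDATION value differs between the two listings above, which suggests the tokens change per page load, so the sketch fetches the login page first instead of hard-coding them. Whether the server accepts exactly this payload has not been verified.

```python
from bs4 import BeautifulSoup

# Field names copied from the login page's <input> tags.
USER_FIELD = "ctl01$MainContent$tbxPerson"
PASSWORD_FIELD = "ctl01$MainContent$tbxPassword"
LOGIN_BUTTON = "ctl01$MainContent$btnLogin"


def build_login_payload(login_html, username, password):
    """Build the form data for the login POST from the login page HTML."""
    soup = BeautifulSoup(login_html, "html.parser")
    payload = {}
    for tag in soup.find_all("input"):
        name = tag.get("name")
        # The captured POST data contains neither the button nor the
        # unchecked "remember me" checkbox, so leave both out.
        if not name or tag.get("type") in ("button", "submit", "checkbox"):
            continue
        payload[name] = tag.get("value", "")
    # Mimic what __doPostBack does when the login button is clicked.
    payload["__EVENTTARGET"] = LOGIN_BUTTON
    payload[USER_FIELD] = username
    payload[PASSWORD_FIELD] = password
    return payload


def log_in(session, login_url, username, password, headers=None):
    """Fetch the login page for fresh tokens, then POST the credentials."""
    login_page = session.get(login_url, headers=headers)
    login_page.raise_for_status()
    payload = build_login_payload(login_page.text, username, password)
    response = session.post(login_url, data=payload, headers=headers)
    response.raise_for_status()
    return response
```

With a `requests.Session()` passed as `session`, the cookies set by the login response stay on that session, so a later `session.get(scrape_url, headers=headers)` is sent as the logged-in user if the login succeeded. A 200 status alone does not prove that; checking whether the returned page is still the login page is a more reliable sign.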
