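Using Python's `requests` library, I am trying to scrape an ASPX site (https://cei.bmfbovespa.com.br/CEI_Responsivo/home.aspx) that requires logging in first (https://cei.bmfbovespa.com.br/CEI_Responsivo/login.aspx).
Here are the steps I am trying to perform:
When I do this manually in Chrome, a successful login returns a 302 response and I am redirected to the home page. With Python, however, the POST returns a 200 response and I am still on the login page.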
import requests
from bs4 import BeautifulSoup
from requests.packages.urllib3 import add_stderr_logger
add_stderr_logger()
s = requests.Session()
url_login = 'https://cei.bmfbovespa.com.br/CEI_Responsivo/login.aspx'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.109 Safari/537.36',
    'Upgrade-Insecure-Requests': '1',
    'Host': 'cei.bmfbovespa.com.br',
    'Connection': 'keep-alive',
    'Accept-Language': 'pt-BR,pt;q=0.8,en-US;q=0.6,en;q=0.4',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
}
s.headers.update(headers)
r = s.get(url_login, verify=False)
soup = BeautifulSoup(r.content, 'html.parser')
viewstate = soup.find(id="__VIEWSTATE")['value']
viewgen = soup.find(id="__VIEWSTATEGENERATOR")['value']
eventvalid = soup.find(id="__EVENTVALIDATION")['value']
login_data = {
    '__VIEWSTATE': viewstate,
    '__VIEWSTATEGENERATOR': viewgen,
    '__EVENTVALIDATION': eventvalid,
    'ctl00$ContentPlaceHolder1$txtLogin': '*',
    'ctl00$ContentPlaceHolder1$txtSenha': '*',
    'tl00$ContentPlaceHolder1$btnLogar': 'Entrar'
}
resp = s.post(url_login, data=login_data, verify=False)
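One thing that could be checked at this point is whether the keys in `login_data` match the form's field names exactly. The submit-button key above starts with `tl00`, while the other control names start with `ctl00`. Whether that is what breaks the login on this site is not confirmed here. The sketch below builds the payload from the page's own `<input>` elements instead of typing the names by hand. It uses only the standard library's `html.parser` so that it runs offline; `FormFieldCollector`, `collect_form_fields` and `SAMPLE_LOGIN_PAGE` are hypothetical names, and the sample markup is an assumption, not the site's real HTML.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect the name/value pair of every <input> element in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <input ... />.
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name:
            # Inputs without a value attribute are stored as empty strings.
            self.fields[name] = attributes.get("value") or ""


def collect_form_fields(html):
    """Return a dict of all named <input> fields found in the HTML text."""
    collector = FormFieldCollector()
    collector.feed(html)
    return collector.fields


# Hypothetical stand-in for the login page; the real markup may differ.
SAMPLE_LOGIN_PAGE = """
<form method="post" action="login.aspx">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc" />
  <input type="text" name="ctl00$ContentPlaceHolder1$txtLogin" />
  <input type="submit" name="ctl00$ContentPlaceHolder1$btnLogar" value="Entrar" />
</form>
"""

fields = collect_form_fields(SAMPLE_LOGIN_PAGE)
# Only the credentials are filled in by hand; every other key comes from the page.
fields["ctl00$ContentPlaceHolder1$txtLogin"] = "*"
```

In the script above, the same idea would mean passing `r.text` to `collect_form_fields` and posting the resulting dict, so that hidden fields and the button name are copied from the server's response rather than retyped.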
If I then try a GET with the same session anyway, I am redirected to the login page:
url_carteira = 'https://cei.bmfbovespa.com.br/CEI_Responsivo/home.aspx'
response = s.get(url_carteira, verify=False)
This is the output I get:
2016-02-11 22:07:07,476 INFO Starting new HTTPS connection (1): cei.bmfbovespa.com.br
2016-02-11 22:07:07,823 DEBUG "GET /CEI_Responsivo/login.aspx HTTP/1.1" 200 4522
2016-02-11 22:07:07,898 DEBUG "POST /CEI_Responsivo/login.aspx HTTP/1.1" 200 4534
C:\Users\luciano\AppData\Local\Programs\Python\Python35\lib\site-packages\requests\packages\urllib3\connectionpool.py:791: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.org/en/latest/security.html
InsecureRequestWarning)
2016-02-11 22:07:10,470 DEBUG "GET /CEI_Responsivo/home.aspx HTTP/1.1" 302 147
2016-02-11 22:07:10,510 DEBUG "GET /CEI_Responsivo/login.aspx HTTP/1.1" 200 4522
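The log can be read as follows: the POST is answered directly with 200 (no redirect), and the later GET for home.aspx is bounced back to login.aspx with a 302. Because `requests` follows redirects by default, the same information is available on the response object through `resp.history` and `resp.url`. The helper below is a minimal sketch of that check; `describe_login_result` and the `login_path` parameter are hypothetical names, and the rule "at least one redirect and not ending on the login page" is an assumption about how this site signals success.

```python
def describe_login_result(resp, login_path="login.aspx"):
    """Summarise a response using its redirect history and final URL.

    `resp` is expected to look like a requests.Response: it needs a
    `history` list of earlier responses and a final `url`.
    """
    redirect_codes = [r.status_code for r in resp.history]
    still_on_login = resp.url.lower().endswith(login_path)
    return {
        "redirects": redirect_codes,
        "final_url": resp.url,
        # Assumption: a successful login redirects away from the login page.
        "logged_in": bool(redirect_codes) and not still_on_login,
    }
```

With the session from the script, `describe_login_result(resp)` would be called right after the POST. For the output shown above it would report no redirects and a final URL that still ends in login.aspx.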
I am using Python 3.5.1.
Any idea why I cannot log in successfully and reach the home page?