How to web scrape an ASPX page that requires authentication

Date: 2016-02-12 00:24:22

Tags: python asp.net python-3.x web-scraping python-requests

Using Python's `requests` library, I am trying to scrape an ASPX site (https://cei.bmfbovespa.com.br/CEI_Responsivo/home.aspx) that requires logging in first (https://cei.bmfbovespa.com.br/CEI_Responsivo/login.aspx).

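These are the steps I am trying to follow:

  1. Create a session with `requests.Session()` to handle cookies (is this the right way to do it?)
  2. Update the session headers with everything listed under "Request Headers" in Chrome DevTools (except the Cookie header, which the session should manage)
  3. GET the login page to obtain the input values needed for the POST
  4. POST the login form
  5. When I do this manually in Chrome, a successful login returns a 302 response and I am redirected to the home page. With Python, the POST returns a 200 response and I am still on the login page.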
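
To make the difference described in step 5 visible, one option is to turn off redirect following on the POST and look at the raw status code and `Location` header. The sketch below is only an illustration: the function name `classify_login_response` and its return labels are made up for this example, and it assumes a successful login answers with a redirect to some page other than the login page.

```python
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def classify_login_response(status_code, location, login_url):
    """Classify the raw response to a login POST (hypothetical helper).

    status_code: HTTP status of the POST response.
    location:    value of the Location header, or None if absent.
    login_url:   URL the form was posted to.
    Returns 'redirected', 'still_on_login' or 'unexpected'.
    """
    login_path = urlsplit(login_url).path.lower()
    if status_code in REDIRECT_CODES and location:
        # Location may be relative, so resolve it against the login URL.
        target_path = urlsplit(urljoin(login_url, location)).path.lower()
        # A redirect that points back at the login page is not a success.
        if target_path == login_path:
            return "still_on_login"
        return "redirected"
    if status_code == 200:
        # The server re-rendered the login form instead of redirecting.
        return "still_on_login"
    return "unexpected"
```

With `requests` this could be called as `resp = s.post(url_login, data=login_data, allow_redirects=False)` followed by `classify_login_response(resp.status_code, resp.headers.get('Location'), url_login)`.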

    import requests
    from bs4 import BeautifulSoup
    from requests.packages.urllib3 import add_stderr_logger
    
    add_stderr_logger()
    
    s = requests.Session()
    
    url_login = 'https://cei.bmfbovespa.com.br/CEI_Responsivo/login.aspx'
    
    headers = {
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.109 Safari/537.36',
        'Upgrade-Insecure-Requests':'1',
        'Host':'cei.bmfbovespa.com.br',
        'Connection':'keep-alive',
        'Accept-Language':'pt-BR,pt;q=0.8,en-US;q=0.6,en;q=0.4',
        'Accept-Encoding':'gzip, deflate, sdch',
        'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
    }
    s.headers.update(headers)
    
    r = s.get(url_login, verify=False)
    soup = BeautifulSoup(r.content, 'html.parser')  # name the parser explicitly
    
    viewstate = soup.find(id="__VIEWSTATE")['value']
    viewgen = soup.find(id="__VIEWSTATEGENERATOR")['value']
    eventvalid = soup.find(id="__EVENTVALIDATION")['value']
    
    login_data = {          
            '__VIEWSTATE' : viewstate,
            '__VIEWSTATEGENERATOR' : viewgen,
            '__EVENTVALIDATION' : eventvalid,
            'ctl00$ContentPlaceHolder1$txtLogin' : '*',
            'ctl00$ContentPlaceHolder1$txtSenha' : '*',
            'tl00$ContentPlaceHolder1$btnLogar': 'Entrar'
    }
    
    resp = s.post(url_login, data=login_data, verify=False)
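
ASP.NET WebForms pages often carry more hidden fields than the three copied by hand above, and the server may reject a POST whose field names do not match the form exactly. A minimal sketch of collecting every `<input>` from the login page follows. It uses the standard-library `HTMLParser` instead of BeautifulSoup so that it runs without extra packages; `FormFieldCollector` and `collect_form_fields` are hypothetical names, and the sketch assumes the page has a single form with a single submit button.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect the name/value pair of every <input> element in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <input ... />.
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name:
            # Inputs without a value attribute are stored as empty strings.
            self.fields[name] = attributes.get("value") or ""


def collect_form_fields(html):
    """Return a dict of all named inputs found in the HTML text."""
    parser = FormFieldCollector()
    parser.feed(html)
    parser.close()
    return parser.fields
```

A possible use is `login_data = collect_form_fields(r.text)`, then filling in the two credential fields before posting. Comparing the collected keys with the hand-written dictionary may also expose mismatches: in the code above the button key starts with `tl00`, while the two text-field keys start with `ctl00`.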
    

If I then try a GET with the same session, I am redirected to the login page:

    url_carteira = 'https://cei.bmfbovespa.com.br/CEI_Responsivo/home.aspx'
    response = s.get(url_carteira, verify=False)
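
The redirect can also be checked in code rather than only in the debug log. The sketch below summarises a redirect chain and reports whether a request ended on the login page. Both function names are hypothetical, and the default `login_path` is simply the path of the login URL used in this question.

```python
from urllib.parse import urlsplit


def describe_redirect_chain(hops):
    """Format a list of (status_code, url) tuples as one readable line."""
    return " -> ".join(
        "{} {}".format(status, urlsplit(url).path) for status, url in hops
    )


def was_bounced_to_login(requested_url, final_url,
                         login_path="/CEI_Responsivo/login.aspx"):
    """True when a request for another page ended on the login page."""
    requested_path = urlsplit(requested_url).path.lower()
    final_path = urlsplit(final_url).path.lower()
    return final_path == login_path.lower() and requested_path != final_path
```

With `requests`, the hops can be built from `response.history` plus the final response: `hops = [(h.status_code, h.url) for h in response.history] + [(response.status_code, response.url)]`. Printing `s.cookies.get_dict()` right after the POST shows which cookies the session actually holds; the cookie names are specific to the server, so none are assumed here.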
    

This is the output I get:

    2016-02-11 22:07:07,476 INFO Starting new HTTPS connection (1): cei.bmfbovespa.com.br
    2016-02-11 22:07:07,823 DEBUG "GET /CEI_Responsivo/login.aspx HTTP/1.1" 200 4522
    2016-02-11 22:07:07,898 DEBUG "POST /CEI_Responsivo/login.aspx HTTP/1.1" 200 4534
    C:\Users\luciano\AppData\Local\Programs\Python\Python35\lib\site-packages\requests\packages\urllib3\connectionpool.py:791: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.org/en/latest/security.html
      InsecureRequestWarning)
    2016-02-11 22:07:10,470 DEBUG "GET /CEI_Responsivo/home.aspx HTTP/1.1" 302 147
    2016-02-11 22:07:10,510 DEBUG "GET /CEI_Responsivo/login.aspx HTTP/1.1" 200 4522
    

I am using Python 3.5.1.

Does anyone know why I cannot log in successfully and reach the home page?

0 answers:

No answers yet