Logging in to a web page using BeautifulSoup and Mechanize

Time: 2015-04-09 12:07:48

Tags: python web-scraping beautifulsoup mechanize

I am trying to log in to a web page programmatically using BeautifulSoup and Mechanize.

Here is my code:

#import urllib2
from mechanize import Browser, _http, urlopen
from BeautifulSoup import BeautifulSoup
import cookielib

data_url = "http://data.theice.com/ViewData/EndOfDay/LdnOptions.aspx?p=AER"

def are_we_logged_on(html):
    soup = BeautifulSoup(html)
    elem = soup.find("input", {"id" : "ctl00_ContentPlaceHolder1_LoginControl_m_userName" } )
    return elem is None


# Browser
br = Browser()

# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)

# Browser options
br.set_handle_equiv(True)
#br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)

# Follows refresh 0 but does not hang on refresh > 0
br.set_handle_refresh(_http.HTTPRefreshProcessor(), max_time=1)

# User-Agent (this is cheating, ok?)
br.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:32.0) Gecko/20100101 Firefox/32.0')]

# The site we will navigate into, handling its session
response = br.open(data_url)
html = response.get_data()

# do we need to log in?
logged_on = are_we_logged_on(html)


if not logged_on :
    print "DEBUG: Attempting to log in ..."
    # Select the first (index zero) form
    br.select_form(nr=0)

    # User credentials
    br.form['ctl00$ContentPlaceHolder1$LoginControl$m_userName'] = 'username'
    br.form['ctl00$ContentPlaceHolder1$LoginControl$m_password'] = 'password'

    # Login
    post_url, post_data, headers =  br.form.click_request_data()
    print post_url
    print post_data
    print headers
    resp = urlopen(post_url, post_data)

    # Check if login successful
    html2 = resp.read()
    logged_on = are_we_logged_on(html2)

    if not logged_on:
        with open("icedump_fail.html","w") as f:
            f.write(html2)        
        print "DEBUG: Failed to logon. Aborting script ...!"
        exit(-1)


# If we got this far, then we are logged in ...

When I run the script, the execution path always leads to the "Failed to logon" message being printed to the screen.

Can anyone spot what I may be doing wrong? I am fresh out of ideas and could use a fresh pair of eyes.

2 answers:

Answer 0: (score: 3)

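Turning on debug mode (br.set_debug_http(True)) helped me inspect the underlying request that mechanize sends when it submits the login form, and compare it with the actual request sent when you log in using a browser.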

This revealed that the __EVENTTARGET parameter was empty when it should not be.
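
The comparison described above can be done by eye, but a small helper makes the mismatch easier to spot. The following is a minimal Python 3 sketch (the question's code is Python 2), assuming you have copied the two URL-encoded POST bodies into strings: one from mechanize's debug output and one from the browser's network panel. The function name diff_form_bodies and the shortened sample bodies are hypothetical and only for illustration; a real ASP.NET body would also carry fields such as __VIEWSTATE.

```python
from urllib.parse import parse_qs


def diff_form_bodies(sent_by_script, sent_by_browser):
    """Return {field: (script_values, browser_values)} for fields that differ."""
    # keep_blank_values=True keeps empty fields such as "__EVENTTARGET="
    # instead of silently dropping them.
    script = parse_qs(sent_by_script, keep_blank_values=True)
    browser = parse_qs(sent_by_browser, keep_blank_values=True)

    diffs = {}
    for name in sorted(set(script) | set(browser)):
        script_values = script.get(name)    # None if the field is missing
        browser_values = browser.get(name)
        if script_values != browser_values:
            diffs[name] = (script_values, browser_values)
    return diffs


# Hypothetical, shortened request bodies used only to exercise the helper.
script_body = "__EVENTTARGET=&user=alice&password=secret"
browser_body = (
    "__EVENTTARGET=ctl00%24ContentPlaceHolder1%24LoginControl%24LoginButton"
    "&user=alice&password=secret"
)

for field, (from_script, from_browser) in diff_form_bodies(script_body, browser_body).items():
    print(field, from_script, from_browser)
```

With these sample inputs, only __EVENTTARGET is reported, which matches the kind of difference found on this login form.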

Here is the fixed part of the code that solved the problem for me:

br.select_form(nr=0)
# Hidden controls such as __EVENTTARGET are read-only by default; make them writable
br.form.set_all_readonly(False)

br.form['ctl00$ContentPlaceHolder1$LoginControl$m_userName'] = 'username'
br.form['ctl00$ContentPlaceHolder1$LoginControl$m_password'] = 'password'
br.form['__EVENTTARGET'] = 'ctl00$ContentPlaceHolder1$LoginControl$LoginButton'

# Login
response = br.submit()
html2 = response.read()
logged_on = are_we_logged_on(html2)
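
For context on why setting __EVENTTARGET matters: in ASP.NET Web Forms, the page's __doPostBack JavaScript function fills the hidden __EVENTTARGET and __EVENTARGUMENT fields before submitting the form, and mechanize does not run JavaScript. The sketch below (Python 3, standard library only) imitates that step when building a POST body by hand. The helper name build_postback_body and the placeholder __VIEWSTATE value are assumptions for illustration; on a real page the hidden field values must be read from the HTML.

```python
from urllib.parse import urlencode


def build_postback_body(form_fields, event_target, event_argument=""):
    """Return a URL-encoded body that mimics what __doPostBack would submit."""
    fields = dict(form_fields)  # copy so the caller's dict is not modified
    fields["__EVENTTARGET"] = event_target
    fields["__EVENTARGUMENT"] = event_argument
    return urlencode(fields)


# Hypothetical form contents; the real page supplies its own hidden values.
form_fields = {
    "__EVENTTARGET": "",
    "__EVENTARGUMENT": "",
    "__VIEWSTATE": "PLACEHOLDER",
    "ctl00$ContentPlaceHolder1$LoginControl$m_userName": "username",
    "ctl00$ContentPlaceHolder1$LoginControl$m_password": "password",
}

body = build_postback_body(
    form_fields,
    "ctl00$ContentPlaceHolder1$LoginControl$LoginButton",
)
print(body)
```

In the mechanize fix above, assigning br.form['__EVENTTARGET'] and calling br.submit() achieves the same effect while keeping the browser's cookies and the remaining hidden fields intact.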

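As a side note, make sure you are not violating the agreement you "digitally signed" while registering at "ICE":

Scraping:

The scraping of this website for the purpose of extracting data automatically from this website is strictly prohibited by ICE, and it should be noted that this process could result in a drain on ICE's system resources. ICE (or its affiliates, agents or contractors) may monitor usage of this website for scraping and may take all necessary steps to ensure that access to this website is removed from entities carrying out or reasonably believed to be carrying out web scraping activities.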

Answer 1: (score: 0)

I would use Selenium, since it is full-featured and more powerful. You can also actually watch the result:

from selenium import webdriver

chrome = webdriver.Chrome()
chrome.get('http://data.theice.com/ViewData/EndOfDay/LdnOptions.aspx?p=AER')

user = chrome.find_element_by_name('ctl00$ContentPlaceHolder1$LoginControl$m_userName')
pswd = chrome.find_element_by_name('ctl00$ContentPlaceHolder1$LoginControl$m_password')
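# The underscore-separated string follows the ASP.NET id pattern, not the name pattern,
# so it is looked up by id here (assumption; check the page source)
form = chrome.find_element_by_id('ctl00_ContentPlaceHolder1_LoginControl_LoginButton')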

user.send_keys(your_username_string)
pswd.send_keys(your_password_string)
form.click() # hit the login button