Handling a non-standard 401 header with mechanize

Asked: 2013-11-14 09:31:59

Tags: python http web-scraping mechanize http-status-code-401

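I am trying to log in automatically to this page: http://portal.globaltransit.net/. On the first visit the server returns a 401 status, but instead of the standard HTTP basic auth prompt the body contains an HTML login form. This is the output of curl -vvv http://portal.globaltransit.net/:
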
* About to connect() to portal.globaltransit.net port 80 (#0)
* Trying 124.158.236.65... connected
* Connected to portal.globaltransit.net (124.158.236.65) port 80 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.19.7 (i486-pc-linux-gnu) libcurl/7.19.7 OpenSSL/0.9.8k zlib/1.2.3.3 libidn/1.15
> Host: portal.globaltransit.net
> Accept: */*
>
< HTTP/1.1 401 Unauthorized
< Date: Thu, 14 Nov 2013 07:18:06 GMT
< Server: Apache
< X-Powered-By: PHP/5.2.11
< Set-Cookie: symfony=1960d9b76a5f9fc3b00786e126cc69af; path=/
< Content-Length: 1211
< Content-Type: text/html; charset=utf-8
<
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
        <title></title>
    <link rel="shortcut icon" href="/favicon.ico" />
    <link rel="stylesheet" type="text/css" media="screen" href="/css/main.css" />
      </head>
  <body>


<form action="/login" method="post">
  <table>
    <tr> 
  <th><label for="signin_username">Username</label></th>
  <td><input type="text" name="signin[username]" id="signin_username" /></td>
</tr>
<tr>
  <th><label for="signin_password">Password</label></th>
  <td><input type="password" name="signin[password]" id="signin_password" /></td>
</tr>
<tr>
  <th><label for="signin_remember">Remember</label></th>
  <td><input type="checkbox" name="signin[remember]" id="signin_remember" /><input type="hidden" name="signin[_csrf_token]" value="6bdf80ca900038ada394467752593135" id="signin__csrf_token" /></td>
</tr>
  </table>

  <input type="submit" value="sign in" />
  <a href="/request_password">Forgot your password?</a>
</form>
  </body>
</html>

When I try to load the page with mechanize using the following script:

import mechanize
import mimetypes
import logging
import urllib2
from urlparse import urlparse
import cookielib
from base64 import b64encode
class Browser:
    def __init__(self, url):
        br = mechanize.Browser()
        br.set_handle_robots(False)   # no robots
        br.set_handle_refresh(False)
        br.set_handle_redirect(True)
        br.set_debug_http(True)
        cj = cookielib.LWPCookieJar()
        br.set_cookiejar(cj)  # can sometimes hang without this
        br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
        self.page = br.open(url).read()
        print self.page
if __name__ == '__main__':
    browser = Browser("http://portal.globaltransit.net/")

I get the following error: mechanize._response.httperror_seek_wrapper: HTTP Error 401: Unauthorized. Is there any way to make mechanize ignore the 401 returned by the server so that I can work with the form?

3 Answers:

Answer 0 (score: 3)

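The error is raised in mechanize.Browser._mech_open, and from reading it there does not seem to be any way to disable it. However, you can monkey patch that function: copy the function from the mechanize source into your own code, then replace "if not success:" with "if not success and response.getcode() != 401:" so that the response is still returned when the error is a 401. Then apply the patch with mechanize.Browser._mech_open = _mech_open. Do this right after importing the module.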
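
A variation on the same monkey patch, sketched here under assumptions: instead of copying the library function, wrap the existing method and hand the error object back when its code is 401. The helper tolerate_status and the classes StubBrowser and StubHTTPError are hypothetical names invented for this illustration. The stubs only mimic the way mechanize raises an error object that carries the response body, so the sketch runs without mechanize installed.

```python
def tolerate_status(cls, method_name, error_type, status=401):
    """Wrap cls.<method_name> so errors carrying `status` are returned, not raised."""
    original = getattr(cls, method_name)

    def patched(self, *args, **kwargs):
        try:
            return original(self, *args, **kwargs)
        except error_type as err:
            if getattr(err, "code", None) == status:
                return err  # hand the error back as if it were the response
            raise  # every other error still propagates

    setattr(cls, method_name, patched)


# Hypothetical stand-ins (not part of mechanize) so the sketch is self-contained.
class StubHTTPError(Exception):
    def __init__(self, code, body):
        super().__init__("HTTP Error %d" % code)
        self.code = code
        self.body = body

    def read(self):
        return self.body


class StubBrowser(object):
    def _mech_open(self, url):
        if url.endswith("/missing"):
            raise StubHTTPError(404, b"not found")
        raise StubHTTPError(401, b"<form action='/login'></form>")

    def open(self, url):
        return self._mech_open(url)


# Patch the class once, right after it is defined or imported.
tolerate_status(StubBrowser, "_mech_open", StubHTTPError)
response = StubBrowser().open("http://portal.globaltransit.net/")
login_page = response.read()
```

With mechanize itself the corresponding call would presumably be tolerate_status(mechanize.Browser, "_mech_open", mechanize.HTTPError), run right after the import. That has not been checked against any particular mechanize version, and because _mech_open is a private method it may change between releases.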

Answer 1 (score: 3)

I think you could do something like this:

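import mechanize

try:
    response = mechanize.urlopen("http://portal.globaltransit.net/")
except mechanize.HTTPError as error:
    # The HTTPError is itself a response object, so the body sent
    # with the 401 (the login form) can still be read from it.
    response = error

body = response.read()
# Do stuff with the form in the response body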
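
The same pattern can be tried with only the Python 3 standard library, where urllib.error.HTTPError is likewise a file-like object that carries the response body. The sketch below is an illustration, not the answerer's code: open_allowing and fake_opener are hypothetical names, and fake_opener stands in for the real server so that nothing is sent over the network.

```python
import io
import urllib.error
import urllib.request
from email.message import Message


def open_allowing(url, allowed=(401,), opener=urllib.request.urlopen):
    """Return (status, body) for url, treating the `allowed` error codes as normal responses."""
    try:
        response = opener(url)
    except urllib.error.HTTPError as err:
        if err.code not in allowed:
            raise  # unexpected errors still propagate
        response = err  # HTTPError is file-like, so the body is still readable
    return response.getcode(), response.read()


def fake_opener(url):
    # Hypothetical stand-in for the server: always answers 401 with a login form.
    raise urllib.error.HTTPError(
        url, 401, "Unauthorized", Message(),
        io.BytesIO(b"<form action='/login'></form>"),
    )


status, body = open_allowing("http://portal.globaltransit.net/", opener=fake_opener)
```

Leaving out the opener argument makes the function use urllib.request.urlopen, which would contact the real host.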

Answer 2 (score: 0)

There is actually a way to get a non-401 response from the server.

First, try this:

for form in br.forms():
   print "Form name:", form.name
   print form

The output looks like this:

Form name: None
<POST http://portal.globaltransit.net/login application/x-www-form-urlencoded
<TextControl(signin[username]=)>
<PasswordControl(signin[password]=)>
<CheckboxControl(signin[remember]=[on])>
<HiddenControl(signin[_csrf_token]=ec9a290dcc8d71e458d31a0fd509376b) (readonly)>
<SubmitControl(<None>=sign in) (readonly)>>

Here you can see that the actual URL returned is slightly different from the one you used in your code.

Now do:

response = br.open("http://portal.globaltransit.net/login application/x-www-form-urlencoded")

The response is:

send: 'GET /login application/x-www-form-urlencoded HTTP/1.1\r\nAccept-Encoding:  
identity\r\nHost: portal.globaltransit.net\r\nCookie:  
symfony=f6fa25cf26e310e7e8bb3170637fdd73\r\nConnection: close\r\nUser-Agent:  
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 
Firefox/3.0.1\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Tue, 26 Nov 2013 17:45:01 GMT
header: Server: Apache
header: X-Powered-By: PHP/5.2.11
header: Content-Length: 1211
header: Connection: close
header: Content-Type: text/html; charset=utf-8

The key here is to use the URL that br.forms() reports for the form. You can then work with the form as usual.
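
For reference, the first line of the form listing above appears to hold three separate fields: the method (POST), the action URL (http://portal.globaltransit.net/login), and the encoding type (application/x-www-form-urlencoded). The sketch below uses only the Python 3 standard library, not mechanize, to pull the action and the hidden CSRF token out of the login HTML and build the POST body. LoginFormParser, build_login_request, and the credentials "alice" and "secret" are hypothetical; the HTML is a trimmed copy of the form from the question.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin

# Trimmed copy of the login form shown in the question's curl output.
LOGIN_HTML = """
<form action="/login" method="post">
  <input type="text" name="signin[username]" id="signin_username" />
  <input type="password" name="signin[password]" id="signin_password" />
  <input type="checkbox" name="signin[remember]" id="signin_remember" />
  <input type="hidden" name="signin[_csrf_token]" value="6bdf80ca900038ada394467752593135" id="signin__csrf_token" />
  <input type="submit" value="sign in" />
</form>
"""


class LoginFormParser(HTMLParser):
    """Collect the first form's action and method plus all hidden inputs."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = None
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and self.action is None:
            self.action = attrs.get("action") or ""
            self.method = (attrs.get("method") or "get").upper()
        elif tag == "input" and attrs.get("type") == "hidden" and attrs.get("name"):
            self.hidden[attrs["name"]] = attrs.get("value") or ""


def build_login_request(page_url, html, username, password):
    """Return (method, absolute action URL, urlencoded body) for the login form."""
    parser = LoginFormParser()
    parser.feed(html)
    fields = dict(parser.hidden)  # carries the CSRF token along
    fields["signin[username]"] = username
    fields["signin[password]"] = password
    return parser.method, urljoin(page_url, parser.action or ""), urlencode(fields)


method, url, body = build_login_request(
    "http://portal.globaltransit.net/", LOGIN_HTML, "alice", "secret"
)
```

The token is presumably tied to the symfony session cookie, so a real login would need to send the POST with the cookie received alongside the form.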

A good guide can be found here: http://www.pythonforbeginners.com/cheatsheet/python-mechanize-cheat-sheet/