
Question

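I'm trying to write some test scripts for a web application. I tried twill, which turns out to use mechanize to parse HTML, but it really let me down. For example, it somehow fails to recognize that a form on the page uses the method "POST" rather than "GET".

So, is there a better alternative, short of using urllib2 directly?

Edit: here is the form that twill fails to recognize.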

<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML Transitional//EN'
'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'>
<html>
<head>
<meta http-equiv='Content-type' content='text/html; charset=utf-8' />
<title> Login  </title>
<link rel='stylesheet' href='/assets/styles/default.css' type='text/css'/>
<link rel='stylesheet' href='/assets/styles/button.css' type='text/css'/>

<link rel='stylesheet' href='/assets/styles/login.css' type='text/css'/>

<link rel='icon' type='image/x-icon' href='/assets/favicon.ico' /> 
</head>
<body >

<div id='content_area'>

<div id='outter_login_area'>
    <div id='inner_login_area'>
        <div id='login_head'>
            <div>
            Login with your
            </div>
            <div id='login_head_bottom'>
                <img src='/assets/images/header_logo.png' class='login_logo'/>
                <b>Admin Account</b>
            </div>
        </div>
        <hr/>
        <form action="." method="POST" id='user_login_form'>
            <div style='display:none'><input type='hidden' name='csrfmiddlewaretoken' value='c6b6e0ca08d53093428c61f62f51ea1f' /></div>

            <div>
                <label for="id_username">User Name</label>
                <input id="id_username" type="text" name="username" maxlength="30" />

            </div>
            <div>
                <label for="id_password">Password</label>
                <input type="password" name="password" id="id_password" />

            </div>
            <div id='submit_bar'>
                <input name='submit_button' type="submit" value="Submit" class="button blue"/>
            </div>
        </form>
    </div>
</div>

</div>
</body>
</html>

Here is what twill reports:

In [5]: br.go('http://localhost:8000/')
==> at http://localhost:8000/accounts/login/?next=/chancellor/

In [6]: br.get_all_forms() 
Out[6]: [<_mechanize_dist.ClientForm.HTMLForm instance at 0x03112F80>]

In [7]: br.get_all_forms()[0] 
Out[7]: <_mechanize_dist.ClientForm.HTMLForm instance at 0x03112F80>

In [8]: br.get_all_forms()[0].method 
Out[8]: 'GET'

Answer 1

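Well, that's not really true.

mechanize can tell whether a form's method is POST or GET. It looks at the method attribute of the <form> tag.

So if it isn't working in your particular case, you'll have to find out what is going wrong. Can you provide the HTML source of the page you're working with? I suspect the form isn't declared as POST; otherwise mechanize would detect it.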
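One way to check what a page actually declares, independently of mechanize, is to read the method attribute of each <form> tag with a separate parser. The following is only a minimal sketch: it assumes Python 3 and its standard-library html.parser, and the names FormMethodCollector and form_methods are made up for illustration, not part of mechanize or twill.

```python
from html.parser import HTMLParser


class FormMethodCollector(HTMLParser):
    """Collect the declared method of every <form> tag in a document."""

    def __init__(self):
        super().__init__()
        self.methods = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            # HTML falls back to GET when the method attribute is absent.
            method = dict(attrs).get("method") or "GET"
            self.methods.append(method.upper())


def form_methods(html):
    """Return the methods of all forms in `html`, in document order."""
    parser = FormMethodCollector()
    parser.feed(html)
    parser.close()
    return parser.methods
```

If this reports POST for a page while mechanize reports GET, the difference is more likely to lie in how the page is parsed than in the form declaration itself.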

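That said, if you're looking for alternatives, I'd recommend scrapy for web scraping. It is a fast, high-level screen scraping and web crawling framework, written from scratch, for crawling websites and extracting structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Edit:

I saved your HTML snippet to /tmp/test.html and ran the following code: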

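import mechanize

br = mechanize.Browser()
br.open('file:///tmp/test.html')
br.select_form(nr=0)  # select the first (and only) form on the page
print(br.method)      # attribute access is forwarded to the selected form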

I get POST as the result, so I can't reproduce your problem.

Are you sure this is the page you're parsing?

Edit 2:

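Your HTML is tripping up the parser. The line immediately above <form> contains <hr/>. That is valid XHTML, but the parser mechanize uses here does not seem to handle it; it needs <hr> or <hr /> to parse the page correctly.

Here is a workaround that patches the page before it is parsed: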

import mechanize
br = mechanize.Browser()
response = br.open('file:///tmp/test.html')

# fix the page so it is correctly parsed:
response.set_data(response.get_data().replace('<hr/>', '<hr />'))
br.set_response(response)

br.select_form(nr=0)
print(br.method)
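The workaround above only patches <hr/>. If other attribute-less self-closing tags such as <br/> cause the same trouble (an untested assumption), the replacement could be generalized with a regular expression. This is a sketch only, and pad_self_closing_tags is a hypothetical helper name:

```python
import re

# Matches an attribute-less self-closing tag written without a space,
# such as <hr/> or <br/>. Tags with attributes are left untouched.
_TIGHT_SELF_CLOSING = re.compile(r"<([A-Za-z][A-Za-z0-9]*)/>")


def pad_self_closing_tags(html):
    """Rewrite <tag/> as <tag /> so lenient HTML parsers accept it."""
    return _TIGHT_SELF_CLOSING.sub(r"<\1 />", html)
```

It would take the place of the .replace('<hr/>', '<hr />') call, assuming get_data() returns text rather than bytes.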

Is there any library that offers functionality similar to twill and mechanize, but of better quality?
