Form source:
<!doctype html public "-//W3C//DTD HTML 4.01 Transitional//EN">
<html lang="en-US">
<head>
<meta http-equiv="pragma" content="no-cache">
<meta http-equiv='cache-control' content='no-cache'>
<meta http-equiv='cache-control' content='no-store'>
<meta http-equiv='cache-control' content='max-age=0'>
<meta http-equiv='expires' content='0'>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<meta http-equiv="content-style-type" content="text/css">
<meta http-equiv="content-script-type" content="text/javascript">
<link rel="stylesheet" href="../../css/common.css" type="text/css">
<link rel="stylesheet" href="../../css/login.css" type="text/css">
<script type="text/javascript">
<!--
function popupCopyright() {
window.open( "copyright.html", "", "width=350,height=300" );
}
//-->
</script>
<title>KONICA MINOLTA PageScope Web Connection</title>
</head>
<body>
<div class="page_body">
<div class="page_top">
<a class="top_logo" href="http://konicaminolta.net" target="_blank"><img src="logo_companyL.gif" alt="KONICA MINOLTA Logo"></a>
<a class="top_logo" href="copyright.html" onclick="popupCopyright();return false;" onkeypress="popupCopyright();return false;"><img src="logo_utilityL.gif" alt="PageScope Web Connection Logo"></a>
<div class="tab_footer"></div>
</div>
<form name="lang_link" action="index.html" method="post" enctype="application/x-www-form-urlencoded">
<input type="hidden" name="lang">
</form>
<form action="index.cgi" method="post" enctype="application/x-www-form-urlencoded">
<input type="hidden" name="lang" value="1">
<div class="page_menu">
<div class="page_footer">
</div>
</div>
<div class="page_main">
<h1 class="title">Language</h1>
<select onchange="document.lang_link.lang.value=this.value;document.lang_link.submit();" name="linklist">
<option value="1" selected>English (English)</option>
<option value="2">Français (French)</option>
<option value="3">Italiano (Italian)</option>
<option value="4">Deutsch (German)</option>
<option value="5">Español (Spanish)</option>
<option value="6">Português (Portuguese)</option>
<option value="10">Čeština (Czech)</option>
<option value="12">Polski (Polish)</option>
<option value="14">Русский (Russian)</option>
<option value="15">Nederlands (Dutch)</option>
<option value="23">日本語 (Japanese)</option>
<option value="7">한국어 (Korean)</option>
<option value="8">简体中文 (Chinese-Simplified)</option>
<option value="9">繁體中文 (Chinese-Traditional)</option>
</select>
<hr>
<h1 class="title">Log in</h1>
<dl class="main1">
<dt class="check1"><input type="radio" name="reg" value="1" id="public"></dt>
<dd class="check1"><label for="public">Public User</label></dd>
</dl>
<dl class="main1">
<dt class="check1"><input type="radio" name="reg" value="4" id="admin"></dt>
<dd class="check1"><label for="admin">Administrator</label></dd>
</dl>
<p class="attention">SSL is not set-up. Please set up SSL after admin logins to secure safety of the information.</p>
<div class="page_footer">
<hr class="page_boader">
<input type="submit" value="Log in">
<input type="reset" value="Clear">
</div>
</div>
</form>
</div>
</body>
</html>
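I'm just trying to follow the Scrapy docs for authentication, and I can't get the spider to simulate the login correctly.
Current code (based on a usage example by Stack Overflow user Acorn):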
import scrapy
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule
from langmeters.items import LangItem


class LangSpider(InitSpider):
    name = "lang"
    allowed_domains = []
    login_page = 'http://192.168.3.189/index.html'
    start_urls = [
        "http://192.168.3.189/m_s_dev.html",
        "http://192.168.3.189/m_s_cnt_total.html",
    ]

    def init_request(self):
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        return FormRequest.from_response(response,
                                         formdata={'reg': '1'},
                                         callback=self.check_login_response)

    def check_login_response(self, response):
        if "C3100P" in response.body:
            self.log("Successfully logged in. Crawling may start")
            self.initialized()
            print "finished"
        else:
            self.log("Failed Login!")

    def parse(self, response):
        item = LangItem()
        item['cmeter'] = response.xpath('//dt[contains(p, "Engine")]/following-sibling::dd/text()').extract()
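The login fails every time, so I'm clearly not sending the right input. I noticed that submitting the form in a browser takes me to an "index.cgi" address. If I select neither Public User nor Administrator and just click Log in, it simply returns index.html.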
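To see which inputs the page would actually submit, here is a minimal sketch (Python 3, standard library only) that lists the forms in a trimmed copy of the HTML above. `LOGIN_HTML`, `FormLister`, and `public_login_body` are illustrative names, and the POST body is an assumption about what a browser sends for "Public User", not a captured request. It may be relevant that Scrapy's `FormRequest.from_response` uses `formnumber=0` by default, which on this page would be the `lang_link` form that posts to index.html rather than the login form that posts to index.cgi.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# Trimmed copy of the login page above: only form-related tags are kept.
LOGIN_HTML = """
<form name="lang_link" action="index.html" method="post">
<input type="hidden" name="lang">
</form>
<form action="index.cgi" method="post">
<input type="hidden" name="lang" value="1">
<select name="linklist">
<option value="1" selected>English (English)</option>
<option value="2">Francais (French)</option>
</select>
<input type="radio" name="reg" value="1" id="public">
<input type="radio" name="reg" value="4" id="admin">
<input type="submit" value="Log in">
<input type="reset" value="Clear">
</form>
"""


class FormLister(HTMLParser):
    """Collect each form's action and the names of its named controls."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self.current = None  # form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.current = {"action": attrs.get("action"), "fields": []}
            self.forms.append(self.current)
        elif tag in ("input", "select") and self.current is not None:
            name = attrs.get("name")
            # Unnamed controls (the submit and reset buttons) are never sent.
            if name and name not in self.current["fields"]:
                self.current["fields"].append(name)

    def handle_endtag(self, tag):
        if tag == "form":
            self.current = None


lister = FormLister()
lister.feed(LOGIN_HTML)
for index, form in enumerate(lister.forms):
    print(index, form["action"], form["fields"])

# Assumed body for a "Public User" login: hidden lang, selected language, reg=1.
public_login_body = urlencode({"lang": "1", "linklist": "1", "reg": "1"})
print(public_login_body)
```

If the first form turns out to be the one being submitted, passing `formnumber=1` to `FormRequest.from_response` is one thing to try; whether the device accepts that request is not verified here.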
Note: on success, index.cgi redirects to m_s_dev.html, which is where the first crawl happens.
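Since a successful login ends on m_s_dev.html, the final URL is another signal that could be checked besides searching the body for "C3100P". The helper below is a hypothetical sketch (Python 3, standard library only); `landed_on_device_page` is an illustrative name, not part of Scrapy. In a spider it could be given `response.url`, which holds the final URL once Scrapy's redirect middleware has followed the redirect.

```python
from urllib.parse import urlsplit


def landed_on_device_page(final_url, expected_page="m_s_dev.html"):
    """Return True when the last path segment of final_url is expected_page."""
    path = urlsplit(final_url).path
    return path.rsplit("/", 1)[-1] == expected_page


# Redirected to the device page: treated as a successful login.
print(landed_on_device_page("http://192.168.3.189/m_s_dev.html"))
# Sent back to the login page: treated as a failed login.
print(landed_on_device_page("http://192.168.3.189/index.html"))
```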