Scrapy authentication via form submission?

Asked: 2015-03-17 17:25:31

Tags: python forms authentication scrapy

Form source:

<!doctype html public "-//W3C//DTD HTML 4.01 Transitional//EN">
<html lang="en-US">
<head>
 <meta http-equiv="pragma" content="no-cache">
 <meta http-equiv='cache-control' content='no-cache'>
 <meta http-equiv='cache-control' content='no-store'>
 <meta http-equiv='cache-control' content='max-age=0'>
 <meta http-equiv='expires' content='0'>
 <meta http-equiv="content-type" content="text/html; charset=UTF-8">
 <meta http-equiv="content-style-type" content="text/css">
 <meta http-equiv="content-script-type" content="text/javascript">
 <link rel="stylesheet" href="../../css/common.css" type="text/css">
 <link rel="stylesheet" href="../../css/login.css"  type="text/css">
 <script type="text/javascript">
<!-- 
  function popupCopyright() {
    window.open( "copyright.html", "", "width=350,height=300" );
  }
//-->
 </script>
 <title>KONICA MINOLTA PageScope Web Connection</title>
</head>
<body>
 <div class="page_body">
  <div class="page_top">
   <a class="top_logo" href="http://konicaminolta.net" target="_blank"><img src="logo_companyL.gif" alt="KONICA MINOLTA Logo"></a>
   <a class="top_logo" href="copyright.html" onclick="popupCopyright();return false;" onkeypress="popupCopyright();return false;"><img src="logo_utilityL.gif" alt="PageScope Web Connection Logo"></a>
   <div class="tab_footer"></div>
  </div>
 <form name="lang_link" action="index.html" method="post" enctype="application/x-www-form-urlencoded">
  <input type="hidden" name="lang">
 </form>
 <form action="index.cgi" method="post" enctype="application/x-www-form-urlencoded">
<input type="hidden" name="lang" value="1">
  <div class="page_menu">
   <div class="page_footer">
   </div>
  </div>
  <div class="page_main">
   <h1 class="title">Language</h1>
   <select onchange="document.lang_link.lang.value=this.value;document.lang_link.submit();" name="linklist">
<option value="1" selected>English (English)</option>
<option value="2">Français (French)</option>
<option value="3">Italiano (Italian)</option>
<option value="4">Deutsch (German)</option>
<option value="5">Español (Spanish)</option>
<option value="6">Português (Portuguese)</option>
<option value="10">Čeština (Czech)</option>
<option value="12">Polski (Polish)</option>
<option value="14">Русский (Russian)</option>
<option value="15">Nederlands (Dutch)</option>
<option value="23">日本語 (Japanese)</option>
<option value="7">한국어 (Korean)</option>
<option value="8">简体中文 (Chinese-Simplified)</option>
<option value="9">繁體中文 (Chinese-Traditional)</option>
</select>
   <hr>
   <h1 class="title">Log in</h1>


   <dl class="main1">
    <dt class="check1"><input type="radio" name="reg" value="1" id="public"></dt>
    <dd class="check1"><label for="public">Public User</label></dd>
   </dl>



   <dl class="main1">
    <dt class="check1"><input type="radio" name="reg" value="4" id="admin"></dt>
    <dd class="check1"><label for="admin">Administrator</label></dd>
   </dl>
   <p class="attention">SSL is not set-up. Please set up SSL after admin logins to secure safety of the information.</p>
   <div class="page_footer">
    <hr class="page_boader">
    <input type="submit" value="Log in">
    <input type="reset" value="Clear">
   </div>
  </div>
 </form>
 </div>
</body>
</html>
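One detail worth noting in the markup above: the page contains two `<form>` elements, and only the second one (posting to `index.cgi`) carries the `reg` radio buttons. A minimal stdlib sketch (using a trimmed copy of the markup, which is an assumption made for brevity) that lists each form's action and submittable field names:

```python
# Enumerate the forms in the login page and the input names each would
# submit, using only the standard library (no Scrapy required).
from html.parser import HTMLParser

PAGE = """
<form name="lang_link" action="index.html" method="post">
 <input type="hidden" name="lang">
</form>
<form action="index.cgi" method="post">
 <input type="hidden" name="lang" value="1">
 <input type="radio" name="reg" value="1" id="public">
 <input type="radio" name="reg" value="4" id="admin">
 <input type="submit" value="Log in">
</form>
"""

class FormLister(HTMLParser):
    def __init__(self):
        super().__init__()
        self.forms = []  # list of (action, [input names])

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.forms.append((attrs.get("action"), []))
        elif tag == "input" and self.forms and attrs.get("name"):
            self.forms[-1][1].append(attrs["name"])

parser = FormLister()
parser.feed(PAGE)
for action, names in parser.forms:
    print(action, names)
# index.html ['lang']
# index.cgi ['lang', 'reg', 'reg']
```

This matters because Scrapy's `FormRequest.from_response` fills the *first* form on the page by default; if the spider targets the `lang_link` form, the `reg` field is silently dropped. The `formnumber` (0-based) or `formxpath` argument of `from_response` selects a different form.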

I'm simply trying to follow the Scrapy docs on authentication, and I can't get it to simulate the login correctly.

Current code (adapted from a usage example by StackOverflow user Acorn):

from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest

from langmeters.items import LangItem

class LangSpider(InitSpider):
    name = "lang"
    allowed_domains = []
    login_page = 'http://192.168.3.189/index.html'
    start_urls = [
        "http://192.168.3.189/m_s_dev.html",
        "http://192.168.3.189/m_s_cnt_total.html",
    ]

    def init_request(self):
        # Fetch the login page before the regular crawl starts.
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        # Fill and submit the login form found on the login page.
        return FormRequest.from_response(response,
                    formdata={'reg': '1'},
                    callback=self.check_login_response)

    def check_login_response(self, response):
        if "C3100P" in response.body:
            self.log("Successfully logged in. Crawling may start")
            # Hand control back to InitSpider so start_urls get crawled.
            return self.initialized()
        else:
            self.log("Failed Login!")

    def parse(self, response):
        item = LangItem()
        item['cmeter'] = response.xpath(
            '//dt[contains(p, "Engine")]/following-sibling::dd/text()').extract()
        return item

I fail to log in every time, so I'm clearly not sending the right input. When I submit the form manually, I notice the browser is sent to an "index.cgi" address; if I click Log in without selecting either Public User or Administrator, it just returns index.html.
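For reference, a successful browser submission of the login form POSTs to `index.cgi` with both the hidden and the radio field. A small stdlib sketch of the resolved endpoint and request body (field names taken from the markup above; the host is the one from the question):

```python
from urllib.parse import urlencode, urljoin

login_page = "http://192.168.3.189/index.html"
# The login form's action attribute, resolved against the page URL:
action = urljoin(login_page, "index.cgi")

# Fields the browser submits when "Public User" (reg=1) is selected;
# the hidden lang field defaults to "1" (English).
body = urlencode({"lang": "1", "reg": "1"})

print(action)  # http://192.168.3.189/index.cgi
print(body)    # lang=1&reg=1
```

If the request the spider actually sends (visible in Scrapy's DEBUG log) goes to `index.html` instead of `index.cgi`, or omits `reg`, the wrong form is being filled.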

Note: on success, index.cgi redirects to m_s_dev.html, which is where the first crawl runs.

0 answers:

There are no answers yet.