Using Scrapy to scrape pages behind authentication

Time: 2014-12-17 08:17:08

Tags: authentication scrapy

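Taking cues and ideas from a previous post, I tried to come up with my own code.

However, my code does not actually scrape anything, and it probably never gets past the authentication step. I say this because I see no error in the log even when I enter a wrong password.

My best guess is that the HTML for the authentication fields is not inside the "form" tag, so formdata may be ignoring it. I could be wrong.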
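One way to test this guess without running the spider is to walk the login HTML and record, for each input, whether it appears between the opening and closing form tags. The sketch below uses only Python's standard html.parser on a trimmed copy of the HTML shown further down. The names FormInputAudit and SAMPLE_HTML are made up for illustration, and the token value is a placeholder.

```python
from html.parser import HTMLParser

# Trimmed copy of the login markup from the question (assumption: the
# real page has the same nesting; "TOKEN" is a placeholder value).
SAMPLE_HTML = """
<div id="sign-in-form">
  <form action="/employees/sign_in" id="new_employee" method="post">
    <input name="authenticity_token" type="hidden" value="TOKEN" />
    <div class="form-row">
      <input id="employee_email" name="employee[email]" type="email" value="" />
    </div>
    <div class="form-row">
      <input id="employee_password" name="employee[password]" type="password" />
    </div>
  </form>
</div>
"""


class FormInputAudit(HTMLParser):
    """Record every named <input> and whether it sits inside a <form>."""

    def __init__(self):
        super().__init__()
        self.form_depth = 0
        self.inputs = {}  # input name -> True if it was seen inside a form

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.form_depth += 1
        elif tag == "input":
            name = dict(attrs).get("name")
            if name:
                self.inputs[name] = self.form_depth > 0

    def handle_endtag(self, tag):
        if tag == "form" and self.form_depth > 0:
            self.form_depth -= 1


audit = FormInputAudit()
audit.feed(SAMPLE_HTML)
for name, inside in audit.inputs.items():
    print(name, "inside form" if inside else "outside form")
```

On this trimmed markup every field reports as inside the form, which points away from the "fields outside the form tag" guess.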

My code so far:

# Imports for the Scrapy 0.24-era API used below
from scrapy import log
from scrapy.http import FormRequest, Request
from scrapy.item import Field, Item
from scrapy.selector import Selector
from scrapy.spider import BaseSpider


class TestItem(Item):
    # An Item only accepts declared fields, so "Test" has to be declared
    Test = Field()


class LoginSpider(BaseSpider):
    name = 'auth1'
    start_urls = ['http://www.example.com/administration']

    def parse(self, response):
        return [FormRequest.from_response(response,
                    formdata={'employee[email]': 'xyz@abc.com', 'employee[password]': 'XYZ'},
                    formxpath='//div[@class="form-row"]',
                    callback=self.after_login)]

    def after_login(self, response):
        if "authentication failed" in response.body:
            self.log("Login failed", level=log.ERROR)
            return
        else:
            # We've successfully authenticated, let's have some fun!
            return Request(url="http://www.liveyoursport.com/administration/customers",
                           callback=self.parse_tastypage)

    def parse_tastypage(self, response):
        sel = Selector(response)
        item = TestItem()
        item["Test"] = sel.xpath("//h1/text()").extract()
        yield item

Here is the HTML part:

<div class="content-row">
  <div class="special-header-title span_full">
    <h3><span class="blue-text">Sign </span>In</h3>
  </div>
</div>
<div class="content-row">
  <div class="form-section checkout-address-edit span_80" id="sign-in-form" >
    <form accept-charset="UTF-8" action="/employees/sign_in" class="new_employee" id="new_employee" method="post"><div style="margin:0;padding:0;display:inline"><input name="utf8" type="hidden" value="&#x2713;" /><input name="authenticity_token" type="hidden" value="HQYZa0hNZ2Y+UvtbIk9OxI48Hlsnt+MiYOeV9ql2yWo=" /></div>
      <div>
        <div class="form-row">
          <div class="form-col-1"><label for="employee_email">Email</label></div>
          <div class="form-col-2">
            <input id="employee_email" name="employee[email]" size="30" type="email" value="" />
          </div>
        </div>
        <div class="form-row">
          <div class="form-col-1"><label for="employee_password">Password</label></div>
          <div class="form-col-2">
            <input id="employee_password" name="employee[password]" size="30" type="password" />
          </div>
        </div>
      </div>
      <div class="form-row form-row-controls">
        <div class="form-col-1"></div>
        <div class="form-col-2">
          <input class="sign-in-button f-right" name="commit" type="submit" value="Sign in" />
        </div>
      </div>
</form>    <br>


  <a href="/employees/password/new">Forgot your password?</a><br />


  <a href="/employees/unlock/new">Didn&#x27;t receive unlock instructions?</a><br />


  </div>

1 answer:

Answer 0 (score: 1)

From the docs:

formxpath (string) - if given, the first form that matches the xpath will be used.

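But it seems your XPath matches a div rather than the form itself (in the HTML above, the form-row divs sit inside the form). Try it like this: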

return [FormRequest.from_response(response,
                    formdata={'employee[email]': 'xyz@abc.com', 'employee[password]': 'XYZ'},
                    formxpath='//form[@id="new_employee"]',
                    callback=self.after_login)]
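A quick way to see the difference between the two XPaths is to run both against a trimmed copy of the login markup. The sketch below uses the standard library's ElementTree, which supports only a subset of XPath and is not what Scrapy uses internally, so treat it as an illustration rather than a reproduction of Scrapy's behavior. SAMPLE, first_match and input_names are names made up for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the login markup (assumption: same nesting
# as the real page; "TOKEN" is a placeholder value).
SAMPLE = """
<div id="sign-in-form">
  <form action="/employees/sign_in" id="new_employee" method="post">
    <input name="authenticity_token" type="hidden" value="TOKEN" />
    <div class="form-row">
      <input name="employee[email]" type="email" value="" />
    </div>
    <div class="form-row">
      <input name="employee[password]" type="password" />
    </div>
  </form>
</div>
""".strip()

root = ET.fromstring(SAMPLE)


def first_match(xpath):
    """Return the first element matched by an ElementTree-style XPath."""
    return root.find(xpath)


def input_names(element):
    """List the name attribute of every <input> under the given element."""
    return [el.get("name") for el in element.iter("input")]


# The XPath from the question versus the XPath from the answer.
old = first_match(".//div[@class='form-row']")
new = first_match(".//form[@id='new_employee']")

print(old.tag, input_names(old))
print(new.tag, input_names(new))
```

On this sample the original expression selects a div that holds only the email input, while the corrected one selects the form with the token, email and password inputs.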

Also, if there is only one form element on the page, you do not need to define formxpath at all.
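To illustrate why selecting the whole form matters, here is a rough standard-library sketch of the general idea behind from_response: take the fields already present in the first form on the page (including the hidden authenticity_token), then let formdata override them. This is not Scrapy's implementation, which also handles selects, textareas, click simulation and more. FirstFormFields, build_login_payload and PAGE are hypothetical names, and the second form in PAGE is invented to show that only the first form is used.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# Trimmed login markup plus an invented second form (assumption).
PAGE = """
<form action="/employees/sign_in" id="new_employee" method="post">
  <input name="authenticity_token" type="hidden" value="TOKEN" />
  <input name="employee[email]" type="email" value="" />
  <input name="employee[password]" type="password" />
  <input name="commit" type="submit" value="Sign in" />
</form>
<form action="/search" id="other"><input name="q" value="" /></form>
"""


class FirstFormFields(HTMLParser):
    """Collect name/value pairs of inputs in the first <form> on the page."""

    def __init__(self):
        super().__init__()
        self.forms_seen = 0
        self.in_first_form = False
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.forms_seen += 1
            self.in_first_form = self.forms_seen == 1
        elif tag == "input" and self.in_first_form and attrs.get("name"):
            # Inputs without a value attribute are submitted as empty strings.
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_first_form = False


def build_login_payload(html, formdata):
    """Merge the first form's existing fields with caller-supplied overrides."""
    parser = FirstFormFields()
    parser.feed(html)
    payload = dict(parser.fields)
    payload.update(formdata)
    return payload


payload = build_login_payload(
    PAGE,
    {"employee[email]": "xyz@abc.com", "employee[password]": "XYZ"},
)
print(urlencode(payload))
```

The hidden token is carried over untouched while the email and password come from formdata, which is the reason for pointing from_response at the form rather than at one of its inner divs.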