Scrapy FormRequest form_response

Asked: 2016-02-22 04:20:45

Tags: python scrapy web-crawler

I am working on a Scrapy spider. I have found several examples online (including on Stack Overflow) of how to handle sites that require a login, but I have hit a problem I haven't seen anywhere else. When I run the included code, the crawler starts, but as soon as it tries to use the FormRequest.form_response method it fails with the following error:

2016-02-22 04:07:11 [schwab] DEBUG: init_request
2016-02-22 04:07:11 [scrapy] INFO: Spider opened
2016-02-22 04:07:11 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-02-22 04:07:11 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-02-22 04:07:12 [scrapy] DEBUG: Crawled (200) <GET https://www.****.com> (referer: None)
2016-02-22 04:07:12 [schwab] DEBUG: logging in...
2016-02-22 04:07:12 [schwab] DEBUG: <200 https://www.****.com>
2016-02-22 04:07:12 [scrapy] ERROR: Spider error processing <GET https://www.****.com> (referer: None)
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/pi/Projects/savingsScript/savingsScript/spiders/example.py", line 39, in login
    return scrapy.FormRequest.form_response(
AttributeError: type object 'FormRequest' has no attribute 'form_response'
2016-02-22 04:07:12 [scrapy] INFO: Closing spider (finished)
2016-02-22 04:07:12 [scrapy] INFO: Dumping Scrapy stats:

One other note: when I look at the FormRequest.form_response function in the scrapy http lib, it appears to list an initial parameter before the 'response' argument I am supplying. Is the problem that my arguments don't match the method's function signature? Any insight would be appreciated.

The function signature in the lib appears to be:

def from_response(cls, response, formname=None, formid=None, formnumber=0, formdata=None,
                  clickdata=None, dont_click=False, formxpath=None, formcss=None, **kwargs):
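(Side note on that signature: the leading `cls` parameter is filled in automatically by Python because `from_response` is defined as a classmethod, so the caller only passes `response` onward. A toy illustration, using a made-up class purely for demonstration:)

```python
class Request:
    @classmethod
    def from_response(cls, response, formname=None, **kwargs):
        # cls is bound automatically; callers pass only `response` onward
        return f"{cls.__name__} built from {response} (form={formname})"

# No explicit cls argument is needed at the call site:
print(Request.from_response("resp", formname="SignonForm"))
# prints: Request built from resp (form=SignonForm)
```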

The current state of the scraper code that produces this error is below:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders.init import InitSpider
from scrapy.http import Request
from scrapy.http import FormRequest
from scrapy.spiders import Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class ExampleSpider(InitSpider):
    name = "****"
    allowed_domains = ["****.com"]
    login_page = 'https://www.****.com'
    start_urls = (
        'https://www.****.com/',
    )

    login_user = "****"
    login_pass = "****"

    rules = (
              Rule(SgmlLinkExtractor(allow=r'-\w+.html$'),
                   callback='parse_item', follow=True),
            )

    def parse(self,response):
      self.log('testing')
      pass

    def init_request(self):
      self.log('init_request')
      return Request(url=self.login_page, callback=self.login)

    def login(self, response):
      self.log('logging in...')
      self.log(response)
      return scrapy.FormRequest.form_response(
                                         response,
                                         formName='SignonForm',
                                         formdata={'SignonAccountNumber': self.login_user, 'SignonPassword': self.login_pass},
                                         callback=self.check_login_response
                                        )

    def check_login_response(self, response):
      self.log('check_login_response')
      if "<li class=\"logout\">" in response.body:
        self.log('signed in correctly')
        self.initialized()
      else:
        self.log('still not signed in...')

    def parse_item(self, response):
      console.log('parse_item')
      i['url'] = response.url
      console.log('response.url:' + response.url)
      return i

1 Answer:

Answer 0: (score: 2)

It's from_response, not form_response! Tricky naming.