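I'm working on a Scrapy spider. I've found some examples online (including on Stack Overflow) of how to handle sites that require a login, but I've run into a problem I haven't seen anywhere else. When I run the code included below, the crawler starts, but when it tries to use the FormRequest.form_response method it fails with the following error: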
2016-02-22 04:07:11 [schwab] DEBUG: init_request
2016-02-22 04:07:11 [scrapy] INFO: Spider opened
2016-02-22 04:07:11 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-02-22 04:07:11 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-02-22 04:07:12 [scrapy] DEBUG: Crawled (200) <GET https://www.****.com> (referer: None)
2016-02-22 04:07:12 [schwab] DEBUG: logging in...
2016-02-22 04:07:12 [schwab] DEBUG: <200 https://www.****.com>
2016-02-22 04:07:12 [scrapy] ERROR: Spider error processing <GET https://www.****.com> (referer: None)
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/pi/Projects/savingsScript/savingsScript/spiders/example.py", line 39, in login
    return scrapy.FormRequest.form_response(
AttributeError: type object 'FormRequest' has no attribute 'form_response'
2016-02-22 04:07:12 [scrapy] INFO: Closing spider (finished)
2016-02-22 04:07:12 [scrapy] INFO: Dumping Scrapy stats:
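One more note: when I look at the function FormRequest.form_response in the Scrapy http lib, it seems to list an initial parameter before the response argument I'm providing. Is the problem that my arguments don't match the method's function signature? Any insight would be much appreciated.
The function signature in the lib appears to be: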
def from_response(cls, response, formname=None, formid=None, formnumber=0, formdata=None,
clickdata=None, dont_click=False, formxpath=None, formcss=None, **kwargs):
The current state of the spider code that produces this error is as follows:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders.init import InitSpider
from scrapy.http import Request
from scrapy.http import FormRequest
from scrapy.spiders import Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class ExampleSpider(InitSpider):
    name = "****"
    allowed_domains = ["****.com"]
    login_page = 'https://www.****.com'
    start_urls = (
        'https://www.****.com/',
    )
    login_user = "****"
    login_pass = "****"
    rules = (
        Rule(SgmlLinkExtractor(allow=r'-\w+.html$'),
             callback='parse_item', follow=True),
    )

    def parse(self,response):
        self.log('testing')
        pass

    def init_request(self):
        self.log('init_request')
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        self.log('logging in...')
        self.log(response)
        return scrapy.FormRequest.form_response(
            response,
            formName='SignonForm',
            formdata={'SignonAccountNumber': self.login_user, 'SignonPassword': self.login_pass},
            callback=self.check_login_response
        )

    def check_login_response(self, response):
        self.log('check_login_response')
        if "<li class=\"logout\">" in response.body:
            self.log('signed in correctly')
            self.initialized()
        else:
            self.log('still not signed in...')

    def parse_item(self, response):
        console.log('parse_item')
        i['url'] = response.url
        console.log('response.url:' + response.url)
        return i
Answer 0 (score: 2)
It's from_response, not form_response! Confusing naming.