I want to use Scrapy to log in to a website and then request another URL. So far I have installed Scrapy and written this script:
from scrapy import log
from scrapy.spider import BaseSpider
from scrapy.http import FormRequest

class LoginSpider2(BaseSpider):
    name = 'github_login'
    start_urls = ['https://github.com/login']

    def parse(self, response):
        # Fill in and submit the login form found on the page
        return [FormRequest.from_response(response,
                                          formdata={'login': 'username', 'password': 'password'},
                                          callback=self.after_login)]

    def after_login(self, response):
        if "authentication failed" in response.body:
            self.log("Login failed", level=log.ERROR)
        else:
            self.log("Login succeeded")
After launching this script, I got the "Login succeeded" log. Then I added another URL, but it doesn't work. To do that, I replaced:
start_urls = ['https://github.com/login']
with
start_urls = ['https://github.com/login', 'https://github.com/MyCompany/MyPrivateRepo']
But I got these errors:
2013-06-11 22:23:40+0200 [scrapy] DEBUG: Enabled item pipelines:
Traceback (most recent call last):
File "/usr/local/bin/scrapy", line 4, in <module>
execute()
File "/Library/Python/2.7/site-packages/scrapy/cmdline.py", line 131, in execute
_run_print_help(parser, _run_command, cmd, args, opts)
File "/Library/Python/2.7/site-packages/scrapy/cmdline.py", line 76, in _run_print_help
func(*a, **kw)
File "/Library/Python/2.7/site-packages/scrapy/cmdline.py", line 138, in _run_command
cmd.run(args, opts)
File "/Library/Python/2.7/site-packages/scrapy/commands/crawl.py", line 43, in run
spider = self.crawler.spiders.create(spname, **opts.spargs)
File "/Library/Python/2.7/site-packages/scrapy/spidermanager.py", line 43, in create
raise KeyError("Spider not found: %s" % spider_name)
What am I doing wrong? I searched on Stack Overflow but couldn't find the right answer.
Thank you
Answer 0 (score: 1)
Your error says that Scrapy can't find the spider. Did you create it inside your project's spiders folder?
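For reference, this is the kind of layout "scrapy startproject" generates; the project and module names here are illustrative:

myproject/
    scrapy.cfg
    myproject/
        __init__.py
        items.py
        settings.py
        spiders/
            __init__.py
            github_login_spider.py    # module containing LoginSpider2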
In any case, once you get it running you will hit a second problem: the default callback for start_urls requests is self.parse, and that will fail for the repo page (there is no login form there). The requests may run in parallel, so by the time it visits the private repo it will error out. :P
You should keep only the login URL in start_urls, and return a new Request from the after_login method once the login has succeeded.
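A minimal sketch of that approach, reusing the spider from the question; the parse_repo callback name is my own placeholder:

from scrapy import log
from scrapy.spider import BaseSpider
from scrapy.http import FormRequest, Request

class LoginSpider2(BaseSpider):
    name = 'github_login'
    start_urls = ['https://github.com/login']  # only the login URL stays here

    def parse(self, response):
        return [FormRequest.from_response(response,
                                          formdata={'login': 'username', 'password': 'password'},
                                          callback=self.after_login)]

    def after_login(self, response):
        if "authentication failed" in response.body:
            self.log("Login failed", level=log.ERROR)
        else:
            # Session cookies from the login are kept automatically,
            # so the private page can now be fetched with a plain Request
            return Request('https://github.com/MyCompany/MyPrivateRepo',
                           callback=self.parse_repo)

    def parse_repo(self, response):
        # The private repo page lands here; scrape it as needed
        self.log("Visited %s" % response.url)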
Answer 1 (score: 0)
Is the spider's name attribute still set correctly? An incorrect or missing name usually causes errors like this one.
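A quick way to check, using the spider from the question; the name must match what you pass on the command line:

from scrapy.spider import BaseSpider

class LoginSpider2(BaseSpider):
    # "scrapy crawl github_login" looks this value up; a missing or
    # misspelled name raises "Spider not found", as in your traceback
    name = 'github_login'
    start_urls = ['https://github.com/login']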