Question

I am trying to do an authenticated Scrapy login using InitSpider. For some reason, InitSpider always fails to log in. My code is similar to the answer in the following post:

Crawling LinkedIn while authenticated with Scrapy

The response I see in the log is:

2012-12-20 22:56:53-0500 [linked] DEBUG: Redirecting (302) to <GET https://example.com/> from <POST https://example.com/>

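Using the code from the post above, I have the same init_request, login, and check_login_response functions. From my log statements I can see that it reaches the login function, but it never seems to reach check_login_response.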

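When I reimplemented the code using BaseSpider, issuing the FormRequest from the parse function, I could log in without any problem. Is there a reason for this? Is there something else I should be doing? Why would I use InitSpider for a login that involves a redirect?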

[Edit]

# Imports as used in the linked answer (Scrapy 0.x module paths).
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest

class DemoSpider(InitSpider):
    name = 'linked'
    login_page = # Login URL
    start_urls = # All other urls

    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(response,
            formdata={'username': 'username', 'password': 'password'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are successfully logged in."""
        if "Sign Out" in response.body:
            self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
            # Now the crawling can begin.
            return self.initialized()
        else:
            self.log("\n\n\nFailed, Bad times :(\n\n\n")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse(self, response):
        self.log('got to the parse function')

The above is my spider code.

Answer 1

After struggling with this for a while, I figured it out and posted the solution on my blog:

http://tmblr.co/ZjkSZteCOTyH

Basically, I use BaseSpider and override the start_requests method to handle the login.
