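Setting sticky cookies in scrapy

Time: 2012-08-14 09:44:14

Tags: python cookies scrapy

The website I am scraping uses JavaScript to set a cookie and checks it on the backend to make sure JS is enabled. Extracting the cookie from the HTML code is simple, but setting it seems to be a problem in scrapy. So my code is: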

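import re

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request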

class TestSpider(InitSpider):
    ...
    rules = (Rule(SgmlLinkExtractor(allow=(r'products/./index\.html', )), callback='parse_page'),)

    def init_request(self):
        return Request(url=self.init_url, callback=self.parse_js)

    def parse_js(self, response):
        # Pull the cookie name and value out of the inline setCookie('name', 'value', ...) call
        match = re.search(r"setCookie\('(.+?)',\s*?'(.+?)',", response.body, re.M)
        if match:
            cookie = match.group(1)
            value = match.group(2)
        else:
            raise BaseException("Did not find the cookie", response.body)
        return Request(url=self.test_page, callback=self.check_test_page, cookies={cookie:value})

    def check_test_page(self, response):
        if 'Welcome' in response.body:
            self.initialized()

    def parse_page(self, response):
        pass  # scraping....

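I can see that the content is available in check_test_page; the cookie works perfectly. But it never gets to parse_page, because CrawlSpider without the right cookie doesn't see any links. Is there a way to set a cookie for the duration of the scraping session? Or do I have to use BaseSpider and add the cookie to every request manually?

A less desirable alternative would be to set the cookie (the value never seems to change) through the scrapy configuration files somehow. Is that possible?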
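
For reference, the extraction step in parse_js can be checked on its own, outside scrapy. The following is only a minimal sketch: SAMPLE_BODY, the cookie name js_enabled, its value, and the helper extract_cookie are all made up for illustration, since the real page markup is not shown here.

```python
import re

# Hypothetical page body; the real site's markup is not shown in the question.
SAMPLE_BODY = """
<script type="text/javascript">
    setCookie('js_enabled', 'abc123', 365);
</script>
"""

# Same pattern as in parse_js, written as a raw string.
COOKIE_RE = re.compile(r"setCookie\('(.+?)',\s*?'(.+?)',", re.M)


def extract_cookie(body):
    """Return {name: value} for the first setCookie call, or raise ValueError."""
    match = COOKIE_RE.search(body)
    if match is None:
        raise ValueError("Did not find the cookie")
    return {match.group(1): match.group(2)}


cookies = extract_cookie(SAMPLE_BODY)
```

The resulting dict has the shape that parse_js passes as the cookies argument of Request.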

2 answers:

Answer 0: (score: 1)

I haven't used InitSpider before.

Looking at the code in scrapy.contrib.spiders.init.InitSpider, I see:

def initialized(self, response=None):
    """This method must be set as the callback of your last initialization
    request. See self.init_request() docstring for more info.
    """
    self._init_complete = True
    reqs = self._postinit_reqs[:]
    del self._postinit_reqs
    return reqs

def init_request(self):
    """This function should return one initialization request, with the
    self.initialized method as callback. When the self.initialized method
    is called this spider is considered initialized. If you need to perform
    several requests for initializing your spider, you can do so by using
    different callbacks. The only requirement is that the final callback
    (of the last initialization request) must be self.initialized. 

    The default implementation calls self.initialized immediately, and
    means that no initialization is needed. This method should be
    overridden only when you need to perform requests to initialize your
    spider
    """
    return self.initialized()

You wrote:

  

I can see that the content is available in check_test_page; the cookie works perfectly. But it never gets to parse_page, because CrawlSpider without the right cookie doesn't see any links.

I think parse_page is not being called because you never make a request with self.initialized as its callback.

I think this should work:

def check_test_page(self, response):
    if 'Welcome' in response.body:
        return self.initialized()
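
To illustrate why the return matters, here is a toy stand-in that does not use scrapy at all. ToyInitSpider copies only the body of the initialized() method quoted above, and run_callback is a hypothetical mini-engine that schedules whatever a callback returns. The names and the placeholder request strings are assumptions for the sketch, not scrapy's real machinery.

```python
class ToyInitSpider:
    """Toy stand-in (not scrapy) mimicking the quoted initialized() logic."""

    def __init__(self, postinit_reqs):
        self._init_complete = False
        # Requests held back until initialization is finished
        self._postinit_reqs = list(postinit_reqs)

    def initialized(self, response=None):
        # Same body as the InitSpider.initialized() quoted above
        self._init_complete = True
        reqs = self._postinit_reqs[:]
        del self._postinit_reqs
        return reqs

    def check_without_return(self, body):
        # Mirrors the question: the result of initialized() is discarded
        if 'Welcome' in body:
            self.initialized()

    def check_with_return(self, body):
        # Mirrors the suggested fix: the held-back requests are handed back
        if 'Welcome' in body:
            return self.initialized()


def run_callback(callback, body):
    """Hypothetical mini-engine: collects whatever the callback returns."""
    result = callback(body)
    return list(result) if result is not None else []


pending = ['start_request_1', 'start_request_2']  # placeholder requests
dropped = run_callback(ToyInitSpider(pending).check_without_return, 'Welcome back')
scheduled = run_callback(ToyInitSpider(pending).check_with_return, 'Welcome back')
```

In this toy version, dropped ends up empty while scheduled contains both placeholder requests, which is the behaviour I would expect from the real spider as well.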

Answer 1: (score: 0)

It turns out that InitSpider is a BaseSpider. So it looks like 1) there is no way to use CrawlSpider in this case, and 2) there is no way to set sticky cookies.