Question

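I have to scrape a website, so I'm using Scrapy, but I need to pass a cookie to get past the first page (a kind of login page where you choose your location).

I read online that you need a base Spider (not a CrawlSpider) to do this, but I need a CrawlSpider for the crawling itself, so what should I do?

Should I start with a base Spider and then launch my CrawlSpider? I don't know whether the cookie would be passed between them, or how to do that. How do you launch one spider from another?

How do I handle cookies? I tried this: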

def start_requests(self):
   yield Request(url='http://www.auchandrive.fr/drive/St-Quentin-985/', cookies={'auchanCook': '"985|"'})

but it didn't work.

The answer to my question should be here, but the person answering is really evasive and I can't work out what to do.

Answer 1

First, you need to enable cookies in your settings.py file:

COOKIES_ENABLED = True
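To check whether the cookie is actually being sent, it can help to turn on Scrapy's cookie logging while debugging. The fragment below is a minimal sketch of a settings.py, assuming a standard Scrapy project; COOKIES_DEBUG is a built-in Scrapy setting, and enabling it here is a suggestion that is not part of the original answer.

```python
# settings.py (sketch, assuming a standard Scrapy project layout)

# Cookies are enabled by default in Scrapy; set explicitly here for clarity.
COOKIES_ENABLED = True

# Assumption for debugging only: log the cookies sent in requests and
# the cookies received in responses, so you can see whether auchanCook
# really goes out with the first request.
COOKIES_DEBUG = True
```

With COOKIES_DEBUG on, the console output of the crawl should include the cookies exchanged with the site; turn it off again once the spider works.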

Here is the spider code I used for testing, for your reference. I ran it and it passed.

import logging

from scrapy.http import Request
from scrapy.spiders import CrawlSpider


class Stackoverflow23370004Spider(CrawlSpider):
    name = 'auchandrive.fr'
    allowed_domains = ["auchandrive.fr"]

    target_url = "http://www.auchandrive.fr/drive/St-Quentin-985/"

    def start_requests(self):
        # Send the shop-selection cookie with the very first request.
        yield Request(self.target_url, cookies={'auchanCook': "985|"}, callback=self.parse_page)

    def parse_page(self, response):
        # Check whether we ended up on the shop page.
        if 'St-Quentin-985' in response.url:
            self.log("Passed : %r" % response.url, logging.DEBUG)
        else:
            self.log("Failed : %r" % response.url, logging.DEBUG)

You can run the following command to test it and watch the console output:

scrapy crawl auchandrive.fr

Answer 2

I noticed that in your snippet you are using cookies={'auchanCook': '"985|"'} rather than cookies={'auchanCook': "985|"}.

This should get you started:
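The difference between those two values is easy to miss, so here is a small standalone sketch of it in plain Python. The cookie_header helper is hypothetical and only for illustration; it simply joins a name and a value, which is not necessarily how Scrapy builds the header internally.

```python
# Value from the question: the double quotes are part of the string itself.
quoted_value = '"985|"'

# Value from the answers: just the four characters 985|.
plain_value = "985|"


def cookie_header(name, value):
    """Hypothetical helper: join a cookie name and value as name=value."""
    return "{}={}".format(name, value)


print(len(quoted_value), len(plain_value))        # 6 4
print(cookie_header("auchanCook", quoted_value))  # auchanCook="985|"
print(cookie_header("auchanCook", plain_value))   # auchanCook=985|
```

The first form carries two extra quote characters in the value, so the site may not recognise it as shop 985.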

from scrapy.http import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class AuchanDriveSpider(CrawlSpider):
    name = 'auchandrive'
    allowed_domains = ["auchandrive.fr"]

    # pseudo-start_url
    begin_url = "http://www.auchandrive.fr/"

    # start URL used as shop selection
    select_shop_url = "http://www.auchandrive.fr/drive/St-Quentin-985/"

    rules = (
        # Follow links in the header menu (no callback, so they are only followed).
        Rule(LinkExtractor(restrict_xpaths=('//ul[@class="header-menu"]',))),
        # Send links found in the product tiles to parse_product.
        Rule(LinkExtractor(restrict_xpaths=('//div[contains(@class, "vignette-content")]',)),
             callback='parse_product'),
    )

    def start_requests(self):
        yield Request(self.begin_url, callback=self.select_shop)

    def select_shop(self, response):
        return Request(url=self.select_shop_url, cookies={'auchanCook': "985|"})

    def parse_product(self, response):
        self.log("parse_product: %r" % response.url)

Pagination might be tricky.
