Scrapy: merging chained requests into one

Time: 2018-06-11 05:35:26

Tags: python scrapy scrapy-spider

I have a scenario where I am going through a shop, crawling 10 pages. Whenever I find an item I want, I add it to the basket.

At the end I want to check out. The problem is that with the Scrapy chaining, the checkout runs once for every item I put in the basket.

How can I merge the chained requests into one, so that after the 10 items have been added to the basket, checkout is called only once?

def start_requests(self):
    params = getShopList()
    for param in params:
        # carry param in meta so the later callbacks can reuse it
        yield scrapy.FormRequest('https://foo.bar/shop', callback=self.addToBasket,
                                 method='POST', formdata=param,
                                 meta={'param': param})

def addToBasket(self, response):
    param = response.meta['param']
    # this runs once per item, so checkoutBasket also runs once per item
    yield scrapy.FormRequest('https://foo.bar/addToBasket', callback=self.checkoutBasket,
                             method='POST', formdata=param,
                             meta={'param': param})

def checkoutBasket(self, response):
    param = response.meta['param']
    yield scrapy.FormRequest('https://foo.bar/checkout', callback=self.final,
                             method='POST', formdata=param)

def final(self, response):
    print("Success, you have purchased 59 items")

Edit:

I tried making the request in the closed event, but the request is never issued and the callback never runs:

def closed(self, reason):
    if reason == "finished":
        print("spider finished")
        return scrapy.Request('https://www.google.com', callback=self.finalmethod)
    print("Spider closed but not finished.")

def finalmethod(self, response):
    print("finalized")

2 Answers:

Answer 0 (score: 0)

I think you can do the checkout manually when the spider finishes:

import requests  # the plain requests HTTP client, separate from Scrapy

def closed(self, reason):
    if reason == "finished":
        # checkout_url and param must be stored where the spider can reach them
        return requests.post(checkout_url, data=param)
    print("Spider closed but not finished.")

See closed in the Scrapy docs.

Update:

import requests
import scrapy


class MySpider(scrapy.Spider):
    name = 'whatever'

    def start_requests(self):
        params = getShopList()
        for param in params:
            # carry param in meta so addToBasket can reuse it
            yield scrapy.FormRequest('https://foo.bar/shop', callback=self.addToBasket,
                                     method='POST', formdata=param,
                                     meta={'param': param})

    def addToBasket(self, response):
        param = response.meta['param']
        yield scrapy.FormRequest('https://foo.bar/addToBasket',
                                 method='POST', formdata=param)

    def closed(self, reason):
        # runs once, after the whole crawl has finished
        if reason == "finished":
            return requests.post('https://foo.bar/checkout')  # add any form data the checkout needs
        print("Spider closed but not finished.")

Answer 1 (score: 0)

I solved it by using Scrapy signals and the spider_idle call. From the docs:

Sent when a spider has gone idle, which means the spider has no further:

• requests waiting to be downloaded
• requests scheduled
• items being processed in the item pipeline

https://doc.scrapy.org/en/latest/topics/signals.html

import scrapy
from scrapy import signals


class MySpider(scrapy.Spider):
    name = 'whatever'

    def start_requests(self):
        self.crawler.signals.connect(self.spider_idle, signals.spider_idle)  ## notice this
        params = getShopList()
        for param in params:
            yield scrapy.FormRequest('https://foo.bar/shop', callback=self.addToBasket,
                                     method='POST', formdata=param,
                                     meta={'param': param})

    def addToBasket(self, response):
        param = response.meta['param']
        yield scrapy.FormRequest('https://foo.bar/addToBasket',
                                 method='POST', formdata=param)

    def spider_idle(self, spider):  ## called once all scheduled requests are finished
        # disconnect first, otherwise this handler fires again after the
        # checkout response and schedules checkout in an endless loop
        self.crawler.signals.disconnect(self.spider_idle, signals.spider_idle)
        req = scrapy.Request('https://foo.bar/checkout', callback=self.checkoutFinished)
        self.crawler.engine.crawl(req, spider)

    def checkoutFinished(self, response):
        print("Checkout finished")