Scrapy / Python - URL should be requested multiple times (via a loop) but is only hit once (dont_filter not working)

Date: 2014-11-13 16:59:40

Tags: python python-2.7 scrapy web-crawler

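I'm not sure how Scrapy works internally. I've built a spider that almost works. I have a list of dicts (config.products). Each dict contains the data for a POST that has to be sent from the function initial_search. So initial_search has to be called several times, but at the moment the POST sent by initial_search is made only once and then the spider closes. I added dont_filter=True, but that changed nothing. Does anyone know what is going wrong?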

def parse(self, response):
    return scrapy.FormRequest.from_response(
        response,
        meta={'product':config.products[0]},
        callback=self.initial_search
    )


def initial_search(self, response):
    config.actualProduct = response.meta['product']
    if config.products.index(config.actualProduct) == 0:
        config.savedResponse = response

    # The second time, the request is not made. (even with dont_filter=True)

    return scrapy.FormRequest(
        url=response.url,
        formdata=dictArgs,
        meta={'dictArgs': config.actualProduct},
        dont_filter = True,
        callback=self.other_function
    )

def other_function(self, response):
    return scrapy.FormRequest(
        url=response.url,
        formdata=dictArgs,
        meta={'dictArgs': config.actualProduct},
        callback=self.other_function2
    )

def other_function2(self, response):
    nextPosition = config.products.index(config.actualProduct) + 1

    # Checking if we have another dict to post

    if nextPosition < len(config.products):
        config.savedResponse.meta['product'] = config.products[nextPosition]
        self.initial_search(config.savedResponse)

Any help would be greatly appreciated.

1 answer:

Answer 0 (score: 0)

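In fact, you are not calling initial_search correctly from other_function2. Calling self.initial_search(...) directly only builds a request object and then throws it away; Scrapy schedules a request only when a callback returns or yields it. This is how it should look: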

def other_function2(self, response):
    nextPosition = config.products.index(config.actualProduct) + 1

    # Check whether there is another dict to post
    if nextPosition < len(config.products):
        # Yield the request so that Scrapy schedules it; calling
        # self.initial_search(...) directly discards the request it returns.
        yield scrapy.Request(
            config.savedResponse.url,  # Request expects a URL string, not a Response
            meta={'product': config.products[nextPosition]},
            dont_filter=True,  # the same URL is requested once per product
            callback=self.initial_search
        )
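
To see why the direct call in the question stops the crawl, here is a minimal sketch in plain Python that does not use Scrapy at all. The Request class, the run function and the PRODUCTS list are hypothetical stand-ins written for this illustration. They are not Scrapy's API, and the real engine is far more involved. The only thing the toy model is meant to show is that an engine can follow up only on requests a callback hands back to it.

```python
from collections import deque


class Request(object):
    """Hypothetical stand-in for a request: a URL, a callback and some metadata."""

    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}


def run(start_request):
    """Toy engine: only requests that a callback returns or yields are queued."""
    queue = deque([start_request])
    visited = []
    while queue:
        request = queue.popleft()
        visited.append((request.url, request.meta.get('product')))
        # For simplicity the request doubles as the "response" passed to the callback.
        result = request.callback(request)
        if result is None:
            continue
        if isinstance(result, Request):
            result = [result]
        queue.extend(result)
    return visited


PRODUCTS = ['a', 'b', 'c']  # hypothetical stand-in for config.products


def search_discarding(response):
    """Mirrors the question: the follow-up request is built but never returned."""
    position = PRODUCTS.index(response.meta['product']) + 1
    if position < len(PRODUCTS):
        Request(response.url, search_discarding, {'product': PRODUCTS[position]})


def search_yielding(response):
    """Mirrors the fix: the follow-up request is yielded back to the engine."""
    position = PRODUCTS.index(response.meta['product']) + 1
    if position < len(PRODUCTS):
        yield Request(response.url, search_yielding, {'product': PRODUCTS[position]})


discarded = run(Request('http://example.com/search', search_discarding, {'product': 'a'}))
yielded = run(Request('http://example.com/search', search_yielding, {'product': 'a'}))
print(len(discarded), len(yielded))
```

In this sketch, discarded ends up with a single entry because the engine never sees the second request, while yielded contains one entry per product. The same reasoning applies to the spider above: whatever other_function2 creates has to be returned or yielded for the crawl to continue.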