How do I handle duplicates in Scrapy?

Posted: 2017-10-28 21:01:58

Tags: python-3.x scrapy

I am learning Scrapy and have a small project.

def parse(self, response):
    links = LinkExtractor().extract_links(response)
    for link in links:
        yield response.follow(link, self.parse)

    if (some_condition):
        yield {'url': response.url}  # Store some data

So I open a page, collect all of its links, and store some data if the page has any. The problem is that once I have processed, say, http://example.com/some_page, Scrapy skips that page the next time it comes up. My task is to process it again on that later visit: I want to detect that the page has already been processed and, in that case, store some other data. It should look something like this:

def parse(self, response):
    if (is_duplicate):
        yield {}  # Store some other data
    else:
        links = LinkExtractor().extract_links(response)
        for link in links:
            yield response.follow(link, self.parse)

        if (some_condition):
            yield {'url': response.url}  # Store some data

1 Answer:

Answer 0 (score: 1)

First, you need to keep track of the links you have already visited; second, you have to tell Scrapy that you want to visit the same pages more than once.
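For reference, the per-request switch behind that second point is the dont_filter argument of scrapy.Request, and response.follow accepts the same keyword arguments and forwards them to the request it builds. A minimal sketch with a hypothetical URL, only to show the flag in isolation:

import scrapy

# dont_filter=True tells the scheduler not to drop this request as a
# duplicate of an earlier request for the same URL.
request = scrapy.Request("http://example.com/some_page", dont_filter=True)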

Change your code like this:

def __init__(self, *args, **kwargs):
    super(MySpider, self).__init__(*args, **kwargs)
    self.visited_links = set()

def parse(self, response):
    if response.url in self.visited_links:
        yield {}  # Store some other data
    else:
        self.visited_links.add(response.url)
        links = LinkExtractor().extract_links(response)
        for link in links:
            yield response.follow(link, self.parse, dont_filter=True)

        if (some_condition):
            yield {'url': response.url}  # Store some data

In the added constructor, visited_links is used to keep track of the links you have already visited. (Here I assume your spider class is named MySpider; you did not share that part of the code.) In parse, you first check whether the link has already been visited (its URL is in the visited_links set). If it has not, you add it to the visited-links set, and when yielding new Requests (with response.follow) you pass dont_filter=True to instruct Scrapy not to filter out duplicate requests.
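As a design note, an alternative to passing dont_filter=True on every request is to switch off Scrapy's built-in request deduplication for the whole spider via the DUPEFILTER_CLASS setting (scrapy.dupefilters.BaseDupeFilter does no filtering). Below is a sketch of the complete spider under that assumption; the spider name, start URL, and the stand-in for some_condition are hypothetical, not taken from the question:

import scrapy
from scrapy.linkextractors import LinkExtractor


class MySpider(scrapy.Spider):
    name = "my_spider"                    # hypothetical spider name
    start_urls = ["http://example.com/"]  # hypothetical start URL

    # Disable request deduplication for this spider only, so URLs that were
    # already crawled are scheduled again and reach parse() a second time.
    custom_settings = {
        "DUPEFILTER_CLASS": "scrapy.dupefilters.BaseDupeFilter",
    }

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.visited_links = set()

    def parse(self, response):
        if response.url in self.visited_links:
            yield {}  # Store some other data for a page seen before
        else:
            self.visited_links.add(response.url)
            for link in LinkExtractor().extract_links(response):
                # dont_filter is not needed here because deduplication is
                # already disabled through custom_settings above.
                yield response.follow(link, self.parse)
            if "some_page" in response.url:  # placeholder for some_condition
                yield {'url': response.url}  # Store some data

Which variant fits better depends on whether any part of the crawl should still be deduplicated: dont_filter=True keeps the default filter in place for everything else, while changing DUPEFILTER_CLASS turns it off for the entire spider.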