I'm learning Scrapy and working through some small projects.
def parse(self, response):
    links = LinkExtractor().extract_links(response)
    for link in links:
        yield response.follow(link, self.parse)
    if some_condition:
        yield {'url': response.url}  # Store some data
So I open a page, extract all its links, and store some data if the page contains any. The problem: once I have processed, say, http://example.com/some_page, Scrapy skips it the next time it comes up. But my task is to process it again. I want to know that the page has already been processed, and in that case store some other data. It should work something like this:
def parse(self, response):
    if is_duplicate:
        yield {}  # Store some other data
    else:
        links = LinkExtractor().extract_links(response)
        for link in links:
            yield response.follow(link, self.parse)
        if some_condition:
            yield {'url': response.url}  # Store some data
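The behavior I'm after can be sketched in plain Python, without Scrapy (a minimal model; the names `process_page` and `PAGES` are made up for illustration): a set remembers processed URLs, and a repeat visit takes an "already seen" branch instead of being dropped.

```python
# Minimal model of the desired flow: repeat visits are detected,
# not skipped. PAGES stands in for fetched pages and their links.
visited = set()
PAGES = {
    'http://example.com/some_page': ['http://example.com/a',
                                     'http://example.com/b'],
    'http://example.com/a': [],
    'http://example.com/b': ['http://example.com/some_page'],  # links back
}

def process_page(url):
    """Return ('other', url) on a repeat visit, else ('data', url, links)."""
    if url in visited:
        return ('other', url)          # store some other data
    visited.add(url)
    return ('data', url, PAGES.get(url, []))

first = process_page('http://example.com/some_page')
second = process_page('http://example.com/some_page')
print(first[0])   # first visit is processed normally
print(second[0])  # repeat visit is detected, not skipped
```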
Answer (score: 1)
First, you need to keep track of the links you have visited; second, you have to tell Scrapy that you want to visit the same page repeatedly.
Change your code like this:
def __init__(self, *args, **kwargs):
    super(MySpider, self).__init__(*args, **kwargs)
    self.visited_links = set()

def parse(self, response):
    if response.url in self.visited_links:
        yield {}  # Store some other data
    else:
        self.visited_links.add(response.url)
        links = LinkExtractor().extract_links(response)
        for link in links:
            yield response.follow(link, self.parse, dont_filter=True)
        if some_condition:
            yield {'url': response.url}  # Store some data

The added constructor uses visited_links to keep track of the links you have already visited. (Here I assume your spider class is named MySpider; you did not share that part of the code.)
In parse, you first check whether the link has already been visited (its URL is in the visited_links set). If it has not, you add it to the set of visited links, and when yielding new requests (with response.follow) you pass dont_filter=True to instruct Scrapy not to filter out duplicate requests.
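One caveat worth adding: visited_links lives only in memory, so it resets on every crawl run. If "already processed" needs to survive restarts, the set can be persisted, for example saved when the spider closes and reloaded in __init__. A minimal sketch of that idea in plain Python (the file path and helper names here are my own, not part of the answer above):

```python
import json
import os
import tempfile

# Hypothetical state file; a real spider might make this configurable.
STATE = os.path.join(tempfile.gettempdir(), 'visited_links.json')

def load_visited(path=STATE):
    """Reload the persisted set of visited URLs, if any."""
    if os.path.exists(path):
        with open(path) as f:
            return set(json.load(f))
    return set()

def save_visited(visited, path=STATE):
    """Persist the visited URLs as a sorted JSON list."""
    with open(path, 'w') as f:
        json.dump(sorted(visited), f)

visited = load_visited()
visited.add('http://example.com/some_page')
save_visited(visited)
print('http://example.com/some_page' in load_visited())  # True
```

In a spider, load_visited would run in __init__ and save_visited in the spider's closed() hook, so duplicate detection spans crawl sessions.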