I need to build a web-scraping application, but I have read that a website can block my IP if it receives too many requests.
My code starts by fetching the links from the DB; in each iteration it visits a link and scrapes its data. How can I set an interval between the iterations over row[i].url, for example 2 minutes per request? Please help! :)
Answer 0 (score: 2)
Maybe you should try something like setTimeout?
// assumes the usual setup for this kind of script: a MySQL
// connection `cn` plus the request and cheerio packages
const request = require('request');
const cheerio = require('cheerio');

cn.query('SELECT url FROM models', function (err, rows, fields) {
  if (err) throw err;

  let timeout = 2000; // 2 seconds; use 120000 for the 2 minutes you mention

  let doRequest = (it, row) => {
    // stagger each request by it * timeout milliseconds
    setTimeout(() => {
      request(row.url, (err, res, body) => {
        if (!err && res.statusCode === 200) {
          const $ = cheerio.load(body);
          // ... scrape the page with $ here ...
        }
      });
    }, it * timeout);
  };

  // schedule one delayed request per row
  for (let [it, row] of rows.entries()) {
    doRequest(it, row);
  }
});
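Note that this schedules all the timers up front: the loop finishes immediately and the requests then fire one timeout apart. If you would rather pause before starting each request inside a single loop, a minimal sketch using async/await could look like this (assuming Node 8+; sleep is a small helper built on setTimeout, not a library function):

// sketch only: sleep() is a hypothetical helper, not part of any library
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

async function scrapeSequentially(rows) {
  for (const row of rows) {
    request(row.url, (err, res, body) => {
      if (!err && res.statusCode === 200) {
        const $ = cheerio.load(body);
        // ... scrape the page with $ here ...
      }
    });
    await sleep(120000); // wait 2 minutes before starting the next request
  }
}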
Hope it helps.
Answer 1 (score: 0)
Use the async library, like this:
// assumes urlList is an array of URLs (see below) and that the
// async, request, and cheerio packages are required as above
const async = require('async');

let interval = 5000; // 5 seconds between requests

async.eachSeries(urlList, function (url, done) {
  // eachSeries waits for done() before moving to the next URL, so each
  // request starts `interval` ms after the previous one has finished
  setTimeout(function () {
    request(url, function (error, resp, body) {
      if (error) return done(error); // stop the series on error
      const $ = cheerio.load(body);
      // ... scrape the page with $ here ...
      done();
    });
  }, interval);
}, function (err) {
  if (err) console.error(err);
});
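Here urlList is just the array of URLs to visit; assuming the same query as in the first answer, it could be built like this:

cn.query('SELECT url FROM models', function (err, rows) {
  if (err) return console.error(err);
  const urlList = rows.map(row => row.url);
  // ... run the eachSeries snippet above on urlList ...
});

Because eachSeries processes one URL at a time, the delay here is a true pause between requests, rather than a set of staggered timers.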