Question

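I'm building a web crawler in Node.js and Electron.

Basically, the program takes a starting URL, crawls to a given depth, and reports where it found certain keywords.

The crawling itself works, but I can't figure out how to tell when it has finished. With a depth of 3-4 the program seems to run forever, and at lower depths the only real way to tell whether it is still crawling is to watch how much CPU and memory it is using.

Here is the function that does the crawling: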

function crawl(startingSite, depth) {
    if (depth < maxDepth) {
        getLinks(startingSite, function (sites) { //pulls all the links from a specific page and returns them as an array of strings
            for (var i = 0; i < sites.length; i++) { //for each string we got from the page
                findTarget(sites[i], depth); //find any of the keywords we want on the page, print out if so
                crawl(sites[i], depth + 1); //crawl all the pages on that page, and increase the depth
            }
        });
    }
}

My problem is that I can't figure out how to make this function report when it is done.

I tried something like this:

function crawl(startingSite, depth, callback) {
    if (depth < maxDepth) {
        getLinks(startingSite, function (sites) { //pulls all the links from a specific page and returns them as an array of strings
            for (var i = 0; i < sites.length; i++) { //for each string we got from the page
                findTarget(sites[i], depth); //find any of the keywords we want on the page, print out if so
                crawl(sites[i], depth + 1); //crawl all the pages on that page, and increase the depth
            }
        });
    }
    else
    {
        callback();
    }
}

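But obviously callback() gets called almost immediately, because the crawler quickly reaches the maximum depth on one branch and falls out of the if statement.

All I need is for this function to print something (for example with console.log) once all of its recursive instances have finished crawling and reached the maximum depth.

Any ideas?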

Answer 1

You can use promises:

// Wrap the callback-based getLinks in a promise so it can be awaited.
const links = (start) =>
    new Promise(res => getLinks(start, res));

async function crawl(startingSite, depth) {
    if (depth >= maxDepth)
        return;

    const sites = await links(startingSite);
    for (const site of sites) {
        // Each page is fully processed before the next one starts.
        await findTarget(site, depth);
        await crawl(site, depth + 1);
    }
}

Then just call it like this:

  crawl(something, 0).then(() => console.log("done"));
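
Note that the loop above awaits each page in turn, so the crawl is sequential, whereas the code in the question starts all child crawls at once. If you want to keep that concurrency and still know when everything has finished, one option is Promise.all. The following is only a sketch: fakeWeb, visited, and the stubbed getLinks and findTarget are hypothetical stand-ins so that it runs on its own, and the real versions would do network requests and keyword matching.

```javascript
// Hypothetical stand-ins for the question's helpers, so the sketch is self-contained.
const maxDepth = 2;
const fakeWeb = {
    a: ["b", "c"],
    b: ["d"],
    c: [],
    d: ["a"],
};
const visited = [];

// Callback-style like the question's getLinks; answers asynchronously.
function getLinks(site, callback) {
    setTimeout(() => callback(fakeWeb[site] || []), 10);
}

// Stub: records the page instead of searching it for keywords.
function findTarget(site, depth) {
    visited.push(`${depth}:${site}`);
}

const links = (start) => new Promise((res) => getLinks(start, res));

async function crawl(startingSite, depth) {
    if (depth >= maxDepth) return;

    const sites = await links(startingSite);
    // Start every child crawl right away, then wait until all of them settle.
    await Promise.all(
        sites.map((site) => {
            findTarget(site, depth);
            return crawl(site, depth + 1);
        })
    );
}

// Resolves only after every recursive crawl has returned.
const finished = crawl("a", 0).then(() => {
    console.log("done, visited:", visited);
    return visited;
});
```

Be aware that this fires off every request on a page at once, so on a real site with many links you would probably want to cap the number of requests in flight.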

Reporting completion of a recursive function in Node.js
