Break out of `request` in a `forEach` loop to improve efficiency

Date: 2018-05-01 22:03:48

Tags: javascript node.js asynchronous

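I'm building a simple web scraper to automate a newsletter, which means I only need to scrape a set number of pages. In this instance it's not a big deal, because the script will only scrape 3 extra pages. But in a different case this would be extremely inefficient.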

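So my question is: is there a way to stop the execution of request() in this forEach loop?

Or do I need to change my approach and scrape the pages one by one, as outlined in this guide?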

Script

'use strict';
var request = require('request');
var cheerio = require('cheerio');
var BASEURL = 'https://jobsite.procore.com';

scrape(BASEURL, getMeta);

function scrape(url, callback) {
  var pages = [];
  request(url, function(error, response, body) {
    if(!error && response.statusCode == 200) {

      var $ = cheerio.load(body);

      $('.left-sidebar .article-title').each(function(index) {
        var link = $(this).find('a').attr('href');
        pages[index] = BASEURL + link;
      });
      callback(pages, log);
    }
  });
}

function getMeta(pages, callback) {
  var meta = [];
  // using forEach's index does not work, it will loop through the array before the first request can execute
  var i = 0;
  // using a for loop does not work here
  pages.forEach(function(url) {
    request(url, function(error, response, body) {
      if(error) {
        console.log('Error: ' + error);
      }

      var $ = cheerio.load(body);

      var desc = $('meta[name="description"]').attr('content');
      meta[i] = desc.trim();

      i++;

      // Limit
      if (i == 6) callback(meta);
      console.log(i);
    });
  });
}

function log(arr) {
  console.log(arr);
}

Output

$ node crawl.js 
1
2
3
4
5
6
[ 'Find out why fall protection (or lack thereof) lands on the Occupational Safety and Health Administration (OSHA) list of top violations year after year.',
  'noneChances are you won’t be seeing any scented candles on the jobsite anytime soon, but what if it came in a different form? The allure of smell has conjured up some interesting scent technology in recent years. Take for example the Cyrano, a brushed-aluminum cylinder that fits in a cup holder. It’s Bluetooth-enabled and emits up to 12 scents or smelltracks that can be controlled using a smartphone app. Among the smelltracks: “Thai Beach Vacation.”',
  'The premise behind the hazard communication standard is that employees have a right to know the toxic substances and chemical hazards they could encounter while working. They also need to know the protective things they can do to prevent adverse effects of working with those substances. Here are the steps to comply with the standard.',
  'The Weitz Company has been using Procore on its projects for just under two years. Within that time frame, the national general contractor partnered with Procore to implement one of the largest technological advancements in its 163-year history.  Click here to learn more about their story and their journey with Procore.',
  'MGM Resorts International is now targeting Aug. 24 as the new opening date for the $960 million hotel and casino complex it has been building in downtown Springfield, Massachusetts.',
  'So, what trends are taking center stage this year? Below are six of the most prominent. Some of them are new, and some of them are continuations of current trends, but they are all having a substantial impact on construction and the structures people live and work in.' ]
7
8
9

3 Answers:

Answer 0 (score: 1)

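Besides using slice to limit your selection, you can also refactor the code to reuse some functionality.

Sorry, I couldn't help myself after thinking about this for a moment.

We can start with the refactor: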

const rp = require('request-promise-native');
const {load} = require('cheerio');

function scrape(uri, transform) {
  const options = {
    uri,
    transform: load
  };

  return rp(options).then(transform);
}

scrape(
  'https://jobsite.procore.com',
  ($) => $('.left-sidebar .article-title a').toArray().slice(0,6).map((linkEl) => linkEl.attribs.href)
).then((links) => Promise.all(
  links.map(
    (link) => scrape(
      `https://jobsite.procore.com/${link}`,
      ($) => $('meta[name="description"]').attr('content').trim()
    )
  )
)).then(console.log).catch(console.error);

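While this does make the code more DRY and concise, it points out one part that might need improvement: how the links are requested.

Currently it fires requests for all (or at most 6) of the links found on the original page almost at once. That may or may not be what you want, depending on how many links will be requested in the other cases you alluded to.

Another potential concern is error management. As the refactor stands, if any one of the requests fails, the results of all of them are discarded, because Promise.all rejects as soon as one of its promises rejects.

These are just a couple of points to consider if you like this approach. Both can be resolved in a variety of ways.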
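One way to address the error-management point is to give each request its own `.catch`, so a failed page yields a placeholder instead of rejecting the whole `Promise.all`. This is only a minimal sketch. It assumes a hypothetical `fetchDescription(url)` function that returns a promise (for example, a wrapper around the `scrape` call above). The name `collectDescriptions` and the `null` placeholder are choices made for illustration.

```javascript
// Hypothetical helper: fetchDescription(url) must return a promise.
// Each request gets its own catch, so one failure does not reject Promise.all.
function collectDescriptions(urls, fetchDescription) {
  return Promise.all(
    urls.map((url) =>
      fetchDescription(url).catch((err) => {
        console.error('Failed: ' + url + ' (' + err.message + ')');
        return null; // placeholder for the failed page
      })
    )
  );
}
```

The caller can then filter out the `null` entries, or retry those URLs, depending on what the newsletter needs.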

Answer 1 (score: 0)

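There's no way to stop a forEach. You can simulate a stop by checking a flag inside the forEach callback, but it will still loop through all the elements. By the way, using a loop to start IO operations is not optimal.

As you said, the best way to process a growing set of data is one item at a time, but I'd add a twist: threaded one-by-one.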
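To illustrate the point about flags: the callback still runs for every element, but the expensive work can be skipped. Slicing the array first avoids the extra iterations altogether. In this small sketch, `startRequest` is a hypothetical stand-in for the `request()` call and `limit` is an assumed parameter.

```javascript
// startRequest is a hypothetical stand-in for request(url, callback).
function withFlag(pages, limit, startRequest) {
  let started = 0;
  pages.forEach((url) => {
    if (started >= limit) return; // callback still runs, but does nothing
    started++;
    startRequest(url);
  });
}

function withSlice(pages, limit, startRequest) {
  // Only the first `limit` URLs are ever iterated.
  pages.slice(0, limit).forEach((url) => startRequest(url));
}
```

In both versions the limit is checked before a request is started, not inside the response callback, so no request beyond the limit is ever sent.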

  

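NOTE: By "thread" I don't mean actual threads. It's more a notion of "multiple lines of work". Since IO operations don't block the main thread, while one or more requests are waiting for data, the other "lines of work" can run JavaScript to process the data already received. JavaScript itself is single-threaded (not talking about WebWorkers).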

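It's as easy as having an array of pages, which receives pages to crawl on the fly, and one function that reads one page from that array, processes the result, and then returns to the starting point (loading the next page from the array and processing its result).

Now you just call that function as many times as the number of threads you want to run, and you're done. Pseudocode: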

var pages = [];

function loadNextPage() {
    if (pages.length == 0) {
        console.log("Thread ended");
        return;
    }
    var page = pages.shift(); // get the first element
    loadAndProcessPage(page, loadNextPage);
}

function loadAndProcessPage(page, callback) {
    requestOrWhatever(page, (error, data) => {
        if (error) {
            // retry or whatever
        } else {
            processData(data);
            callback();
        }
    });
}

function processData(data) {
    // Process the data and push new links to the pages array
    pages.push(data.link1);
    pages.push(data.link2);
    pages.push(data.link3);
}

console.log("Start new thread");
loadNextPage();

console.log("And another one");
loadNextPage();

console.log("And another one");
loadNextPage();

console.log("And another thread");
loadNextPage();

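This code stops when there are no more pages in the array, and if at some point there are fewer pages than threads, the extra threads close. It needs some tweaks here and there, but you get the point.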
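A runnable variant of the pseudocode above could look like the following sketch, written with promises instead of callbacks. `loadPage` is a hypothetical function, assumed to take a URL and return a promise resolving to `{ result, links }`. The names `crawl` and `workerCount` are also made up for illustration.

```javascript
// loadPage is a hypothetical function: it takes a URL and returns a promise
// that resolves to { result, links }, where links are new URLs to crawl.
async function crawl(startPages, workerCount, loadPage) {
  const queue = startPages.slice(); // pages still waiting to be crawled
  const results = [];

  async function worker() {
    // Each worker pulls the next page until the queue is empty.
    while (queue.length > 0) {
      const page = queue.shift();
      try {
        const { result, links } = await loadPage(page);
        results.push(result);
        queue.push(...links);
      } catch (err) {
        console.error('Failed: ' + page);
      }
    }
  }

  // Start workerCount "lines of work" and wait for all of them to finish.
  const workers = [];
  for (let i = 0; i < workerCount; i++) workers.push(worker());
  await Promise.all(workers);
  return results;
}
```

As in the pseudocode, a worker exits as soon as it finds the queue empty, even if another worker is about to add more pages. The sketch also does not track visited URLs, so a real crawler would need a set of seen pages to avoid loops.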

Answer 2 (score: 0)

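I'm assuming you're trying to stop execution after a certain number of pages (it looks like six in your example). As some other replies have stated, you can't stop Array.prototype.forEach() from invoking its callback, but on each invocation you can skip the request call.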

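function getMeta(pages, callback) {
    var meta = []
    var i = 0
    pages.forEach(url => {
        // maxPages is the limit you were looking for (6 in your example);
        // it is assumed to be defined elsewhere
        if (i < maxPages) {
            i++ // count requests as they are started, not as they finish
            request(url, (err, res, body) => {
                // ... Request logic
            })
        }
    })
}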

You could also use a while loop to iterate over the pages: once i hits the value you want, the loop exits without running on the remaining pages.
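The while loop idea could look like the following sketch, where `maxPages` is an assumed parameter and `startRequest` is a hypothetical stand-in for the `request()` call:

```javascript
// startRequest is a hypothetical stand-in for request(url, callback).
function getFirstPages(pages, maxPages, startRequest) {
  let i = 0;
  // Stops as soon as i reaches maxPages or the array runs out.
  while (i < maxPages && i < pages.length) {
    startRequest(pages[i]);
    i++;
  }
  return i; // number of requests started
}
```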