Synchronizing multiple requests in a for loop with Node.js

Date: 2017-12-26 18:42:44

Tags: javascript node.js web-scraping request x-ray

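I'm just starting out with JavaScript, and I need help figuring out how to make this code run synchronously while looping through a for loop. Basically, I'm making multiple POST requests inside for loops, then scraping the data with the X-Ray library, and finally saving the result to a Mongo database. The output is fine, but it comes out unordered and then suddenly hangs, and I have to force-close it with Ctrl+C. This is my function: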

function getdata() {
  const startYear = 1996;
  const currentYear = 1998; // new Date().getFullYear()

  for (let i = startYear; i <= currentYear; i++) {
    for (let j = 1; j <= 12; j++) {
      if (i === startYear) {
        j = 12;
      }

      // Form to be sent
      const form = {
        year: `${i}`,
        month: `${j}`,
        day: '01',
      };

      const formData = querystring.stringify(form);
      const contentLength = formData.length;

      // Make HTTP Request
      request({
        headers: {
          'Content-Length': contentLength,
          'Content-Type': 'application/x-www-form-urlencoded',
        },
        uri: 'https://www.ipma.pt/pt/geofisica/sismologia/',
        body: formData,
        method: 'POST',
      }, (err, res, html) => {
        if (!err && res.statusCode === 200) {
          // Scraping data with X-Ray
          x(html, '#divID0 > table > tr', {
            date: '.block90w',
            lat: 'td:nth-child(2)',
            lon: 'td:nth-child(3)',
            prof: 'td:nth-child(4)',
            mag: 'td:nth-child(5)',
            local: 'td:nth-child(6)',
            degree: 'td:nth-child(7)',
          })((error, obj) => {
            const result = {
              date: obj.date,
              lat: obj.lat.replace(',', '.'),
              lon: obj.lon.replace(',', '.'),
              prof: obj.prof == '-' ? null : obj.prof.replace(',', '.'),
              mag: obj.mag.replace(',', '.'),
              local: obj.local,
              degree: obj.degree,
            };

            // console.log(result);

            upsertEarthquake(result); // save to DB
          });
        }
      });
    }
  }
}

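I think I have to use promises or callbacks, but I can't figure out how to do this, and I've already tried async/await without success. If any additional information is needed, please tell me. Thanks.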

3 answers:

Answer 0 (score: 1)

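You are calling request inside a loop.

An async function is one whose result arrives after the main thread's logic has finished (a.k.a. the response is received in a callback function).

So, if we have this: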

for (var i = 0; i < 12; i++) {
  request({
    data: i
  }, function(error, data) {
    // This is the request result, inside a callback function
  });
}

The logic will run through all 12 request calls before any callback is invoked, so the callbacks pile up and are called only after the whole main loop has finished.

Without getting into ES6 generators (I think they make this more complicated, and learning what is happening at a low level is better for you), what you have to do is call request, wait for its callback function to be invoked, and then call the next request. How? There are many ways, but I usually do it like this:

var i = 0;

function callNext() {
  if (i >= 12) {
    requestEnded();
  } else {
    request({
      data: i++ // Increment the counter as we are not inside a for loop that increments it
    }, function(error, data) {
      // Do something with the data, and also check if an error was received and act accordingly,
      // which is very much possible when talking about internet requests
      console.log(error, data);
      // Call the next request inside the callback, so we are sure that the next request
      // runs just after this request has ended
      callNext();
    });
  }
}

callNext();

function requestEnded() {
  console.log("Yay");
}

Here you can see the logic. You have a function named callNext, which either makes the next call or calls requestEnded if no more calls are needed.

When request is called inside callNext, it waits for the callback to be received (which happens asynchronously, at some point in the future), processes the received data, and then, inside the callback, you tell it to call callNext again.
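
The difference between the two patterns can be tried without any network access. The following is a minimal sketch, assuming a hypothetical fakeRequest that stands in for request and simply calls back after a random delay; it is not part of any library.

```javascript
// Hypothetical stand-in for `request`: calls back after a random delay.
function fakeRequest(options, callback) {
  const delay = Math.floor(Math.random() * 20);
  setTimeout(() => callback(null, options.data), delay);
}

// Pattern 1: fire everything inside the loop. All callbacks run after the
// loop has finished, in whatever order the responses happen to arrive.
function runInLoop(count, done) {
  const finished = [];
  for (let i = 0; i < count; i++) {
    fakeRequest({ data: i }, (error, data) => {
      finished.push(data);
      if (finished.length === count) done(finished);
    });
  }
}

// Pattern 2: start the next request only from inside the previous callback,
// so the results always arrive in the order they were requested.
function runInSequence(count, done) {
  const finished = [];
  let i = 0;
  function callNext() {
    if (i >= count) return done(finished);
    fakeRequest({ data: i++ }, (error, data) => {
      finished.push(data);
      callNext();
    });
  }
  callNext();
}

runInLoop(5, order => console.log('loop order:', order));
runInSequence(5, order => console.log('sequential order:', order));
```

The loop version usually prints a shuffled order, while the sequential version always prints 0 through 4 in order, at the cost of running only one request at a time.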

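Answer 1 (score: -1)

You can create an array of year/month pairs from the start year to the end year, map it to the configs for your requests, and then map the results to the data that x-ray returns (x-ray returns a promise-like object, so no callback is needed). Then put the result of the scrape into MongoDB using a function that returns a promise.

If something rejects, create a Fail type object and resolve with that object.

Start all the requests, x-ray calls, and Mongo calls in parallel using Promise.all, but limit the number of active requests using throttle.

Here is what that looks like in code: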

//you can get library containing throttle here:
//  https://github.com/amsterdamharu/lib/blob/master/src/index.js
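const lib = require('lib');
//assumed to be loaded the same way as in the question's code:
const querystring = require('querystring');
const request = require('request');
const x = require('x-ray')();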
const Fail = function(details){this.details=details;};
const isFail = o=>(o&&o.constructor)===Fail;
const max10 = lib.throttle(10);
const range = lib.range;
const createYearMonth = (startYear,endYear)=>
  range(startYear,endYear)
  .reduce(
    (acc,year)=>
      acc.concat(
        range(1,12).map(month=>({year,month}))
      )
    ,[]
  );
const toRequestConfigs = yearMonths =>
  yearMonths.map(
    yearMonth=>{
      const formData = querystring.stringify(yearMonth);
      return {
        headers: {
          'Content-Length': formData.length,
          'Content-Type': 'application/x-www-form-urlencoded',
        },
        uri: 'https://www.ipma.pt/pt/geofisica/sismologia/',
        body: formData,
        method: 'POST',
      };
    }
  );
const scrape = html =>
  //x-ray returns a promise-like object, so no callback is needed:
  // https://github.com/matthewmueller/x-ray#xraythencb
  x(
    html, 
    '#divID0 > table > tr', 
    {
      date: '.block90w',
      lat: 'td:nth-child(2)',
      lon: 'td:nth-child(3)',
      prof: 'td:nth-child(4)',
      mag: 'td:nth-child(5)',
      local: 'td:nth-child(6)',
      degree: 'td:nth-child(7)'
    }
  );
const requestAsPromise = config =>
  new Promise(
    (resolve,reject)=>
      request(
        config,
        (err,res,html)=>
          (!err && res.statusCode === 200) 
            ? resolve(html)
            //err is null when only the status code is wrong, so reject with a descriptive error
            : reject(err || new Error(`Unexpected status code ${res.statusCode}`))
      )
  );
const someMongoStuff = scrapeResult =>
  //do mongo stuff and return promise
  scrapeResult;
const getData = (startYear,endYear) =>
  Promise.all(
    toRequestConfigs(
      createYearMonth(startYear,endYear)
    )
    .map(
      config=>
        //maximum 10 active requests
        max10(requestAsPromise)(config)
        .then(scrape)
        .then(someMongoStuff)
        .catch(//if something goes wrong create a Fail type object
          err => new Fail([err,config.body])
        )
    )
  )
//how to use:
getData(1980,1982)
.then(//will always resolve unless toRequestConfigs or createYearMonth throws
  result=>{
    //items that were successful
    const successes = result.filter(item=>!isFail(item));
    //items that failed
    const failed = result.filter(isFail);
  }
)
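
The Fail pattern used in getData can be tried in isolation. The sketch below assumes a hypothetical fakeStep that stands in for the whole request, scrape, and save chain; it only illustrates why Promise.all keeps going when each promise catches its own rejection.

```javascript
// Marker type for failed items, as in the answer above.
const Fail = function (details) { this.details = details; };
const isFail = o => (o && o.constructor) === Fail;

// Hypothetical stand-in for one request/scrape/save chain:
// even inputs succeed, odd inputs reject.
const fakeStep = n =>
  n % 2 === 0
    ? Promise.resolve(`ok ${n}`)
    : Promise.reject(new Error(`failed ${n}`));

// Each promise catches its own rejection and resolves with a Fail instead,
// so Promise.all never short-circuits on the first error.
const runAll = inputs =>
  Promise.all(
    inputs.map(n => fakeStep(n).catch(err => new Fail([err.message, n])))
  );

runAll([0, 1, 2, 3]).then(results => {
  const successes = results.filter(item => !isFail(item));
  const failed = results.filter(isFail);
  console.log('succeeded:', successes);               // the values for 0 and 2
  console.log('failed:', failed.map(f => f.details)); // the details for 1 and 3
});
```

Without the catch on each item, the first rejection would reject the whole Promise.all and the results of the other requests would be lost to the caller.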

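When scraping a lot, the target site may not allow more than x requests in a period y, and may start blacklisting your IP and denying service if you keep going over that.

Let's say you want to limit yourself to 10 requests per 5 seconds; then you can change the code above to: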

const max10 = lib.throttlePeriod(10,5000);

The rest of the code is the same.
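
The throttlePeriod helper comes from the linked external library, and its implementation is not shown here. For readers who prefer not to depend on it, the following is a rough sketch of one possible approach. The name limitPerPeriod and its behaviour are assumptions made for illustration, not the library's code, and timer drift means the limit is only approximate.

```javascript
// Hypothetical helper (not the `lib` implementation): returns a wrapper that
// schedules calls so that roughly at most `max` of them start per `periodMs`.
function limitPerPeriod(max, periodMs) {
  const starts = []; // scheduled start times of the last `max` calls
  return fn => (...args) => {
    const now = Date.now();
    // Once `max` calls are scheduled, the next one must wait until a full
    // period has passed since the oldest of them.
    const earliest = starts.length === max ? starts[0] + periodMs : now;
    const startAt = Math.max(now, earliest);
    starts.push(startAt);
    if (starts.length > max) starts.shift();
    return new Promise(resolve => setTimeout(resolve, startAt - now))
      .then(() => fn(...args));
  };
}

// Usage: about 2 calls per 100 ms. `fakeRequest` stands in for a real request.
const max2 = limitPerPeriod(2, 100);
const began = Date.now();
const fakeRequest = id => {
  console.log(`request ${id} started after ~${Date.now() - began} ms`);
  return Promise.resolve(id);
};
const demo = Promise.all([1, 2, 3, 4, 5].map(id => max2(fakeRequest)(id)));
```

The wrapper is applied the same way as max10 in the answer, so max2(fakeRequest)(id) mirrors max10(requestAsPromise)(config).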

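Answer 2 (score: -1)

Your problem is that you have async methods inside a sync for loop.

A simple way to solve this is to use the ES2017 async/await syntax.

Assuming you want each iteration to pause until upsertEarthquake(result) has been called, you should change your code to something like this: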

async function getdata() {
    const startYear = 1996;
    const currentYear = 1998; // new Date().getFullYear()

    for (let i = startYear; i <= currentYear; i++) {
        for (let j = 1; j <= 12; j++) {
            if (i === startYear)
                j = 12; 

            // Form to be sent
            const form = {
                year: `${i}`,
                month: `${j}`,
                day: '01',
            };

            const formData = querystring.stringify(form);
            const contentLength = formData.length;
            //Make HTTP Request
            await new Promise((next, reject)=> { 
                request({
                    headers: {
                        'Content-Length': contentLength,
                        'Content-Type': 'application/x-www-form-urlencoded',
                    },
                    uri: 'https://www.ipma.pt/pt/geofisica/sismologia/',
                    body: formData,
                    method: 'POST',
                }, (err, res, html) => {
                    if (err || res.statusCode !== 200)
                        return next() //If there is an error jump to the next

                    //Scraping data with X-Ray
                    x(html, '#divID0 > table > tr', {
                        date: '.block90w',
                        lat: 'td:nth-child(2)',
                        lon: 'td:nth-child(3)',
                        prof: 'td:nth-child(4)',
                        mag: 'td:nth-child(5)',
                        local: 'td:nth-child(6)',
                        degree: 'td:nth-child(7)',
                    })((error, obj) => {
                        const result = {
                            date: obj.date,
                            lat: obj.lat.replace(',', '.'),
                            lon: obj.lon.replace(',', '.'),
                            prof: obj.prof == '-' ? null : obj.prof.replace(',', '.'),
                            mag: obj.mag.replace(',', '.'),
                            local: obj.local,
                            degree: obj.degree,
                        }
                        //console.log(result);
                        upsertEarthquake(result); // save to DB
                        next() //This makes jump to the next for... iteration
                    })

                }) 
            })
        }
    }
}

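I assume that upsertEarthquake is either an asynchronous function or of the fire-and-forget type.

If there is an error you can use next(), but if you want to break out of the loop, use reject() instead: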

if (err || res.statusCode !== 200)
    return reject(err)
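
The effect of reject() on the loop can be seen in a small, network-free sketch. Here fetchMonth is a hypothetical stand-in for the request, scrape, and save step, and the failOn parameter exists only to trigger a failure for the demonstration.

```javascript
// Hypothetical stand-in for the request + scrape + save step of one month.
// It resolves after a short delay, or rejects for the month named in `failOn`.
function fetchMonth(year, month, failOn) {
  return new Promise((next, reject) => {
    setTimeout(() => {
      const label = `${year}-${month}`;
      if (label === failOn) return reject(new Error(`HTTP error for ${label}`));
      next(label);
    }, 5);
  });
}

async function getData(startYear, endYear, failOn) {
  const done = [];
  try {
    for (let year = startYear; year <= endYear; year++) {
      for (let month = 1; month <= 12; month++) {
        // The loop pauses here until the promise settles.
        done.push(await fetchMonth(year, month, failOn));
      }
    }
  } catch (err) {
    // A rejection surfaces as an exception at the `await`, which ends both loops.
    console.log('stopped early:', err.message);
  }
  return done;
}

// Logs the months that completed before the simulated failure in 1996-3.
getData(1996, 1997, '1996-3').then(done => console.log(done));
```

Calling next() on an error would instead let the loop continue with the following month, which is the behaviour of the longer listing above.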