Question

I have 500 million objects, each of which has n contacts, as shown below:

var groupsArray = [
                    {'G1': ['C1','C2','C3'....]},
                    {'G2': ['D1','D2','D3'....]}
                     ...
                    {'G2000': ['D2001','D2002','D2003'....]}
                     ...
                ]

I have two implementations in Node.js: one based on regular promises, and the other using Bluebird, as shown below.

Regular promises

...
var groupsArray = [
                    {'G1': ['C1','C2','C3']},
                    {'G2': ['D1','D2','D3']}
                ]

function ajax(url) {
  return new Promise(function(resolve, reject) {
        request.get(url,{json: true}, function(error, data) {
            if (error) {
                reject(error);
            } else {
                resolve(data);  
            }
        });
    });
}
_.each(groupsArray,function(groupData){
    _.each(groupData,function(contactlists,groupIndex){
        // console.log(groupIndex)
        _.each(contactlists,function(contactData){
            ajax('http://localhost:3001/api/getcontactdata/'+groupIndex+'/'+contactData).then(function(result) {
                console.log(result.body);
              // Code depending on result
            }).catch(function() {
              // An error occurred
            });
        })
    })
})
...

With Bluebird, I use the concurrency option to see how to control the queue of promises:

...
_.each(groupsArray,function(groupData){
    _.each(groupData,function(contactlists,groupIndex){
        var contacts = [];
        // console.log(groupIndex)
        _.each(contactlists,function(contactData){
            contacts.push({
                contact_name: 'Contact ' + contactData
            });
        })
        groups.push({
            task_name: 'Group ' + groupIndex,
            contacts: contacts
        });
    })
})

Promise.each(groups, group => 
    Promise.map(group.contacts,
         contact => new Promise((resolve, reject) => {
                /*setTimeout(() => 
                    resolve(group.task_name + ' ' + contact.contact_name), 1000);*/
                request.get('http://localhost:3001/api/getcontactdata/'+group.task_name+'/'+contact.contact_name,{json: true}, function(error, data) {
                    if (error) {
                        reject(error);
                    } else {
                        resolve(data);  
                    }
                });
}).then(log => console.log(log.body)), 
{
    concurrency: 50
}).then(() => console.log())).then(() => {
    console.log('All Done!!');
});
...

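I would like to know how to handle 100 million API calls made inside nested loops using promises. Please advise on the best way to call the API asynchronously and deal with the responses later.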

Answer 1

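My answer uses regular Node.js promises (it can probably be adapted easily to Bluebird or another library).

You could fire off all the promises at once using Promise.all: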

var groupsArray = [
                    {'G1': ['C1','C2','C3']},
                    {'G2': ['D1','D2','D3']}
                ];


function ajax(url) {
  return new Promise(function(resolve, reject) {
        request.get(url,{json: true}, function(error, data) {
            if (error) {
                reject(error);
            } else {
                resolve(data);  
            }
        });
    });
}

Promise.all(groupsArray.map(group => ajax("your-url-here")))
    .then(results => {
        // Code that depends on all results.
    })
    .catch(err => {
        // Handle the error.
    });

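Promise.all attempts to run all of your requests in parallel. That probably won't work well when you have 500 million requests, all being attempted at the same time!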
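Note that the snippet above issues one request per group with a placeholder URL. To get one request per contact, the nested structure from the question has to be flattened first. The following is a minimal sketch, assuming the URL layout used in the question's code and a Node.js version with Array.prototype.flatMap (11 or later); `buildContactUrls` and `baseUrl` are names invented for this illustration.

```javascript
// Hypothetical helper: turn [{G1: ['C1', 'C2']}, ...] into a flat list of URLs,
// one per (group, contact) pair.
function buildContactUrls(groupsArray, baseUrl) {
    return groupsArray.flatMap(groupData =>
        Object.entries(groupData).flatMap(([groupIndex, contactList]) =>
            contactList.map(contactData =>
                `${baseUrl}/${encodeURIComponent(groupIndex)}/${encodeURIComponent(contactData)}`
            )
        )
    );
}

const urls = buildContactUrls(
    [{ G1: ['C1', 'C2'] }, { G2: ['D1'] }],
    'http://localhost:3001/api/getcontactdata'
);
console.log(urls); // three URLs: G1/C1, G1/C2 and G2/D1
```

With hundreds of millions of entries this array would itself be very large, so in practice the URLs would more likely be produced lazily than built up front.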

A more effective approach is to use the JavaScript reduce function to sequence your requests one after the other:

// ... Setup as before ...

const results = [];

groupsArray.reduce((prevPromise, group) => {
            return prevPromise.then(() => {
                return ajax("your-url-here")
                    .then(result => {
                        // Process a single result if necessary.
                        results.push(result); // Collect your results.
                    });
            });
        },
        Promise.resolve() // Seed promise.
    ) // No semicolon here, so the .then() below chains onto the reduced promise.
    .then(() => {
        // Code that depends on all results.
    })
    .catch(err => {
        // Handle the error.
    });

This example chains the promises together so that the next one starts only after the previous one has completed.
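On Node.js versions that support async/await (7.6 and later), the same one-at-a-time sequencing can be written as a plain loop. This is an equivalent sketch rather than part of the original answer; `runSequentially` and `fetchOne` are hypothetical names, with `fetchOne` standing in for the `ajax` function above.

```javascript
// Process items strictly one after another and collect the results in order.
async function runSequentially(items, fetchOne) {
    const results = [];
    for (const item of items) {
        // `await` pauses the loop until the current request has finished.
        results.push(await fetchOne(item));
    }
    return results;
}

// Usage sketch (assumes `ajax` and `groupsArray` as defined earlier):
// runSequentially(groupsArray, group => ajax("your-url-here"))
//     .then(results => { /* Code that depends on all results. */ })
//     .catch(err => { /* Handle the error. */ });
```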

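Unfortunately, the sequencing approach will be very slow, because it has to wait for each request to complete before starting the next one. While a request is in flight (making an API request takes time), your CPU sits idle when it could be working on another request!

A more efficient, but more complex, approach is to combine the two approaches above. Batch your requests so that the requests within each batch (of, say, 10) run in parallel, and then run the batches in sequence, one after another.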
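The batching idea can be sketched with nothing more than Promise.all and reduce. This is a minimal illustration under stated assumptions, not a production implementation: `chunk`, `runInBatches` and `fetchOne` are names made up here, and `fetchOne` plays the role of the `ajax` function.

```javascript
// Split an array into consecutive chunks of at most `size` items.
function chunk(items, size) {
    const chunks = [];
    for (let i = 0; i < items.length; i += size) {
        chunks.push(items.slice(i, i + size));
    }
    return chunks;
}

// Run `fetchOne` over all items: in parallel within a batch (Promise.all),
// one batch after another (reduce). Resolves with results in input order.
function runInBatches(items, batchSize, fetchOne) {
    return chunk(items, batchSize).reduce(
        (prevPromise, batch) =>
            prevPromise.then(results =>
                Promise.all(batch.map(item => fetchOne(item)))
                    .then(batchResults => {
                        results.push(...batchResults);
                        return results;
                    })
            ),
        Promise.resolve([]) // Seed promise carrying an empty result list.
    );
}

// Usage sketch (assumes `ajax` as defined earlier and a list of URLs):
// runInBatches(urls, 10, url => ajax(url)).then(results => { /* ... */ });
```

One consequence of this design is that each batch waits for its slowest request before the next batch starts, so throughput depends on how evenly the response times are distributed.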

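Implementing this yourself with a combination of Promise.all and the reduce function is tricky (although it is a great learning exercise), so I suggest using the library async-await-parallel. There are many such libraries, but I use this one; it works well and easily does the job you want.

You can install the library like this: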

npm install --save async-await-parallel

Here is how you would use it:

const parallel = require("async-await-parallel");

// ... Setup as before ...

const batchSize = 10;

parallel(groupsArray.map(group => {
        return () => { // We need to return a 'thunk' function, so that the jobs can be started when they are needed, rather than all at once.
            return ajax("your-url-here");
        }
    }), batchSize)
    .then(() => {
        // Code that depends on all results.
    })
    .catch(err => {
        // Handle the error.
    });

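This is better, but it is still a clunky way to make such a large number of requests! Maybe you need to up the ante and consider investing time in proper asynchronous job management.

I have been using Kue lately to manage a cluster of worker processes. Using Kue with the Node.js cluster library gives you proper parallelism on a multi-core PC, and you can then easily extend it to multiple cloud-based VMs if you need even more processing power.

See my answer here for some Kue example code.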

Answer 2

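It seems to me you have two problems rolled into one question, so I'll decouple them.

#1 Loading a large dataset

Operating on such a large dataset (500 million records) will sooner or later run into memory limits: Node.js runs in a single thread and is limited to approximately 1.5GB of memory, after which your process will crash.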

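To avoid that, read your data as a stream from a CSV file. I'll use scramjet, as it will also help with the second problem, but JSONStream or papaparse would do pretty well too:

$ npm install --save scramjet

Then let's read the data, which I assume comes from a CSV file:

const {StringStream} = require("scramjet");

const stream = require("fs")
    .createReadStream(pathToFile)
    .pipe(new StringStream('utf-8'))
    .csvParse()

Now we have a stream of objects that returns the data line by line, but only as we read it. Problem #1 is solved; now to "augment" the stream:

#2 Stream data asynchronous augmentation

No worries, that is just what you are doing: for every line of data you want to fetch some additional information from some API (hence "augment"), which by default is asynchronous.

That is where scramjet kicks in, with just a couple of additional lines:

stream
    .flatMap(groupData => Object.entries(groupData))
    .flatMap(([groupIndex, contactList]) => contactList.map(contactData => ([contactData, groupIndex])))
    // now you have a simple stream of entries for your call
    .map(([contactData, groupIndex]) => ajax('http://localhost:3001/api/getcontactdata/'+groupIndex+'/'+contactData))
    // and here you can print or do anything you like with your data stream
    .each(console.log)

After this you need to accumulate the data or output it to a stream. There are numerous options, for example: .toJSONArray().pipe(fileStream).

Using scramjet you can split the process across multiple lines without much impact on performance. You can also control concurrency and, best of all, all of this runs with a minimal memory footprint, much faster than if you were to load the whole dataset into memory.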
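The same idea (pull entries lazily and keep only a bounded number of requests in flight) can also be sketched with plain async iterators on Node.js 10 or later. This is an illustration of the principle, not scramjet's implementation; `forEachWithLimit` and `contactEntries` are hypothetical names, and the worker passed in would wrap the `ajax` call from the question.

```javascript
// Keep at most `limit` calls to `worker` in flight while pulling items
// lazily from an async iterable, so the full data set never sits in memory.
async function forEachWithLimit(source, limit, worker) {
    const iterator = source[Symbol.asyncIterator]();

    // Each "lane" repeatedly takes the next item and processes it.
    async function runLane() {
        for (;;) {
            const { value, done } = await iterator.next();
            if (done) return;
            await worker(value);
        }
    }

    await Promise.all(Array.from({ length: limit }, () => runLane()));
}

// Lazily yields [groupIndex, contactData] pairs, one at a time.
// `groups` may be an array or any (async) iterable of group objects.
async function* contactEntries(groups) {
    for await (const groupData of groups) {
        for (const [groupIndex, contactList] of Object.entries(groupData)) {
            for (const contactData of contactList) {
                yield [groupIndex, contactData];
            }
        }
    }
}

// Usage sketch (assumes `ajax` and `groupsArray` from the question):
// forEachWithLimit(contactEntries(groupsArray), 50, ([groupIndex, contactData]) =>
//     ajax('http://localhost:3001/api/getcontactdata/' + groupIndex + '/' + contactData)
//         .then(result => console.log(result.body)));
```

In this sketch a rejected worker call rejects the returned promise while the other lanes keep running, so real code would need its own retry or cancellation policy.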

Let me know if this is helpful. Your question is quite complex, so let me know if you run into any problems; I'll be happy to help. :)

Best way to call an API inside a for loop using Promises
