Question

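I am reading a file (300,000 lines) in node.js. I want to send the lines to another application (Elasticsearch) in batches of 5,000 so that it can store them. So whenever I have read 5,000 lines, I want to send them in bulk to Elasticsearch through its API, then continue reading the rest of the file and keep sending every 5,000 lines in bulk.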

If I wanted to use Java (or any other blocking language such as C, C++, Python, etc.) for this task, I would do something like this:

int countLines = 0;
String bulkString = "";
String currentLine;
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("filePath.txt")));
while ((currentLine = br.readLine()) != null) {
     countLines++;
     // readLine() strips the line terminator, so add it back between lines
     bulkString += currentLine + "\n";
     if(countLines >= 5000){
          //send bulkString to Elasticsearch via APIs (blocks until the call returns)
          countLines = 0;
          bulkString = "";
     }
}
br.close();

If I want to do the same thing with node.js, I would do this:

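var fs = require('fs');
var readline = require('readline');
// client is assumed to be an already configured Elasticsearch client

var countLines = 0;
var bulkString = "";
var instream = fs.createReadStream('filePath.txt');
var rl = readline.createInterface({ input: instream });
rl.on('line', function(line) {
     countLines++;
     bulkString += line + "\n";
     if(countLines >= 5000){
          //send bulkString to Elasticsearch via APIs
          //this call returns immediately; reading continues before the response arrives
          client.bulk({
          index: 'indexName',
          type: 'type',
          body: [bulkString]
          }, function (error, response) {
          //task is done
          });
          countLines = 0;
          bulkString = "";
     }
});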

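The problem with node.js is that it is non-blocking, so it does not wait for the first API response before sending the next batch of lines. I know this could count as a benefit of node.js, because it does not wait for I/O, but the problem is that it sends too much data to Elasticsearch. As a result, Elasticsearch's queue fills up and it throws exceptions.

My question is how I can make node.js wait for the response from the API before it continues reading the next lines, or before it sends the next batch of lines to Elasticsearch.

I know I could set some parameters in Elasticsearch to increase the queue size, but I am interested in the blocking behavior of node.js. I am familiar with the concept of callbacks, but I cannot think of a way to use callbacks in this scenario to prevent node.js from calling the Elasticsearch API in non-blocking mode.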

Answer 1

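Pierre's answer is correct. I just want to submit some code that shows how we can benefit from the non-blocking nature of node.js while, at the same time, not overwhelming Elasticsearch with too many requests at once.

Here is some pseudocode that gives you flexibility by letting you set a queue size limit: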

var countLines = 0;
var bulkString = "";
var queueSize = 3;//maximum of 3 requests will be sent to the Elasticsearch server
var batchesAlreadyInQueue = 0;
var instream = fs.createReadStream('filePath.txt');
var rl = readline.createInterface({ input: instream });
rl.on('line', function(line) {
     countLines++;
     bulkString += line + "\n";
     if(countLines >= 5000){
          //send bulkString to Elasticsearch via APIs
          batchesAlreadyInQueue++;//one more request is now in flight
          client.bulk({
          index: 'indexName',
          type: 'type',
          body: [bulkString]
          }, function (error, response) {
               //task is done
               batchesAlreadyInQueue--;//we will decrease a number of requests that are already sent to the Elasticsearch when we hear back from one of the requests
               rl.resume();
          });
          if(batchesAlreadyInQueue >= queueSize){
               //too many requests in flight: stop reading until one of them returns
               rl.pause();
          }
          countLines = 0;
          bulkString = "";
     }
});
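
The same "wait before reading on" behavior can also be written without manual pause/resume bookkeeping. This is only a minimal sketch, assuming a Node.js version in which the readline interface can be consumed with for await (async iteration). The names processInBatches and sendBatch are hypothetical: sendBatch stands for any caller-supplied async function, for example a wrapper around client.bulk(), and is not part of any library.

```javascript
const fs = require('fs');
const readline = require('readline');

// Reads `filePath` line by line and calls `sendBatch(lines)` once per
// `batchSize` lines, waiting for each call to settle before reading on.
// `sendBatch` is a hypothetical async function supplied by the caller.
async function processInBatches(filePath, batchSize, sendBatch) {
  const rl = readline.createInterface({
    input: fs.createReadStream(filePath),
    crlfDelay: Infinity, // treat \r\n as a single line break
  });

  let batch = [];
  let batchesSent = 0;
  for await (const line of rl) {
    batch.push(line);
    if (batch.length >= batchSize) {
      await sendBatch(batch); // no further batch is sent until this resolves
      batchesSent++;
      batch = [];
    }
  }
  if (batch.length > 0) {
    // flush the final, partial batch
    await sendBatch(batch);
    batchesSent++;
  }
  return batchesSent;
}
```

With this shape only one request is in flight at a time, which corresponds to queueSize = 1 in the code above. It trades some throughput for simplicity.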

Answer 2

Use rl.pause() right after your if, and rl.resume() after your //task is done.

Note that you may still receive a few more line events after calling pause.
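
Because line events can still fire after pause(), lines that arrive while a request is in flight have to be kept rather than dropped. The following is a minimal sketch of one way to do that, not a definitive implementation. The names readWithPause, sendBatch and done are hypothetical: sendBatch(lines, callback) stands for any callback-style sender such as a wrapper around client.bulk(). The check on closed before calling pause() or resume() is a defensive assumption, so the interface is not touched after it has closed.

```javascript
const fs = require('fs');
const readline = require('readline');

// Sends lines in batches of `batchSize`, one request at a time.
// `sendBatch(lines, callback)` and `done()` are hypothetical, caller-supplied.
function readWithPause(filePath, batchSize, sendBatch, done) {
  const rl = readline.createInterface({ input: fs.createReadStream(filePath) });
  let pending = [];    // unsent lines, including ones that arrive after pause()
  let sending = false; // true while a request is in flight
  let closed = false;  // true once the whole file has been read

  function flush() {
    if (sending) return;
    if (pending.length >= batchSize || (closed && pending.length > 0)) {
      sending = true;
      if (!closed) rl.pause();
      const batch = pending.splice(0, batchSize);
      sendBatch(batch, function () {
        sending = false;
        flush(); // send again right away if enough lines piled up meanwhile
        if (!sending && !closed) rl.resume();
      });
    } else if (closed) {
      done(); // everything has been read and sent
    }
  }

  rl.on('line', function (line) {
    pending.push(line); // never drop a line, even while paused
    flush();
  });
  rl.on('close', function () {
    closed = true;
    flush();
  });
}
```

Lines that slip through after pause() simply wait in pending and go out with the next batch, so no data is lost and at most one request is outstanding.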
