How to pipe a stream after an async call without losing data?

Date: 2019-01-15 13:56:17

Tags: node.js stream

In my application, I want to be able to perform the following steps:

  1. Get a read stream;
  2. Wait for an async function to finish;
  3. Pipe the stream into destination 1;
  4. Wait for another async function to finish;
  5. Pipe destination 1 into destination 2.

I expect all of the following to hold:

  1. Stream processing starts only after step #5;
  2. No data is lost;
  3. The whole thing fully resolves when stream processing ends (.on("finish")).

Before getting to any questions, here is a code sample:

// Assumed module-level setup (not shown in the question):
// const fs = require("fs");
// const through = require("through"); // the `through` package
return new Promise(resolve => {
    logger.debug("Creating a stream");
    const stream = fs.createReadStream("/home/username/dev/resources/ex.tar.bz2");

    setTimeout(() => {
        logger.debug("Attaching pipe 1");
        const pipe1 = stream.pipe(
            through(
                function(data) {
                    logger.info("DATA in PIPE 1");
                    this.queue(data);
                },
                function() {
                    logger.info("END in PIPE 1");
                    this.queue(null);
                }
            )
        );

        stream.pause(); // LINE 1

        setTimeout(() => {
            logger.debug("Attaching pipe 2");
            const pipe2 = pipe1.pipe(
                through(
                    function() {
                        logger.info("DATA in PIPE 2");
                    },
                    function() {
                        logger.info("END in PIPE 2");
                        resolve();
                    }
                )
            );

            pipe2.resume(); // LINE 2
        }, 1000);
    }, 1000);
});

In this code, if both LINE 1 and LINE 2 are removed, the code does not work (it prints DATA in PIPE 1 and END in PIPE 1, and never resolves), because:

  • attaching destination 1 starts the flow of data (see the quick check after this list);
  • if I understand it correctly, by the time destination 2 is attached, the data has already been consumed.
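
As far as I can tell, this is Node's documented flowing/paused behavior: calling pipe() is one of the ways a readable stream switches into flowing mode. A quick check (reusing the file path from the sample above):

const fs = require("fs");
const { PassThrough } = require("stream");

const stream = fs.createReadStream("/home/username/dev/resources/ex.tar.bz2");
console.log(stream.readableFlowing); // null: no consuming mechanism attached yet
stream.pipe(new PassThrough());
console.log(stream.readableFlowing); // true: pipe() switched the source into flowing mode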

If both LINE 1 and LINE 2 are present, the code appears to work correctly (it prints DATA in PIPE 1, DATA in PIPE 2, END in PIPE 1, END in PIPE 2, and resolves), because:

  • LINE 1 stops the flow of data from stream;
  • attaching destination 2 (somewhat confusingly) does not start the flow from the original source;
  • LINE 2 starts the flow of data.

According to the Node.js documentation:

  If there are piped destinations, then calling stream.pause() will not guarantee that the stream will remain paused once those destinations drain and ask for more data.

Which brings me to my main question: is it possible to reliably achieve exactly what I am trying to do (with async calls between the pipes)?

Bonus questions:

  1. I suspect the proper way to use pipes is to make sure all the required async calls finish before constructing the whole pipeline in one go. Is my guess correct?
  2. Why does attaching destination 2 not trigger the flow, while attaching destination 1 does?
  3. The code works equally well if I replace LINE 2 with pipe1.resume() or stream.resume(); I'd guess this extends to any number of pipes. Why can the original stream be resumed by calling .resume() on an arbitrary pipe in the chain? How does this resume differ from the resume that happens when a pipe is attached (it evidently behaves differently)?

1 Answer:

Answer 0 (score: 2):

You are experiencing the node stream variant of Heisenberg's uncertainty principle - the act of observing the stream changes the behavior of the stream.

Before doing anything else, remove the through stream implementation (although it is very simple, it can itself influence the behavior). Let's use the built-in PassThrough stream instead, which we know has no side effects:

logger.debug("Attaching pipe 1");
const pipe1 = new PassThrough();
stream.pipe(pipe1);
pipe1.on('data', data => logger.info('DATA in PIPE 1')); 
pipe1.on('end', () => logger.info('END in PIPE 1')); 


// ...

logger.debug("Attaching pipe 2");
const pipe2 = new PassThrough();
pipe1.pipe(pipe2);
pipe2.on('data', data => logger.info('DATA in PIPE 2')); 
pipe2.on('end', () => {
    logger.info('END in PIPE 2');
    resolve();
}); 

Output:

Creating a stream
Attaching pipe 1
DATA in PIPE 1
END in PIPE 1
Attaching pipe 2
END in PIPE 2

So, with no pause/resume statements, this works (it shouldn't hang forever; I'm not sure why you're seeing that behavior); however, there is no data in pipe2. And it certainly didn't wait around or buffer anything.

The issue is that by attaching an on('data') handler (which is something that through also does), you are informing the stream that it has a way to consume data - it does not need to buffer anything. When we add the pipe to pipe2, it does start piping immediately - there's just no data left to pipe, because we already consumed it.
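
You can see the same thing via the readableFlowing property that Node exposes on every readable stream (a quick sketch; null means no consuming mechanism has been attached yet):

const fs = require("fs");

const s = fs.createReadStream("/home/username/dev/resources/ex.tar.bz2");
console.log(s.readableFlowing); // null: nothing is consuming yet
s.on("data", () => {});         // a 'data' listener counts as a consumer...
console.log(s.readableFlowing); // true: ...so the stream switches to flowing mode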

Try commenting out the data handler for pipe1:

//pipe1.on('data', data => logger.info('DATA in PIPE 1'));

Now we get exactly what we'd expect:

Creating a stream
Attaching pipe 1
Attaching pipe 2
DATA in PIPE 2
END in PIPE 1
END in PIPE 2

Now, when we create the read stream, it immediately starts reading (into the buffer); we attach pipe1, which immediately begins piping data (into pipe1's internal buffer); then we attach pipe2, which immediately begins piping data (into pipe2's internal buffer). You could continue this indefinitely, eventually piping into a write stream and pumping the data to disk or into an HTTP response, etc.
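
To make that concrete, here is a minimal sketch of the pattern (asyncStep is a placeholder for the real async work between the pipes; the file path is reused from the question):

const fs = require("fs");
const { PassThrough } = require("stream");

// Placeholder standing in for the real async work between the pipes.
const asyncStep = () => new Promise(res => setTimeout(res, 1000));

async function run() {
    const stream = fs.createReadStream("/home/username/dev/resources/ex.tar.bz2"); // step 1

    await asyncStep();                  // step 2

    const pipe1 = new PassThrough();
    stream.pipe(pipe1);                 // step 3: pipe1 buffers; backpressure holds the rest upstream

    await asyncStep();                  // step 4

    const pipe2 = new PassThrough();
    pipe1.pipe(pipe2);                  // step 5

    await new Promise((resolve, reject) => {
        pipe2.on("data", () => {});     // consume only here, at the end of the chain
        pipe2.on("end", resolve);       // fires once all data has flowed through both pipes
        pipe2.on("error", reject);
        stream.on("error", reject);
    });
}

Because nothing consumes the chain until the very end, no data is lost, and backpressure keeps memory usage bounded by the streams' high-water marks rather than by the file size.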