内存效率不断提高的Node.js双工/转换流

时间:2019-06-26 09:49:03

标签: node.js

我正在尝试通过流在特定索引处将变量添加到模板中。

image

这个想法是,我有一个可读流,并且有一个变量列表,这些变量可以是可读流,缓冲区或大小不确定的字符串。可以将这些变量插入预定义的索引列表中。根据我的假设以及到目前为止我所做的尝试,我有几个问题。

我的第一个尝试是使用可读流手动进行操作。但是,在const buffer = templateIn.read(size)尝试读取之前,我无法template combined(因为缓冲区仍然为空)。该问题的解决方案类似于您使用转换流的方式,因此这是我下一步的工作。

但是,我对转换流有问题。我的问题是,伪代码之类的东西会将缓冲区堆积到内存中,直到调用done()为止。

public _transform(chunk: Buffer, encoding: string, done: (err?: Error, data?: any) => void ): void {
    let index = 0;
    while (index < chunk.length) {
        if (index === this.variableIndex) {  // the basic idea (the actual logic is a bit more complex)
            this.insertStreamHere(index);
            index++;
        } else {
            // continue reading stream normally
        }
    }
    done()
}
  

发件人:https://github.com/nodejs/node/blob/master/lib/_stream_transform.js

     

在转换流中,写入的数据放置在缓冲区中。什么时候    _read(n)被调用,它转换排队的数据,调用    缓冲的_write cb,因为它消耗了块。如果只吃一次    写入的块将导致多个输出块,然后是第一个    输出的位调用readcb,随后的块进入    读取缓冲区,并在必要时使它发出“可读”消息。

     

这样,反压实际上是由读取侧确定的,    因为必须调用_read才能开始处理新的块。然而,    病理性膨胀类型的转换会导致过多的缓冲    这里。例如,想象一个流,其中输入的每个字节都是    解释为0-255之间的整数,然后导致那么多    输出字节。写入4个字节{ff,ff,ff,ff}将导致    正在输出1kb的数据。在这种情况下,您可以写一个很小的    大量的输入,最终得到大量的输出。在    这样的病态膨胀机制,无法说明    系统停止进行转换。一个4MB的写入可以    导致系统内存不足。

所以TL; DR :如何在特定索引处插入(大)流,而又不给内存中的缓冲区带来巨大的背压。任何建议表示赞赏。

1 个答案:

答案 0 :(得分:0)

大量阅读文档和源代码后,进行了大量的反复试验和一些测试。我已经为我的问题提出了解决方案。我可以复制并粘贴我的解决方案,但是为了完整起见,我将在这里解释我的发现。

使用管道处理背压包括几个部分。我们有Readable可以将数据写入WritableReadableWritable提供了一个回调,通过它可以告诉Readable准备接收新的数据块。阅读部分比较简单。 Readable有一个内部缓冲区。使用Readable.push()会将数据添加到缓冲区。读取数据时,它将来自此内部缓冲区。紧接着,我们可以使用Readable.readableHighWaterMarkReadable.readableLength来确保不会一次推送大量数据。

Readable.readableHighWaterMark - Readable.readableLength

是我们应推送到此内部缓冲区的最大字节数。

因此,这意味着,由于我们要同时读取两个Readable流,因此需要两个Writable流来控制流。要合并数据,我们需要自己对其进行缓冲,因为据我所知,Writable流中没有内部缓冲区。因此,双工流将是最佳选择,因为我们要处理自身的缓冲,写入和读取。

写作

现在让我们开始编写代码。为了控制多个流的状态,我们将创建一个状态接口。如下所示:

declare type StreamCallback = (error?: Error | null) => void;

interface MergingState {
    callback: StreamCallback;
    queue: BufferList;
    highWaterMark: number;
    size: number;
    finalizing: boolean;
}

callback保留由write或final提供的最后一个回调(我们将在稍后讨论final)。 highWaterMark表示我们的queue的最大大小,而size是我们当前队列的大小。最后,finalizing标志指示当前队列是最后一个队列。因此,一旦队列为空,我们就完成了读取属于该状态的流的操作。

BufferList是用于内置流的internal Nodejs implementation的副本。

如前所述,可写程序处理背压,因此这两个可写程序的通用方法如下:

/**
 * Method to write to provided state if it can
 *
 * (Will unshift the bytes that cannot be written back to the source)
 *
 * @param src the readable source that writes the chunk
 * @param chunk the chunk to be written
 * @param encoding the chunk encoding, currently not used
 * @param cb the streamCallback provided by the writing state
 * @param state the state which should be written to
 */
private writeState(src: Readable, chunk: Buffer, encoding: string, cb: StreamCallback, state: MergingState): void {
    this.mergeNextTick();
    const bytesAvailable = state.highWaterMark - state.size;
    if (chunk.length <= bytesAvailable) {
        // save to write to our local buffer
        state.queue.push(chunk);
        state.size += chunk.length;
        if (chunk.length === bytesAvailable) {
            // our queue is full, so store our callback
            this.stateCallbackAndSet(state, cb);
        } else {
            // we still have some space, so we can call the callback immediately
            cb();
        }
        return;
    }

    if (bytesAvailable === 0) {
        // no space available unshift entire chunk
        src.unshift(chunk);
    } else {
        state.size += bytesAvailable;
        const leftOver = Buffer.alloc(chunk.length - bytesAvailable);
        chunk.copy(leftOver, 0, bytesAvailable);
        // push amount of bytes available
        state.queue.push(chunk.slice(0, bytesAvailable));
        // unshift what we cannot fit in our queue
        src.unshift(leftOver);
    }
    this.stateCallbackAndSet(state, cb);
}

首先,我们检查有多少空间可用于缓冲。如果有足够的空间容纳全部块,我们将对其进行缓冲。如果没有可用空间,我们将缓冲区移至其可读源。如果有可用空间,我们只会取消无法容纳的空间。如果缓冲区已满,我们将存储请求新块的回调。如果有空间,我们将请求下一块。

this.mergeNextTick()之所以被调用是因为我们的状态已更改,因此应在下一个刻度中读取它:

private mergeNextTick(): void {
    if (!this.mergeSync) {
        // make sure it is only called once per tick
        // we don't want to call it multiple times
        // since there will be nothing left to read the second time
        this.mergeSync = true;
        process.nextTick(() => this._read(this.readableHighWaterMark));
    }
}

this.stateCallbackAndSet是一个帮助程序函数,它将仅调用我们的上一个回调以确保我们不会进入使流停止流动的状态。并且会提供新的。

/**
 * Helper function to call the callback if it exists and set the new callback
 * @param state the state which holds the callback
 * @param cb the new callback to be set
 */
private stateCallbackAndSet(state: MergingState, cb: StreamCallback): void {
    if (!state) {
        return;
    }
    if (state.callback) {
        const callback = state.callback;
        // do callback next tick, such that we can't get stuck in a writing loop
        process.nextTick(() => callback());
    }
    state.callback = cb;
}

阅读

现在,在阅读方面,这是我们处理选择正确流的部分。

首先我们的函数读取状态,这很简单。它读取它能够读取的字节数。它返回写入的字节数,这对于我们的其他功能很有用。

/**
 * Method to read the provided state if it can
 *
 * @param size the number of bytes to consume
 * @param state the state from which needs to be read
 * @returns the amount of bytes read
 */
private readState(size: number, state: MergingState): number {
    if (state.size === 0) {
        // our queue is empty so we read 0 bytes
        return 0;
    }
    let buffer = null;
    if (state.size < size) {
        buffer = state.queue.consume(state.size, false);
    } else {
        buffer = state.queue.consume(size, false);
    }
    this.push(buffer);
    this.stateCallbackAndSet(state, null);
    state.size -= buffer.length;
    return buffer.length;
}

doRead方法是所有合并发生的地方:它获取nextMergingIndex。如果合并索引是END,那么我们可以读取writingState直到流结束。如果我们处于合并索引,则从mergingState中读取。否则,我们将从writingState中读取大量内容,直到到达下一个合并索引。

/**
 * Method to read from the correct Queue
 *
 * The doRead method is called multiple times by the _read method until
 * it is satisfied with the returned size, or until no more bytes can be read
 *
 * @param n the number of bytes that can be read until highWaterMark is hit
 * @throws Errors when something goes wrong, so wrap this method in a try catch.
 * @returns the number of bytes read from either buffer
 */
private doRead(n: number): number {
    // first check all constants below 0,
    // which is only Merge.END right now
    const nextMergingIndex = this.getNextMergingIndex();
    if (nextMergingIndex === Merge.END) {
        // read writing state until the end
        return this.readWritingState(n);
    }
    const bytesToNextIndex = nextMergingIndex - this.index;
    if (bytesToNextIndex === 0) {
        // We are at the merging index, thus should read merging queue
        return this.readState(n, this.mergingState);
    }
    if (n <= bytesToNextIndex) {
        // We are safe to read n bytes
        return this.readWritingState(n);
    }
    // read the bytes until the next merging index
    return this.readWritingState(bytesToNextIndex);
}

readWritingState读取状态并更新索引:

/**
 * Method to read from the writing state
 *
 * @param n maximum number of bytes to be read
 * @returns number of bytes written.
 */
private readWritingState(n: number): number {
    const bytesWritten = this.readState(n, this.writingState);
    this.index += bytesWritten;
    return bytesWritten;
}

合并

要选择要合并的流,我们将使用生成器功能。生成器函数产生一个索引和要在该索引处合并的流:

export interface MergingStream { index: number; stream: Readable; }

doRead中调用getNextMergingIndex()。此函数返回下一个MergingStream的索引。如果没有下一个mergingStream,则调用生成器以获取新的mergingStream。如果没有新的合并流,我们将只返回END

/**
 * Method to get the next merging index.
 *
 * Also fetches the next merging stream if merging stream is null
 *
 * @returns the next merging index, or Merge.END if there is no new mergingStream
 * @throws Error when invalid MergingStream is returned by streamGenerator
 */
private getNextMergingIndex(): number {
    if (!this.mergingStream) {
        this.setNewMergeStream(this.streamGenerator.next().value);
        if (!this.mergingStream) {
            return Merge.END;
        }
    }
    return this.mergingStream.index;
}

setNewMergeStream中,我们创建了一个新的Writable,可以将新的合并流传送到其中。对于我们的Writable,我们将需要处理用于写入状态的write回调和用于处理最后一个块的final回调。我们也不应该忘记重置状态。

/**
 * Method to set the new merging stream
 *
 * @throws Error when mergingStream has an index less than the current index
 */
private setNewMergeStream(mergingStream?: MergingStream): void {
    if (this.mergingStream) {
        throw new Error('There already is a merging stream');
    }
    // Set a new merging stream
    this.mergingStream = mergingStream;
    if (mergingStream == null || mergingStream.index === Merge.END) {
        // set new state
        this.mergingState = newMergingState(this.writableHighWaterMark);
        // We're done, for now...
        // mergingStream will be handled further once nextMainStream() is called
        return;
    }
    if (mergingStream.index < this.index) {
        throw new Error('Cannot merge at ' + mergingStream.index + ' because current index is ' + this.index);
    }
    // Create a new writable our new mergingStream can write to
    this.mergeWriteStream = new Writable({
        // Create a write callback for our new mergingStream
        write: (chunk, encoding, cb) => this.writeMerge(mergingStream.stream, chunk, encoding, cb),
        final: (cb: StreamCallback) => {
            this.onMergeEnd(mergingStream.stream, cb);
        },
    });
    // Create a new mergingState for our new merging stream
    this.mergingState = newMergingState(this.mergeWriteStream.writableHighWaterMark);
    // Pipe our new merging stream to our sink
    mergingStream.stream.pipe(this.mergeWriteStream);
}  

完成

该过程的最后一步是处理我们的最终块。这样我们就知道何时结束合并并可以发送结束块。在我们的主读取循环中,我们首先读取,直到doRead()方法连续两次返回0或已填满我们的读取缓冲区为止。一旦发生这种情况,我们将结束读取循环,并检查状态是否完成。

public _read(size: number): void {
        if (this.finished) {
            // we've finished, there is nothing to left to read
            return;
        }
        this.mergeSync = false;
        let bytesRead = 0;
        do {
            const availableSpace = this.readableHighWaterMark - this.readableLength;
            bytesRead = 0;
            READ_LOOP: while (bytesRead < availableSpace && !this.finished) {
                try {
                    const result = this.doRead(availableSpace - bytesRead);
                    if (result === 0) {
                        // either there is nothing in our buffers
                        // or our states are outdated (since they get updated in doRead)
                        break READ_LOOP;
                    }
                    bytesRead += result;
                } catch (error) {
                    this.emit('error', error);
                    this.push(null);
                    this.finished = true;
                }
            }
        } while (bytesRead > 0 && !this.finished);
        this.handleFinished();
    }

然后在我们的handleFinished()中检查状态。

private handleFinished(): void {
    if (this.finished) {
        // merge stream has finished, so nothing to check
        return;
    }
    if (this.isStateFinished(this.mergingState)) {
        this.stateCallbackAndSet(this.mergingState, null);
        // set our mergingStream to null, to indicate we need a new one
        // which will be fetched by getNextMergingIndex()
        this.mergingStream = null;
        this.mergeNextTick();
    }
    if (this.isStateFinished(this.writingState)) {
        this.stateCallbackAndSet(this.writingState, null);
        this.handleMainFinish(); // checks if there are still mergingStreams left, and sets finished flag
        this.mergeNextTick();
    }
}

isStateFinished()检查我们的状态是否设置了终结标志,队列大小是否等于0

/**
 * Method to check if a specific state has completed
 * @param state the state to check
 * @returns true if the state has completed
 */
private isStateFinished(state: MergingState): boolean {
    if (!state || !state.finalizing || state.size > 0) {
        return false;
    }
    return true;
}

一旦我们的结束回调位于合并Writable流的最终回调中,就设置finalize标志。对于我们的主流,我们必须采取一些不同的方法,因为我们几乎无法控制流何时结束,因为默认情况下可读性调用了可写的结尾。我们想要删除此行为,以便我们可以决定何时完成流。设置其他最终侦听器时,这可能会引起一些问题,但是对于大多数用例来说,应该没问题。

 private onPipe(readable: Readable): void {
    // prevent our stream from being closed prematurely and unpipe it instead
    readable.removeAllListeners('end');  // Note: will cause issues if another end listener is set
    readable.once('end', () => {
        this.finalizeState(this.writingState);
        readable.unpipe();
    });
}

finalizeState()设置标志和回调以结束流。

/**
 * Method to put a state in finalizing mode
 *
 * Finalizing mode: the last chunk has been received, when size is 0
 * the stream should be removed.
 *
 * @param state the state which should be put in finalizing mode
 *
 */
private finalizeState(state: MergingState, cb?: StreamCallback): void {
    state.finalizing = true;
    this.stateCallbackAndSet(state, cb);
    this.mergeNextTick();
}

这就是将多个流合并到一个接收器中的方式。

TL; DR:The complete code

此代码已通过我的jest测试套件在多种边缘情况下进行了全面测试,并且比我的代码中介绍的功能更多。例如附加流并合并到该附加流中。通过提供Merge.END作为索引。

测试结果

您可以看到我在这里进行的测试,如果我忘记了任何测试,请给我发送一条消息,然后我可以为此编写另一个测试

MergeStream
    ✓ should throw an error when nextStream is not implemented (9ms)
    ✓ should throw an error when nextStream returns a stream with lower index (4ms)
    ✓ should reset index after new main stream (5ms)
    ✓ should write a single stream normally (50ms)
    ✓ should be able to merge a stream (2ms)
    ✓ should be able to append a stream on the end (1ms)
    ✓ should be able to merge large streams into a smaller stream (396ms)
    ✓ should be able to merge at the correct index (2ms)

用法

const mergingStream = new Merge({
    *nextStream(): IterableIterator<MergingStream> {
        for (let i = 0; i < 10; i++) {
            const stream = new Readable();
            stream.push(i.toString());
            stream.push(null);
            yield {index: i * 2, stream};
        }
    },
});
const template = new Readable();
template.push(', , , , , , , , , ');
template.push(null);
template.pipe(mergingStream).pipe(getSink());

我们沉沦的结果将是

0, 1, 2, 3, 4, 5, 6, 7, 8, 9

最终想法

这不是最省时的方法,因为我们一次只能管理一个合并缓冲区。因此,有很多等待。就我的用例而言,这很好。我担心它不会耗尽我的记忆,因此该解决方案对我有效。但是肯定有一些优化的空间。完整的代码具有一些此处未完全解释的额外功能,例如附加流和合并到该附加流中。尽管已经用注释解释了它们。我正在考虑将其转换为node_module,或者甚至更好地将其作为下一个Main Node版本的新功能,因为它是非常通用的构建块,具有许多不同的应用程序。我将了解其余节点社区的想法,并在我了解更多信息后更新我的评论。