Pausing a sax stream in Node.js

Date: 2017-06-27 14:57:51

Tags: javascript node.js xml stream sax

I am using sax to parse a very large XML file. I create a readStream from the XML file and pipe it into sax, like this:

this.sourceStream = fs.createReadStream(file);
this.sourceStream
    .pipe(this.saxStream);

I am listening to a few events, like this:

this.saxStream.on("error", (err) => {
    logger.error(`Error during XML Parsing`, err);
});
this.saxStream.on("opentag", (node) => {
    // doing some stuff
});
this.saxStream.on("text", (t) => {
    // doing some stuff
});
this.saxStream.on("closetag", () => {
    if( this.current_element.parent === null ) {
        this.sourceStream.pause();
        this.process_company_information(this.current_company, (err) => {
            if( err ) {
                logger.error("An error appeared while parsing company", err);
            }
            this.sourceStream.resume();
        });
    }
    else {
        this.current_element = this.current_element.parent;
    }
});
this.saxStream.on("end", () => {
    logger.info("Finished reading through stream");
});

After a specific closing tag comes through the sax stream, the stream needs to pause, the current element needs to be processed, and then the stream can continue. As you can see in my code, I try to pause the sourceStream, but I have found that pausing a readStream has no effect while it is piped.

So my general question is: how can I make the sax parser pause until the currently parsed element has been processed?

I have read about tearing down the pipe and pausing, then piping again and resuming. Is that really the way to do this, and is it reliable?

To illustrate, here are some logs:

debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: Done with root tag, can continue stream
debug: Done with root tag, can continue stream
debug: Done with root tag, can continue stream
debug: Done with root tag, can continue stream

What I actually want is a log like this:

debug: New root tag found
debug: Done with root tag, can continue stream
debug: New root tag found
debug: Done with root tag, can continue stream
debug: New root tag found
debug: Done with root tag, can continue stream
debug: New root tag found
debug: Done with root tag, can continue stream
debug: New root tag found

In its current state, sax is much faster than my processing, and not pausing the stream therefore leads to memory problems.

2 Answers:

Answer 0 (score: 1)

sax is not actively maintained at the moment (https://github.com/isaacs/sax-js/issues/238). I suggest you migrate to another parser, for example saxes: https://github.com/lddubeau/saxes

Instead of pausing/resuming the stream, you can consume the Readable as an async iterable, using a generator and the for-await-of construct (https://nodejs.org/api/stream.html#stream_consuming_readable_streams_with_async_iterators).

Install the dependencies: yarn add emittery saxes or npm install emittery saxes

Then do something like this:

import {createReadStream} from 'fs';
import {Readable} from 'stream';
import {SaxesParser, SaxesTagPlain} from 'saxes';
import Emittery from 'emittery';

export interface SaxesEvent {
  type: 'opentag' | 'text' | 'closetag' | 'end';
  tag?: SaxesTagPlain;
  text?: string;
}

/**
 * Generator method.
 * Parses one chunk of the iterable input (Readable stream in the string data reading mode).
 * @see https://nodejs.org/api/stream.html#stream_event_data
 * @param iterable Iterable or Readable stream in the string data reading mode.
 * @returns Array of SaxesParser events
 * @throws Error if a SaxesParser error event was emitted.
 */
async function *parseChunk(iterable: Iterable<string> | Readable): AsyncGenerator<SaxesEvent[], void, undefined> {
  const saxesParser = new SaxesParser<{}>();
  let error;
  saxesParser.on('error', _error => {
    error = _error;
  });

  // As a performance optimization, we gather all events instead of passing
  // them one by one, which would cause each event to go through the event queue
  let events: SaxesEvent[] = [];
  saxesParser.on('opentag', tag => {
    events.push({
      type: 'opentag',
      tag
    });
  });

  saxesParser.on('text', text => {
    events.push({
      type: 'text',
      text
    });
  });

  saxesParser.on('closetag', tag => {
    events.push({
      type: 'closetag',
      tag
    });
  });

  for await (const chunk of iterable) {
    saxesParser.write(chunk as string);
    if (error) {
      throw error;
    }

    yield events;
    events = [];
  }

  yield [{
    type: 'end'
  }];
}

const eventEmitter = new Emittery();
eventEmitter.on('text', async (text) => {
  console.log('Start');
  await new Promise<void>(async (resolve) => {
    await new Promise<void>((resolve1) => {
      console.log('First Level Promise End');
      resolve1();
    });
    console.log('Second Level Promise End');
    resolve();
  });
});

const readable = createReadStream('./some-file.xml');
// Enable string reading mode
readable.setEncoding('utf8');
// Read stream chunks (a top-level `for await` requires an ES module or async context)
for await (const saxesEvents of parseChunk(readable)) {
  // Process batch of events
  for (const saxesEvent of saxesEvents) {
    // Emit ordered events and process them in the event handlers strictly one-by-one
    // See https://github.com/sindresorhus/emittery#emitserialeventname-data
    await eventEmitter.emitSerial(saxesEvent.type, saxesEvent.tag || saxesEvent.text);
  }
}
}

Also see the main discussion of this solution: https://github.com/lddubeau/saxes/issues/32

Answer 1 (score: 0)

For anyone running into a similar problem in the future, here is the approach I ended up with, even though it is a bit of a workaround.

I tried piping a pause-stream or a pass-stream in between, since those are supposed to buffer while paused. For some reason, once again, this did not change the behaviour.

最后我决定通过它的根来解决这个问题,而不是创建一个ReadingStream并将它管道化为sax,我使用line-by-line从XML批量读取行并写入sax解析器。现在可以正确暂停此线读取过程,最终帮助我实现所需的行为