I'm using sax to parse very large XML files. I create a readStream from my XML file and pipe it into sax, like this:
this.sourceStream = fs.createReadStream(file);
this.sourceStream
.pipe(this.saxStream);
I'm listening to a few events like this:
this.saxStream.on("error", (err) => {
  logger.error(`Error during XML Parsing`, err);
});

this.saxStream.on("opentag", (node) => {
  // doing some stuff
});

this.saxStream.on("text", (t) => {
  // doing some stuff
});

this.saxStream.on("closetag", () => {
  if (this.current_element.parent === null) {
    this.sourceStream.pause();
    this.process_company_information(this.current_company, (err) => {
      if (err) {
        logger.error("An error appeared while parsing company", err);
      }
      this.sourceStream.resume();
    });
  } else {
    this.current_element = this.current_element.parent;
  }
});

this.saxStream.on("end", () => {
  logger.info("Finished reading through stream");
});
Once a specific closing tag reaches the sax stream, the stream needs to pause, the current element needs to be processed, and only then may the stream continue.
As you can see in my code, I tried to pause the sourceStream, but I found that pausing a readStream has no effect while it is piped.
So my general question is: how can I make the sax parser pause until the currently parsed element has been processed?
I have read about unpiping and pausing, then piping again and resuming. Is that really the way to do it, and is it reliable?
To illustrate, here are some logs:
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: New root tag found
debug: Done with root tag, can continue stream
debug: Done with root tag, can continue stream
debug: Done with root tag, can continue stream
debug: Done with root tag, can continue stream
What I actually want is a log like this:
debug: New root tag found
debug: Done with root tag, can continue stream
debug: New root tag found
debug: Done with root tag, can continue stream
debug: New root tag found
debug: Done with root tag, can continue stream
debug: New root tag found
debug: Done with root tag, can continue stream
debug: New root tag found
In its current state, sax is much faster than my processing, and because the stream does not pause, this leads to memory problems.
Answer 0 (score: 1)
sax is currently not actively maintained (https://github.com/isaacs/sax-js/issues/238). I suggest migrating to another parser, for example saxes: https://github.com/lddubeau/saxes.
Instead of pausing/resuming the stream, you can consume the Readable with a for-await-of loop over an async Generator (https://nodejs.org/api/stream.html#stream_consuming_readable_streams_with_async_iterators).
Install the deps:
yarn add emittery saxes
or
npm install emittery saxes
Then do something like this:
import {createReadStream} from 'fs';
import {Readable} from 'stream';
import {SaxesParser, SaxesTagPlain} from 'saxes';
import Emittery from 'emittery';

export interface SaxesEvent {
  type: 'opentag' | 'text' | 'closetag' | 'end';
  tag?: SaxesTagPlain;
  text?: string;
}

/**
 * Generator method.
 * Parses one chunk of the iterable input (Readable stream in the string data
 * reading mode) per iteration and yields the collected SaxesParser events.
 * @see https://nodejs.org/api/stream.html#stream_event_data
 * @param iterable Iterable or Readable stream in the string data reading mode.
 * @returns Array of SaxesParser events per chunk
 * @throws Error if a SaxesParser error event was emitted.
 */
async function *parseChunk(iterable: Iterable<string> | Readable): AsyncGenerator<SaxesEvent[], void, undefined> {
  const saxesParser = new SaxesParser<{}>();
  let error;
  saxesParser.on('error', _error => {
    error = _error;
  });

  // As a performance optimization, we gather all events instead of passing
  // them one by one, which would cause each event to go through the event queue
  let events: SaxesEvent[] = [];
  saxesParser.on('opentag', tag => {
    events.push({
      type: 'opentag',
      tag
    });
  });
  saxesParser.on('text', text => {
    events.push({
      type: 'text',
      text
    });
  });
  saxesParser.on('closetag', tag => {
    events.push({
      type: 'closetag',
      tag
    });
  });

  for await (const chunk of iterable) {
    saxesParser.write(chunk as string);
    if (error) {
      throw error;
    }
    yield events;
    events = [];
  }

  yield [{
    type: 'end'
  }];
}
const eventEmitter = new Emittery();

eventEmitter.on('text', async (text) => {
  console.log('Start');
  await new Promise<void>(async (resolve) => {
    await new Promise<void>((resolve1) => {
      console.log('First Level Promise End');
      resolve1();
    });
    console.log('Second Level Promise End');
    resolve();
  });
});
const readable = createReadStream('./some-file.xml');
// Enable string reading mode
readable.setEncoding('utf8');

// Read stream chunks
for await (const saxesEvents of parseChunk(readable)) {
  // Process batch of events
  for (const saxesEvent of saxesEvents) {
    // Emit ordered events and process them in the event handlers strictly one-by-one
    // See https://github.com/sindresorhus/emittery#emitserialeventname-data
    await eventEmitter.emitSerial(saxesEvent.type, saxesEvent.tag || saxesEvent.text);
  }
}
Also see the main discussion of this solution: https://github.com/lddubeau/saxes/issues/32
Answer 1 (score: 0)
For anyone who runs into a similar problem in the future, this is the approach I finally went with, even though it is a bit of a workaround.
I tried piping a pausable stream in between, trying both pause-stream and pass-stream, since those are supposed to buffer while paused. For some reason this, again, did not change the behavior.
In the end I decided to tackle the problem at its root: instead of creating a ReadStream and piping it into sax, I use line-by-line to read batches of lines from the XML and write them into the sax parser. This line-reading process can be paused correctly, which finally let me achieve the desired behavior.