node.js readline: read k lines at a time?

Date: 2017-07-12 15:09:41

Tags: javascript node.js asynchronous readline large-files

I've been using Node.js's built-in readline library to batch-process large CSV files. How do I read k lines at a time?

Here's a naive approach:

import { createInterface } from 'readline';
import { createReadStream } from 'fs';

const processLargeFile = (rl, k, callback) => {
    const a = [];
    const drainAndProcess = () => a.splice(0, a.length); // Dummy function

    rl.on('line', line => {
         a.push(line);
         a.length + 1 > k && drainAndProcess();  // Stupid approach
    });

    rl.on('error', error => callback(error));
    rl.on('end', () => callback(void 0, 'finished'));
};

processLargeFile(createInterface({input: createReadStream('huge_file.csv')}), 15,
                 (err, msg) => { if (err != null) throw err; console.info(msg); });

What's a better way to do this? Should I keep a count and use % to emit an event every k lines? Or something else?
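By the % idea I mean something like this rough, untested sketch (all names are placeholders): keep a counter and emit a custom 'batch' event every k lines:

import { createInterface } from 'readline';
import { createReadStream } from 'fs';

const rl = createInterface({ input: createReadStream('huge_file.csv') });
const k = 15;
const buffered = [];
let count = 0;

rl.on('line', line => {
    buffered.push(line);
    count += 1;
    if (count % k === 0)
        rl.emit('batch', buffered.splice(0, buffered.length)); // every k lines
});

// 'close' fires at EOF; flush whatever is left over.
rl.on('close', () => buffered.length > 0 && rl.emit('batch', buffered));

rl.on('batch', rows => console.info('got', rows.length, 'rows'));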

3 Answers:

Answer 0 (score: 0)

Not happy with this solution, but it does work:


First, I was worried that the callback could be called multiple times:

const raise = throwable => { throw throwable; };

const processTsv = (fname, callback) => {
    const processTsvLargeFile = (rl, k, callb) => {
        const a = [];
        let header;
        // Log the current batch as objects keyed by the header row, then drain it.
        const drainAndProcess = cb =>
            console.info(
                a.map(row => row.reduce((o, v, i) =>
                    Object.assign(o, { [header[i]]: v }), {})),
                '\nlen =', a.length)
            || a.splice(0, a.length) && cb();

        rl.on('line', line => {
            typeof header === 'undefined'
                ? header = line.split('\t')
                : a.push(line.split('\t'));
            a.length + 1 > k && drainAndProcess(er => er != null && rl.emit('error', er));
        });

        rl.on('error', error => callb(error));
        // readline emits 'close' (not 'end'); flush the final partial batch there.
        rl.on('close', () => drainAndProcess(err =>
            err != null ? callb(err) : callb(void 0, 'finished')));
    };

    processTsvLargeFile(createInterface({ input: createReadStream(fname) }), 50,
        (err, msg) => err == null ? callback(void 0, msg) : raise(err));
};
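A hypothetical call site (the file name is a placeholder):

processTsv('huge_file.tsv', (err, msg) => {
    if (err != null) throw err;
    console.info(msg); // 'finished'
});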

Answer 1 (score: 0)

Aside from what I mentioned in the comments about race conditions not being the problem you should be worried about: when testing your approach I found a few errors in your code.
(Also, I had to change the imports to const/require form for my environment; depending on yours, you may not need to.)

You already have a callback that reports whether the stream 'finished' or hit an error, so why not use that same callback to deliver each batch of k lines you've read? It's not a complicated change.

The numbers below correspond to the places in the code that needed to change.

  1. Added a call to the callback function that gets passed into processLargeFile.
  2. I couldn't find the 'end' event listed in the readline documentation, so I changed it to 'close' and added another call to drainAndProcess there so the remaining lines get processed as well.
  3. Your callback previously took an error and a message as arguments; it now takes an error and a result.

Hope this helps.

    const 
        { createInterface }         = require('readline'),
        { createReadStream }        = require('fs')
    ;
    
    const processLargeFile = (rl, k, callback) => {
        const a = [];
        const drainAndProcess = () => {
            callback(null, a.splice(0, a.length)); // 1
        };
    
        rl.on('line', line => {
            a.push(line);
            a.length == k && drainAndProcess();  
        });
    
        rl.on('error', error => callback(error, null));
        rl.on('close', () => drainAndProcess()); // 2
    };
    
    processLargeFile(createInterface({
                input: createReadStream('huge_file.csv')
            }), 
            15,
            (err, result) => console.info(err || result) // 3
        );
    

The benefit of this is that processLargeFile can now take any callback as an argument and send each batch of k lines to that callback.
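For example, with a hypothetical consumer (handleBatch is a made-up name standing in for your real processing step):

    // Hypothetical batch consumer; the body is a stand-in for real work.
    const handleBatch = (err, rows) => {
        if (err != null) return console.error(err);
        console.info(`processing ${rows.length} rows`);
        // e.g. parse each CSV row here, bulk-insert into a database, etc.
    };

    processLargeFile(createInterface({ input: createReadStream('huge_file.csv') }),
                     15, handleBatch);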

Answer 2 (score: 0)

Another option, this time using fs.createReadStream, through2-concurrent, and stream.Transform:

import { Transform, TransformOptions } from 'stream';
import { createReadStream } from 'fs';
import * as through2Concurrent from 'through2-concurrent';
import { map } from 'async';

class ParseHandleTsvTransform extends Transform {
    private header: string[];
    private lengths: number[] = [];
    private iterator: (parsed_rows: Array<{}>, cb: (err: Error) => void) => void;

    constructor(options: TransformOptions, iterator: (parsed_rows: Array<{}>,
                cb: (err: Error) => void) => void) {
        super(options);
        this.iterator = iterator;
    }

    public _transform(chunk: Buffer, encoding: string,
                      callb: (err?: Error, res?: Buffer | string) => void) {
        if (!Buffer.isBuffer(chunk))
            return callb(new TypeError(`Expected buffer got: ${typeof chunk}`));
        const rows: string[][] = chunk.toString('utf-8').split('\n').map(
                                                           row => row.split('\t'));
        // The first row of the first chunk is the header; any remaining rows
        // in that chunk (and all later chunks) are parsed into objects.
        if (typeof this.header === 'undefined')
            this.header = rows.shift();
        const parsed_rows: Array<{}> = rows.map(row =>
            row.reduce((o, v, i) => Object.assign(o, {[this.header[i]]: v}), {}));
        map(parsed_rows, this.iterator, (e: Error) => callb(e));
        // this.iterator(parsed_rows, (e: Error) => callb(e));
    }
}

Usage:

const processTsv = (fname, callback) =>
    createReadStream(fname)
        .pipe(new ParseHandleTsvTransform({}, asyncDrainAndProcess))
        .pipe(through2Concurrent.obj(
            { maxConcurrency: 10 },
            (chunk, enc, callb) => callb()))
        /* alt: call over here ^, the non-parsing (processing) func: 
           `asyncDrainAndProcess`,
           potentially using the full pattern with `.on('data')` */
        .on('end', () => callback());
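
Note that asyncDrainAndProcess is never defined in this answer; a minimal placeholder (the body is an assumption) might look like:

// Hypothetical stand-in: via async's map it is invoked once per parsed row.
const asyncDrainAndProcess = (parsed_row, cb) => {
    console.info(parsed_row); // replace with the real per-row processing
    cb();
};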

Or, fleshing out the simpler solution:

let header: string[]; // shared across chunks; set from the file's first row

const processTsv = (fname: string, callback: (e?: Error, r?: string) => void) =>
    createReadStream(fname)
        .pipe(through2Concurrent.obj(
            { maxConcurrency: 10 },
            (chunk: Buffer, enc: string, callb: (error?: Error) => void) => {
                if (!Buffer.isBuffer(chunk)) return callb(
                    new TypeError(`Expected buffer got: ${typeof chunk}`));

                const rows: string[][] = chunk.toString('utf-8').split('\n').map(
                                                           row => row.split('\t'));
                if (typeof header === 'undefined')
                    header = rows.shift();

                const parsed_rows: Array<{}> = rows.map(row =>
                  row.reduce((o, v, i) => Object.assign(o, {[header[i]]: v}), {}));

                map(parsed_rows, asyncDrainAndProcess, (e: Error, r) => callb(e));
            }))
        .on('end', () => callback());
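
Either variant would then presumably be invoked along these lines (the file name is a placeholder):

processTsv('huge_file.tsv', err => {
    if (err != null) throw err;
    console.info('finished');
});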