I've been using Node.js's built-in readline library. I'm doing batch processing over large CSV files. How do I read k lines at a time?

Here is a stupid approach:
import { createInterface } from 'readline';
import { createReadStream } from 'fs';

const processLargeFile = (rl, k, callback) => {
  const a = [];
  const drainAndProcess = () => a.splice(0, a.length); // Dummy function

  rl.on('line', line => {
    a.push(line);
    a.length + 1 > k && drainAndProcess(); // Stupid approach
  });
  rl.on('error', error => callback(error));
  rl.on('end', () => callback(void 0, 'finished'));
};
processLargeFile(createInterface({ input: createReadStream('huge_file.csv') }), 15,
  (err, msg) => { if (err != null) throw err; console.info(msg); });
What's a better way? Should I use %, emit an event, and keep a count? Or something else entirely?
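For reference, the counter idea floated above could look roughly like this. This is a sketch, not from the original post: readInBatches, onBatch, and the flush-on-close behaviour are all illustrative assumptions.

const { createInterface } = require('readline');
const { createReadStream } = require('fs');

// Collect lines and hand off every k of them to a caller-supplied function.
const readInBatches = (path, k, onBatch) => {
  const batch = [];
  const rl = createInterface({ input: createReadStream(path) });
  rl.on('line', line => {
    batch.push(line);
    if (batch.length % k === 0) onBatch(batch.splice(0, batch.length));
  });
  // readline emits 'close' (not 'end') when the input is exhausted,
  // which is also the moment to flush any leftover lines.
  rl.on('close', () => batch.length && onBatch(batch.splice(0, batch.length)));
};

readInBatches('huge_file.csv', 15, rows => console.info('got', rows.length, 'lines'));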
Answer 0 (score: 0)
Not happy with this solution, but it does work:
First, I was worried that the callback could be called multiple times:

import { createInterface } from 'readline';
import { createReadStream } from 'fs';

const raise = throwable => { throw throwable; };
const processTsv = (fname, callback) => {
  const processTsvLargeFile = (rl, k, callb) => {
    const a = [];
    let header;
    // Log the buffered rows as header-keyed objects, then empty the buffer
    const drainAndProcess = cb => console.info(
      a.map(row => row.reduce((o, v, i) =>
        Object.assign(o, { [header[i]]: v }), {})),
      '\nlen =', a.length) || (a.splice(0, a.length) && cb());

    rl.on('line', line => {
      typeof header === 'undefined' ? header = line.split('\t')
                                    : a.push(line.split('\t'));
      a.length + 1 > k && drainAndProcess(er => er != null && rl.emit('error', er));
    });
    rl.on('error', error => callb(error));
    // readline emits 'close', not 'end'; drain whatever is left at that point
    rl.on('close', () => drainAndProcess(err => err != null ? callb(err)
                                                            : callb(void 0, 'finished')));
  };
  processTsvLargeFile(createInterface({ input: createReadStream(fname) }), 50,
    (err, msg) => err == null ? callback(void 0, msg) : raise(err));
};
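Called like so (the file name is a placeholder, not from the original):

processTsv('huge_file.tsv', (err, msg) => {
  if (err != null) throw err;
  console.info(msg); // logs 'finished' once the whole file has been drained
});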
Answer 1 (score: 0)
Apart from what I mentioned in the comments, that race conditions are not the problem you should be worrying about here, I found a few errors in your code while testing your approach. (Also, I had to change the imports to require/const for my environment, but depending on yours, you may not have to.)

You have a callback that only logs whether the stream "finished" or errored; why not use that same callback to receive the k lines you've read? It's not a complicated change.
The numbers correspond to the places in the code that need changing. The readline interface passed to processLargeFile never emits an 'end' event, so I changed it to 'close' and included another call to drainAndProcess there, so that the remaining lines also get processed. Hope this helps.
const { createInterface } = require('readline'),
      { createReadStream } = require('fs');

const processLargeFile = (rl, k, callback) => {
  const a = [];
  const drainAndProcess = () => {
    callback(null, a.splice(0, a.length)); // 1
  };

  rl.on('line', line => {
    a.push(line);
    a.length == k && drainAndProcess();
  });
  rl.on('error', error => callback(error, null));
  rl.on('close', () => drainAndProcess()); // 2
};

processLargeFile(createInterface({
    input: createReadStream('huge_file.csv')
  }),
  15,
  (err, result) => console.info(err || result) // 3
);
The benefit of this is that processLargeFile can now take any callback as a parameter and will send every k lines to said callback.
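To illustrate that point, here is a sketch of a caller that does something with each batch; saveBatch is a made-up stand-in, not part of the answer:

// Hypothetical consumer: pretend each batch of up to k lines gets persisted.
const saveBatch = rows => console.info(`persisting ${rows.length} rows`);

processLargeFile(
  createInterface({ input: createReadStream('huge_file.csv') }),
  15,
  (err, rows) => err != null ? console.error(err) : saveBatch(rows)
);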
Answer 2 (score: 0)
Another option, this time using fs.createReadStream, through2-concurrent, and stream.Transform:
import { Transform, TransformOptions } from 'stream';
import { createReadStream } from 'fs';
import * as through2Concurrent from 'through2-concurrent';
import { map } from 'async';

class ParseHandleTsvTransform extends Transform {
  private header: string[];
  private iterator: (parsed_rows: Array<{}>, cb: (err: Error) => void) => void;

  constructor(options: TransformOptions,
              iterator: (parsed_rows: Array<{}>, cb: (err: Error) => void) => void) {
    super(options);
    this.iterator = iterator;
  }

  public _transform(chunk: Buffer, encoding: string,
                    callb: (err?: Error, res?: Buffer | string) => void) {
    if (!Buffer.isBuffer(chunk))
      return callb(new TypeError(`Expected buffer got: ${typeof chunk}`));
    const rows: string[][] = chunk.toString('utf-8').split('\n').map(
      row => row.split('\t'));
    if (typeof this.header === 'undefined') {
      this.header = rows.shift();
      return callb();
    } else {
      // Turn each row into an object keyed by the header columns
      const parsed_rows: Array<{}> = rows.map(row =>
        row.reduce((o, v, i) => Object.assign(o, { [this.header[i]]: v }), {}));
      map(parsed_rows, this.iterator, (e: Error) => callb(e));
      // this.iterator(parsed_rows, (e: Error) => callb(e));
    }
  }
}
Usage:
const processTsv = (fname, callback) =>
  createReadStream(fname)
    .pipe(new ParseHandleTsvTransform({}, asyncDrainAndProcess))
    .pipe(through2Concurrent.obj(
      { maxConcurrency: 10 },
      (chunk, enc, callb) => callb()))
    /* alt: call over here ^, the non-parsing (processing) func:
       `asyncDrainAndProcess`,
       potentially using the full pattern with `.on('data')` */
    .on('end', () => callback());
Or, fleshing out the simpler solution:
const processTsv = (fname: string, callback: (e?: Error, r?: string) => void) => {
  let header: string[]; // declared here so it persists across chunks

  return createReadStream(fname)
    .pipe(through2Concurrent.obj(
      { maxConcurrency: 10 },
      (chunk: Buffer, enc: string, callb: (error?: Error) => void) => {
        if (!Buffer.isBuffer(chunk)) return callb(
          new TypeError(`Expected buffer got: ${typeof chunk}`));
        const rows: string[][] = chunk.toString('utf-8').split('\n').map(
          row => row.split('\t'));
        if (typeof header === 'undefined')
          header = rows.shift();
        const parsed_rows: Array<{}> = rows.map(row =>
          row.reduce((o, v, i) => Object.assign(o, { [header[i]]: v }), {}));
        map(parsed_rows, asyncDrainAndProcess, (e: Error) => callb(e));
      }))
    .on('end', () => callback());
};
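Neither snippet defines asyncDrainAndProcess. A minimal stand-in, purely for illustration, matching the (item, callback) signature async's map expects:

// Hypothetical per-row handler: pretend to persist one parsed row, then signal done.
const asyncDrainAndProcess = (row: {}, cb: (err?: Error) => void) => {
  console.info('processing', row);
  setImmediate(cb); // defer so a huge batch doesn't starve the event loop
};

processTsv('huge_file.tsv',
  err => err != null ? console.error(err) : console.info('finished'));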