I've been using Node.js's built-in readline library. I'm doing batch processing over large CSV files. How do I read k lines at a time?

Here is a stupid approach:
import { createInterface } from 'readline';
import { createReadStream } from 'fs';

const processLargeFile = (rl, k, callback) => {
  const a = [];
  const drainAndProcess = () => a.splice(0, a.length); // Dummy function

  rl.on('line', line => {
    a.push(line);
    a.length + 1 > k && drainAndProcess(); // Stupid approach
  });
  rl.on('error', error => callback(error));
  rl.on('end', () => callback(void 0, 'finished'));
};
processLargeFile(createInterface({ input: createReadStream('huge_file.csv') }), 15,
  (err, msg) => { if (err != null) throw err; console.info(msg); });
What's a better way? Should I use %, emit an event, and keep a count? Or something else entirely?
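For reference, the counter idea floated above could look roughly like this. This is a sketch, not from the original post: readInBatches, onBatch, and the flush-on-close behaviour are all illustrative assumptions.

const { createInterface } = require('readline');
const { createReadStream } = require('fs');

// Collect lines and hand off every k of them to a caller-supplied function.
const readInBatches = (path, k, onBatch) => {
  const batch = [];
  const rl = createInterface({ input: createReadStream(path) });
  rl.on('line', line => {
    batch.push(line);
    if (batch.length % k === 0) onBatch(batch.splice(0, batch.length));
  });
  // readline emits 'close' (not 'end') when the input is exhausted,
  // which is also the moment to flush any leftover lines.
  rl.on('close', () => batch.length && onBatch(batch.splice(0, batch.length)));
};

readInBatches('huge_file.csv', 15, rows => console.info('got', rows.length, 'lines'));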
Answer 0 (score: 0)
Not happy with this solution, but it does work:
First, I was worried that the callback could be called multiple times:

import { createInterface } from 'readline';
import { createReadStream } from 'fs';

const raise = throwable => { throw throwable; };
const processTsv = (fname, callback) => {
  const processTsvLargeFile = (rl, k, callb) => {
    const a = [];
    let header;
    // Log the buffered rows as header-keyed objects, then empty the buffer
    const drainAndProcess = cb => console.info(
      a.map(row => row.reduce((o, v, i) =>
        Object.assign(o, { [header[i]]: v }), {})),
      '\nlen =', a.length) || (a.splice(0, a.length) && cb());

    rl.on('line', line => {
      typeof header === 'undefined' ? header = line.split('\t')
                                    : a.push(line.split('\t'));
      a.length + 1 > k && drainAndProcess(er => er != null && rl.emit('error', er));
    });
    rl.on('error', error => callb(error));
    // readline emits 'close', not 'end'; drain whatever is left at that point
    rl.on('close', () => drainAndProcess(err => err != null ? callb(err)
                                                            : callb(void 0, 'finished')));
  };
  processTsvLargeFile(createInterface({ input: createReadStream(fname) }), 50,
    (err, msg) => err == null ? callback(void 0, msg) : raise(err));
};
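Called like so (the file name is a placeholder, not from the original):

processTsv('huge_file.tsv', (err, msg) => {
  if (err != null) throw err;
  console.info(msg); // logs 'finished' once the whole file has been drained
});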
Answer 1 (score: 0)
Apart from what I mentioned in the comments, that race conditions are not the problem you should be worrying about here, I found a few errors in your code while testing your approach. (Also, I had to change the imports to require/const for my environment, but depending on yours, you may not have to.)

You have a callback that only logs whether the stream "finished" or errored; why not use that same callback to receive the k lines you've read? It's not a complicated change.
The numbers correspond to the places in the code that need changing. The readline interface passed to processLargeFile never emits an 'end' event, so I changed it to 'close' and included another call to drainAndProcess there, so that the remaining lines also get processed. Hope this helps.
const { createInterface } = require('readline'),
      { createReadStream } = require('fs');

const processLargeFile = (rl, k, callback) => {
  const a = [];
  const drainAndProcess = () => {
    callback(null, a.splice(0, a.length)); // 1
  };

  rl.on('line', line => {
    a.push(line);
    a.length == k && drainAndProcess();
  });
  rl.on('error', error => callback(error, null));
  rl.on('close', () => drainAndProcess()); // 2
};

processLargeFile(createInterface({
    input: createReadStream('huge_file.csv')
  }),
  15,
  (err, result) => console.info(err || result) // 3
);
The benefit of this is that processLargeFile can now take any callback as a parameter and will send every k lines to said callback.
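To illustrate that point, here is a sketch of a caller that does something with each batch; saveBatch is a made-up stand-in, not part of the answer:

// Hypothetical consumer: pretend each batch of up to k lines gets persisted.
const saveBatch = rows => console.info(`persisting ${rows.length} rows`);

processLargeFile(
  createInterface({ input: createReadStream('huge_file.csv') }),
  15,
  (err, rows) => err != null ? console.error(err) : saveBatch(rows)
);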
Answer 2 (score: 0)
Another option, this time using fs.createReadStream, through2-concurrent, and stream.Transform:
import { Transform, TransformOptions } from 'stream';
import { createReadStream } from 'fs';
import * as through2Concurrent from 'through2-concurrent';
import { map } from 'async';

class ParseHandleTsvTransform extends Transform {
  private header: string[];
  private iterator: (parsed_rows: Array<{}>, cb: (err: Error) => void) => void;

  constructor(options: TransformOptions,
              iterator: (parsed_rows: Array<{}>, cb: (err: Error) => void) => void) {
    super(options);
    this.iterator = iterator;
  }

  public _transform(chunk: Buffer, encoding: string,
                    callb: (err?: Error, res?: Buffer | string) => void) {
    if (!Buffer.isBuffer(chunk))
      return callb(new TypeError(`Expected buffer got: ${typeof chunk}`));
    const rows: string[][] = chunk.toString('utf-8').split('\n').map(
      row => row.split('\t'));
    if (typeof this.header === 'undefined') {
      this.header = rows.shift();
      return callb();
    } else {
      // Turn each row into an object keyed by the header columns
      const parsed_rows: Array<{}> = rows.map(row =>
        row.reduce((o, v, i) => Object.assign(o, { [this.header[i]]: v }), {}));
      map(parsed_rows, this.iterator, (e: Error) => callb(e));
      // this.iterator(parsed_rows, (e: Error) => callb(e));
    }
  }
}
Usage:
const processTsv = (fname, callback) =>
  createReadStream(fname)
    .pipe(new ParseHandleTsvTransform({}, asyncDrainAndProcess))
    .pipe(through2Concurrent.obj(
      { maxConcurrency: 10 },
      (chunk, enc, callb) => callb()))
    /* alt: call over here ^, the non-parsing (processing) func:
       `asyncDrainAndProcess`,
       potentially using the full pattern with `.on('data')` */
    .on('end', () => callback());
Or, fleshing out the simpler solution:
const processTsv = (fname: string, callback: (e?: Error, r?: string) => void) => {
  let header: string[]; // declared here so it persists across chunks

  return createReadStream(fname)
    .pipe(through2Concurrent.obj(
      { maxConcurrency: 10 },
      (chunk: Buffer, enc: string, callb: (error?: Error) => void) => {
        if (!Buffer.isBuffer(chunk)) return callb(
          new TypeError(`Expected buffer got: ${typeof chunk}`));
        const rows: string[][] = chunk.toString('utf-8').split('\n').map(
          row => row.split('\t'));
        if (typeof header === 'undefined')
          header = rows.shift();
        const parsed_rows: Array<{}> = rows.map(row =>
          row.reduce((o, v, i) => Object.assign(o, { [header[i]]: v }), {}));
        map(parsed_rows, asyncDrainAndProcess, (e: Error) => callb(e));
      }))
    .on('end', () => callback());
};
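Neither snippet defines asyncDrainAndProcess. A minimal stand-in, purely for illustration, matching the (item, callback) signature async's map expects:

// Hypothetical per-row handler: pretend to persist one parsed row, then signal done.
const asyncDrainAndProcess = (row: {}, cb: (err?: Error) => void) => {
  console.info('processing', row);
  setImmediate(cb); // defer so a huge batch doesn't starve the event loop
};

processTsv('huge_file.tsv',
  err => err != null ? console.error(err) : console.info('finished'));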