How to load a very large CSV file in Node.js?

Asked: 2018-05-22 13:47:18

Tags: node.js csv

I'm trying to load two large CSVs into Node.js, the first 257,597 KB and the second 104,330 KB. I'm using the filesystem (fs) and csv modules. Here is my code:

const fs = require('fs')
const csv = require('csv')

let myData

// fs.readFile buffers the entire file in memory before parsing
fs.readFile('path/to/my/file.csv', (err, data) => {
  if (err) console.error(err)
  else {
    csv.parse(data, (err, dataParsed) => {
      if (err) console.error(err)
      else {
        myData = dataParsed
        console.log('csv loaded')
      }
    })
  }
})

After ages (1-2 hours), it just crashes with this error message:

<--- Last few GCs --->

[1472:0000000000466170]  4366473 ms: Mark-sweep 3935.2 (4007.3) -> 3935.2 (4007.3) MB, 5584.4 / 0.0 ms  last resort GC in old space requested
[1472:0000000000466170]  4371668 ms: Mark-sweep 3935.2 (4007.3) -> 3935.2 (4007.3) MB, 5194.3 / 0.0 ms  last resort GC in old space requested


<--- JS stacktrace --->

==== JS stack trace =========================================

Security context: 000002BDF12254D9 <JSObject>
    1: stringSlice(aka stringSlice) [buffer.js:590] [bytecode=000000810336DC91 offset=94](this=000003512FC822D1 <undefined>,buf=0000007C81D768B9 <Uint8Array map = 00000352A16C4D01>,encoding=000002BDF1235F21 <String[4]: utf8>,start=0,end=263778854)
    2: toString [buffer.js:664] [bytecode=000000810336D8D9 offset=148](this=0000007C81D768B9 <Uint8Array map = 00000352A16C4D01>,encoding=000002BDF1...

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
 1: node::DecodeWrite
 2: node_module_register
 3: v8::internal::FatalProcessOutOfMemory
 4: v8::internal::FatalProcessOutOfMemory
 5: v8::internal::Factory::NewRawTwoByteString
 6: v8::internal::Factory::NewStringFromUtf8
 7: v8::String::NewFromUtf8
 8: std::vector<v8::CpuProfileDeoptFrame,std::allocator<v8::CpuProfileDeoptFrame> >::vector<v8::CpuProfileDeoptFrame,std::allocator<v8::CpuProfileDeoptFrame> >
 9: v8::internal::wasm::SignatureMap::Find
10: v8::internal::Builtins::CallableFor
11: v8::internal::Builtins::CallableFor
12: v8::internal::Builtins::CallableFor
13: 00000081634043C1

The bigger file loads, but Node runs out of memory on the other one. Allocating more memory would probably be easy, but the real problem here is the loading time, which seems very long even given the file sizes. So what is the correct way to do this? By the way, Python loads these CSVs very quickly with pandas (3-5 seconds).

3 Answers:

Answer 0 (score: 5)

fs.readFile loads the entire file into memory, whereas fs.createReadStream reads the file in chunks of whatever size you specify.

That way it won't run out of memory.
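
For instance, a minimal sketch of chunked reading with fs.createReadStream (the 64 KB highWaterMark is an arbitrary illustrative value, and the file path is taken from the question):

const fs = require('fs')

// Read the file in 64 KB chunks instead of buffering it whole
const stream = fs.createReadStream('path/to/my/file.csv', { highWaterMark: 64 * 1024 })

stream.on('data', (chunk) => {
  // chunk is a Buffer of at most 64 KB; feed it to a parser here
})
stream.on('end', () => console.log('file read'))
stream.on('error', (err) => console.error(err))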

Answer 1 (score: 3)

You probably want to stream the CSV instead of reading everything at once:
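
A sketch of what that could look like with the csv package's stream API (csv.parse() called without arguments returns a transform stream that emits one parsed record per 'data' event; the file path is taken from the question):

const fs = require('fs')
const csv = require('csv')

const myData = []

fs.createReadStream('path/to/my/file.csv')
  .pipe(csv.parse())
  .on('data', (row) => {
    myData.push(row) // one parsed record (array of fields) at a time
  })
  .on('end', () => console.log('csv loaded:', myData.length, 'rows'))
  .on('error', (err) => console.error(err))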

Answer 2 (score: 3)

Streaming works perfectly; it took only 3-5 seconds:

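A minimal sketch of such a streaming load, assuming the csv-parser package (npm install csv-parser); the package choice and the console.time label are illustrative, not necessarily what this answer originally used:

const fs = require('fs')
const csv = require('csv-parser')

const rows = []
console.time('csv load')

fs.createReadStream('path/to/my/file.csv')
  .pipe(csv())
  .on('data', (row) => rows.push(row)) // row is an object keyed by the header line
  .on('end', () => {
    console.timeEnd('csv load')
    console.log(rows.length, 'rows loaded')
  })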