How to parse a large TSV file using Node and streams

Time: 2016-02-03 22:20:48

Tags: node.js csv stream

I retrieved an IMDB data dump (thanks to http://www.omdbapi.com/ and a small donation) as a TSV file containing 1,111,073 rows. Each row represents a film, and they look like this:

ID  imdbID  Title   Year    Rating  Runtime Genre   Released    Director    Writer  Cast    Metacritic  imdbRating  imdbVotes   Poster  Plot    FullPlot    Language    Country Awards  lastUpdated
1   tt0000001   Carmencita  1894    NOT RATED   1 min   Documentary, Short      William K.L. Dickson        Carmencita      5.8 1100    http://ia.media-imdb.com/images/M/MV5BMjAzNDEwMzk3OV5BMl5BanBnXkFtZTcwOTk4OTM5Ng@@._V1_SX300.jpg    Performing on what looks like a small wooden stage, wearing a dress with a hoop skirt and white high-heeled pumps, Carmencita does a dance with kicks and twirls, a smile always on her face.   Performing on what looks like a small wooden stage, wearing a dress with a hoop skirt and white high-heeled pumps, Carmencita does a dance with kicks and twirls, a smile always on her face.       USA     2015-12-10 01:09:33.043000000

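My goal is to visualize how film lengths have evolved over time. For that I need to create two arrays, one for the min/max values and one for the average per year (because the Highcharts chart type "Area range and line" requires that format). So I wrote a script that works fine on a small subset but, not unexpectedly, throws an error when I try to read the whole file.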
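To make the target format concrete, here is a minimal sketch of turning per-year summaries into the two arrays. It assumes the Highcharts convention of `[x, low, high]` points for the area range series and `[x, y]` points for the line series. The function name `toHighchartsSeries` and the `{ min, max, sum, count }` summary shape are hypothetical, chosen only for illustration.

```javascript
// Hypothetical helper: statsByYear is a Map of year -> { min, max, sum, count },
// where the values are runtimes in minutes.
function toHighchartsSeries(statsByYear) {
  // Sort years numerically so both series run left to right on the x axis.
  const years = [...statsByYear.keys()].sort((a, b) => a - b);

  // One [year, min, max] triple per year for the area range series.
  const ranges = years.map((year) => {
    const s = statsByYear.get(year);
    return [year, s.min, s.max];
  });

  // One [year, average] pair per year for the line series.
  const averages = years.map((year) => {
    const s = statsByYear.get(year);
    return [year, s.sum / s.count];
  });

  return { ranges, averages };
}
```

Keeping a running `sum` and `count` per year, rather than a list of every runtime, means the average can be computed at the end without holding all rows in memory.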

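I am well aware that streams should help with this, but my expertise is limited, and this small project is really meant to help me understand streams better...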

Here is the script as it currently stands:

https://gist.github.com/jfix/f79f011ce99d2049613c

If it is preferable to show the whole script inline in my question, I can obviously add it.

Here is the error that is thrown:

$ node each.js
buffer.js:382
    throw new Error('toString failed');
    ^

Error: toString failed
    at Buffer.toString (buffer.js:382:11)
    at StringDecoder.write (string_decoder.js:129:21)
    at Parser._transform (/Users/jakob/Projects/imdb-film-length/node_modules/csv-parse/lib/index.js:154:26)
    at Transform._read (_stream_transform.js:167:10)
    at Transform._write (_stream_transform.js:155:12)
    at doWrite (_stream_writable.js:292:12)
    at writeOrBuffer (_stream_writable.js:278:5)
    at Writable.write (_stream_writable.js:207:11)
    at /Users/jakob/Projects/imdb-film-length/node_modules/csv-parse/lib/index.js:46:14
    at doNTCallback0 (node.js:419:9)

Thanks for pointing me in the right direction...

1 answer:

Answer 0 (score: 0)

I tried to recreate your situation, and I get the same error just by running:

.plist

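So it seems that the csv-parse module makes the process run out of memory, because the callback API allocates one huge array holding every record. You probably need to use the stream API of the csv-parse module instead. An example is described here: http://csv.adaltas.com/parse/examples/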
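As a rough illustration of the streaming idea, here is a minimal sketch that reads the file line by line and keeps only per-year running totals. To stay dependency-free it uses Node's built-in `readline` module rather than the csv-parse stream API, so it is not the approach from the linked examples. It assumes a Node version that supports `for await` on a readline interface, that no field contains a tab or a line break, and that runtimes look like `1 min`. The function name `aggregateRuntimes` is hypothetical; the column positions are taken from the header row in the question.

```javascript
const fs = require('fs');
const readline = require('readline');

// Zero-based column positions, read off the header row shown in the question.
const YEAR_COL = 3;
const RUNTIME_COL = 5;

// Hypothetical helper: streams the TSV and returns a Map of
// year -> { min, max, sum, count } with runtimes in minutes.
async function aggregateRuntimes(filePath) {
  const rl = readline.createInterface({
    input: fs.createReadStream(filePath, { encoding: 'utf8' }),
    crlfDelay: Infinity, // treat \r\n as a single line break
  });

  const stats = new Map();
  let isHeader = true;

  // Only one line is held at a time, so memory use depends on the
  // number of distinct years, not on the size of the file.
  for await (const line of rl) {
    if (isHeader) {
      isHeader = false; // skip the column names
      continue;
    }
    const fields = line.split('\t');
    const year = parseInt(fields[YEAR_COL], 10);
    // Assumption: runtimes look like "1 min"; other shapes are skipped.
    const match = /^(\d+) min$/.exec(fields[RUNTIME_COL] || '');
    if (Number.isNaN(year) || !match) continue;

    const minutes = Number(match[1]);
    const s = stats.get(year) ||
      { min: Infinity, max: -Infinity, sum: 0, count: 0 };
    s.min = Math.min(s.min, minutes);
    s.max = Math.max(s.max, minutes);
    s.sum += minutes;
    s.count += 1;
    stats.set(year, s);
  }
  return stats;
}
```

Calling `aggregateRuntimes('movies.tsv')` (a hypothetical file name) returns a promise for the per-year Map. If the dump contains quoted fields with embedded tabs or newlines, a real parser such as csv-parse in stream mode would be the safer choice.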