Question

I have a very large dataset that I want to save in couchdb so that it can be searched.

I want the records to look like this:

{
  "type": "first",
  "name": "ryan",
  "count": 447980
}

Since the text file is larger than I should hold in memory, I am setting up a streaming line reader, like so:

var db = require('./db'),
    readline = require('readline'),
    path = require('path'),
    fs = require('fs');

// simple callback after cradle save
function saveHandler(er, doc){
    if (er) return console.log('Error: ', er);
    console.log(doc);
}

// returns a 'line' handler that saves a record of the given type,
// built from a line holding a count and a name
function handleCountedLine(type){
    return function(line){
        var record = {type:type};
        var i = line.trim().split(' ');
        record.name = i[1].trim();
        record.count = Number(i[0]);
        db.save(record, saveHandler);
    }
}

var handleFirst = handleCountedLine('first');
readline.createInterface({
    input: fs.createReadStream('data/facebook-firstnames-withcount.txt'),
    terminal: false
})
.on('line', handleFirst);

db is a cradle db.

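After about 40 records, it slows to a total crawl, then eventually runs out of memory. I tried poolr and node-rate-limiter, using "only run this many at a time" and "only allow this many to run in a minute" strategies. Both work a little better, but it still runs out of memory. Is there a good way to accomplish this goal, or am I stuck writing it in python?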

Answer 1

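With awesome help from Paulo Machado in a Google hangout, I made an answer using line-by-line, which is a simple wrapper that uses stream.pause() & stream.resume() to only allow a single line to be processed at a time. I'd like to give him the credit, but he hasn't come over here to post an answer, so I'll just put this here. It has parsed 34039 records so far. I will update the answer if it crashes.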

var LineByLineReader = require('line-by-line'),
  path = require('path'),
  db = require('./db');

// line-by-line read file, turn into a couch record
function processFile(type){
  var fname = path.join('data', types[type] + '.txt');
  var lr = new LineByLineReader(fname, {skipEmptyLines: true});

  lr.on('error', function (err) {
    console.log('Error:');
    console.log(err);
  });

  lr.on('record', function (record) {
    console.log('Saved:');
    console.log(record);
  });

  lr.on('line', function (line) {
    lr.pause();
    var record = { type: type };

    if (type == 'full'){
      record.name = line.trim().split(' ');
    }else{
      var i = line.trim().split(' ');
      record.name = i[1].trim();
      record.count = Number(i[0]);
    }

    db.save(record, function(er, res){
      if (er) lr.emit('error', er, record);
      if (res) lr.emit('record', record);
      lr.resume();
    })
  });
}

var types = {
  'first':'facebook-firstnames-withcount',
  'last':'facebook-lastnames-withcount',
  'full':'facebook-names-unique'
};

// declare the loop variable so it does not leak as an implicit global
for (var type in types){
  processFile(type);
}

// views for looking things up
db.save('_design/views', require('./views'));
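
One thing to watch in the parsing above: line.trim().split(' ') only works if the count and the name are separated by exactly one space. Here is a minimal sketch of a more forgiving parser, assuming each line looks like "447980 ryan" (count first, then name, as the indexing above implies). parseCountedLine is a hypothetical helper name, not something from line-by-line or cradle.

```javascript
// Hypothetical helper: turn a "count name" line into a record,
// or return null when the line does not match that shape.
function parseCountedLine(type, line) {
  // split on any run of whitespace, not just a single space
  var parts = line.trim().split(/\s+/);
  var count = Number(parts[0]);

  // need at least a count and a name, and the count must be numeric
  if (parts.length < 2 || !isFinite(count)) return null;

  return {
    type: type,
    // keep everything after the count as the name
    name: parts.slice(1).join(' '),
    count: count
  };
}
```

Inside the 'line' handler you could skip a line when this returns null instead of saving a half-filled record.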

Answer 2

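I guess that couchdb is the bottleneck here. Have a look at couchdb's bulk doc api, which allows you to insert documents en masse. (You should probably not try to commit all your data at once; instead, accumulate a bunch of docs in an array and push that to the database, using stream.pause() and stream.resume() to throttle the text stream.) You will be rewarded with efficiency gains by couchdb if you use the bulk api.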
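
A rough sketch of that accumulate-and-push idea, not tested against a real couchdb. createBatcher and saveBulk are hypothetical names: saveBulk(docs, callback) stands in for whatever bulk call your client offers, and stream is anything with pause() and resume(), such as the line reader.

```javascript
// Hypothetical batcher: collects docs and pushes them in groups of
// `batchSize`, pausing the stream while a bulk save is in flight.
function createBatcher(saveBulk, batchSize, stream) {
  var pending = [];   // docs waiting for the next bulk save
  var inFlight = 0;   // bulk saves that have not called back yet

  function flush(done) {
    if (pending.length === 0) {
      if (done) done(null);
      return;
    }
    var docs = pending;
    pending = [];
    inFlight++;
    stream.pause();   // stop reading while the database catches up
    saveBulk(docs, function (err) {
      inFlight--;
      // only resume once every outstanding save has finished
      if (inFlight === 0) stream.resume();
      if (done) done(err || null);
    });
  }

  return {
    // queue one doc; triggers a bulk save when the batch is full
    add: function (doc, done) {
      pending.push(doc);
      if (pending.length >= batchSize) flush(done);
      else if (done) done(null);
    },
    // push whatever is left, e.g. from the reader's 'end' event
    flush: flush
  };
}
```

To wire it up, call add(record) from the 'line' handler and flush() when the reader ends, so the last partial batch is not lost. As far as I know cradle does a bulk insert when db.save is given an array of docs, so saveBulk could be function (docs, cb) { db.save(docs, cb); }, but check the cradle README before relying on that.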

Saving a lot of records to couchdb in nodejs

2 answers: