Question

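I have a very large text file with more than 15 million records. It is a vocabulary file that lists each word on the left and its number of occurrences on the right. The end result I am after is to filter this list so that every word with 200 or fewer occurrences is classified as a difficult word.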

For example, this is the kind of content the text file contains:

of  12765289150
and 12522922536

The word on the left and the occurrence count on the right are separated by a tab.
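To make the format concrete, here is a minimal sketch of how one such line could be split and checked against the threshold. The names `parseLine`, `isInfrequent`, and `MAX_FREQ` are hypothetical and used only for illustration; the sketch assumes one `word<TAB>count` pair per line and treats blank or malformed lines as skippable.

```javascript
// Threshold from the question: 200 occurrences or fewer counts as "difficult".
const MAX_FREQ = 200;

// Hypothetical helper: split one "word<TAB>count" line into its two fields.
// Returns null for blank or malformed lines so the caller can skip them.
function parseLine(line) {
  const tab = line.indexOf('\t');
  if (tab === -1) return null; // no tab: blank or malformed line
  const word = line.slice(0, tab);
  const raw = line.slice(tab + 1).trim(); // trim also drops a trailing "\r"
  if (word === '' || raw === '') return null;
  const freq = Number(raw);
  if (!Number.isFinite(freq)) return null; // count is not a number
  return { word, freq };
}

// Hypothetical helper: true when a parsed entry is at or below the threshold.
function isInfrequent(entry) {
  return entry !== null && entry.freq <= MAX_FREQ;
}
```

With these helpers, `isInfrequent(parseLine('of\t12765289150'))` is `false`, since the count is far above 200.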

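Currently, my code returns the words that occur <= 200 times, which is great, but it takes more than 15 seconds to read the file and return these values. How can I speed this up?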

This is the code I am running:

exports.readText = (req, res, next) => {
  const fs = require('fs'),
    es = require('event-stream'),
    path = require("path"),
    filePath = path.join(__dirname, "../documents/vocab_cs");

  const infrequentWords = [];

  let s = fs.createReadStream(filePath)
    .pipe(es.split())
    .pipe(es.mapSync((line) => {
        const lines = line.split('\t');
        const freq = Number(lines[1]);
        if (freq <= 200) {
          infrequentWords.push(lines[0]);
        }
        // pause the readstream
        s.pause();

        // process line here and call s.resume() when rdy

        // resume the readstream, possibly from a callback
        s.resume();
      })
      .on('error', (err) => {
        console.log('Error while reading file.', err);
      })
      .on('end', () => {
        // Declare the variable locally instead of creating an implicit global
        const infrequentWordsString = infrequentWords.join(' ');
        res.status(200).json(infrequentWordsString);
        console.log('Read entire file.');
      })
    );
}
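One possible direction, sketched under the assumption that the vocabulary file does not change while the server is running: scan the file once (for example at startup) and keep the resulting string in memory, so later requests do not re-read 15 million lines. The names `collectInfrequentWords`, `addIfInfrequent`, and `getInfrequentWords` are hypothetical, and the sketch uses only Node's built-in `fs` and `string_decoder` modules rather than `event-stream`. It has not been measured against the real file, so any speed-up is an expectation, not a result.

```javascript
const fs = require('fs');
const { StringDecoder } = require('string_decoder');

// Hypothetical helper: push the word of one "word<TAB>count" line if count <= maxFreq.
function addIfInfrequent(line, maxFreq, out) {
  const tab = line.indexOf('\t');
  if (tab === -1) return; // blank or malformed line
  const raw = line.slice(tab + 1).trim(); // trim also drops a trailing "\r"
  if (raw === '') return;
  // A non-numeric count becomes NaN, and NaN <= maxFreq is false, so it is skipped.
  if (Number(raw) <= maxFreq) out.push(line.slice(0, tab));
}

// Hypothetical helper: scan the file once in 1 MiB chunks.
// Synchronous on purpose: intended to run once, not on every request.
function collectInfrequentWords(filePath, maxFreq = 200) {
  const words = [];
  const buffer = Buffer.alloc(1 << 20);
  const decoder = new StringDecoder('utf8'); // handles multi-byte chars split across chunks
  const fd = fs.openSync(filePath, 'r');
  let leftover = ''; // partial last line carried over to the next chunk
  try {
    let bytesRead;
    while ((bytesRead = fs.readSync(fd, buffer, 0, buffer.length, null)) > 0) {
      const lines = (leftover + decoder.write(buffer.subarray(0, bytesRead))).split('\n');
      leftover = lines.pop();
      for (const line of lines) addIfInfrequent(line, maxFreq, words);
    }
    // The file may not end with a newline, so process what is left.
    addIfInfrequent(leftover + decoder.end(), maxFreq, words);
  } finally {
    fs.closeSync(fd);
  }
  return words;
}

// Cache for a single file path; assumes the file is static while the server runs.
let cachedWords = null;
function getInfrequentWords(filePath) {
  if (cachedWords === null) {
    cachedWords = collectInfrequentWords(filePath).join(' ');
  }
  return cachedWords;
}
```

The request handler could then respond with `res.status(200).json(getInfrequentWords(filePath))`. Only the first call pays the cost of reading the file, and because that call blocks the event loop, it would be better triggered at startup than inside a request.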

Any advice would be highly appreciated. I don't have much experience in this area, so I am completely stuck on this problem!

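I just need the whole process to run faster and return the rare words (<= 200 occurrences). They are displayed on the front end to help readers understand the text better, but because of this delay the user has to wait more than 10 to 15 seconds.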

Thanks!

Parsing a large file faster in Node.js: reading the file line by line currently runs too slowly

0 answers: