Question

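I know this question isn't new, but I couldn't find anything useful. In my case I have a 20 GB file and I need to read random lines from it. Right now I have a simple file index that maps each line number to the corresponding seek offset. I also disable buffering when reading, so that only the required line is read.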
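A minimal sketch of the setup described above, not the original function: the names `build_line_index` and `read_random_lines` are hypothetical, and it assumes the index is simply a list of byte offsets, one per line.

```python
import random


def build_line_index(path):
    """Return a list where entry i is the byte offset at which line i starts."""
    offsets = []
    pos = 0
    with open(path, "rb") as f:
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets


def read_random_lines(path, offsets, count, seed=None):
    """Yield (line_number, line_bytes) for `count` randomly chosen lines."""
    rng = random.Random(seed)
    # buffering=0 disables Python-level buffering (allowed in binary mode only).
    with open(path, "rb", buffering=0) as f:
        for _ in range(count):
            line_no = rng.randrange(len(offsets))
            f.seek(offsets[line_no])
            yield line_no, f.readline()
```

The index is built once with a single sequential pass; each random read afterwards is one `seek` followed by one `readline`.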

Here is my code:

features_gen = tedlium_random_speech_gen(5000) # just a wrapper for function given above

i = 0
for feature, cls in features_gen:
    if i % 1000 == 0:
        print("Got %d features" % i)

    i += 1

print("Total %d features" % i)

The problem is that it is very slow: reading 5000 lines takes 89 seconds on my Mac (and the file is on an SSD). This is the code I use to test it:

var animations = ['fadeIn', 'fadeInDown', 'slideInUp', 'flipInY', 'bounceInLeft'];
var j;
var tmp = animations.slice(); //copy

var removed = 0;
for (var i = 1; i < 20; i++) {
    j = Math.floor(Math.random() * tmp.length);
    console.log(tmp[j]);
    tmp.splice(j, 1);
    removed++;
    if (animations.length == removed) {
        tmp = animations.slice();
        removed = 0
    }
}

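I have read a bit about memory-mapping files, but I don't really understand how it works: what the mapping actually does, and whether it would speed up the process or not.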
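For reference, a rough sketch of what memory mapping looks like with Python's standard `mmap` module. The helper names `open_mapped` and `read_line_at` are hypothetical. The mapping makes the file's bytes addressable like a `bytes` object, and the operating system loads pages on demand when they are touched. Whether this is faster for this workload is not something the sketch demonstrates; it would have to be measured.

```python
import mmap


def open_mapped(path):
    """Open `path` and map it read-only; returns (file_object, mapping)."""
    f = open(path, "rb")
    # Length 0 maps the whole file (the file must not be empty).
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    return f, mm


def read_line_at(mm, offset):
    """Return the line that starts at byte `offset`, including its newline."""
    end = mm.find(b"\n", offset)
    if end == -1:
        end = len(mm)  # last line without a trailing newline
    else:
        end += 1       # include the newline itself
    return mm[offset:end]
```

With a line-offset index already available, a random read becomes `read_line_at(mm, offsets[line_no])`; the caller is responsible for closing both the mapping and the file object afterwards.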

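So the main question is: what are the possible ways to speed this up? The only option I can see right now is to read random blocks of lines instead of individual random lines.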

Fast random reads from a large file

0 answers: