Question

I noticed that Google NMT uses codecs to read its input data files.

import codecs
import tensorflow as tf
with codecs.getreader("utf-8")(tf.gfile.GFile(input_file, mode="rb")) as f:
    return f.read().splitlines()

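I have two questions.

1. Does the code above support reading a large dataset (more than about 5 GB) on a personal computer with 16 GB of RAM without a memory error, given that it uses tf.gfile.GFile? I would really appreciate a solution that lets me read a huge language corpus without running into a memory error.

2. I imported codecs in my code, so why do I get the error "NameError: name 'codecs' is not defined"?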

Edit 1:

For question 2, I now get:

 OutOfRangeError                           Traceback (most recent call last)
    <ipython-input-7-e78786c1f151> in <module>()
          6 input_file = os.path.join(source_path)
          7 with codecs.getreader("utf-8")(tf.gfile.GFile(input_file, mode="rb")) as f:
    ----> 8     source_text = f.read().splitlines()

OutOfRangeError is raised when an operation iterates past the valid input range. How can I resolve this?

Answer 1

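If the file is very large, it is better to process it line by line, so that only one line is held in memory at a time instead of the whole file. The code below should do the trick: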

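# input_file is the path to the corpus; do_something_with is a placeholder for your own processing
with open(input_file, encoding="utf-8") as infile:
    for line in infile:
        do_something_with(line)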
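To stay closer to the codecs.getreader pattern from the question, here is a minimal sketch of the same line-by-line idea using only the standard library. The names iter_lines and count_tokens are hypothetical helpers made up for illustration, and the plain open(path, "rb") stands in for tf.gfile.GFile. Presumably a GFile could be passed as the underlying binary stream in the same way, but that is an assumption and is not tested here.

```python
import codecs
import os
import tempfile


def iter_lines(path, encoding="utf-8"):
    """Yield decoded lines one at a time instead of reading the whole file."""
    # Open in binary mode and decode incrementally, mirroring the
    # codecs.getreader(...) call from the question.
    with codecs.getreader(encoding)(open(path, "rb")) as reader:
        for line in reader:
            # Lines keep their terminator when iterating; strip it here.
            yield line.rstrip("\r\n")


def count_tokens(path):
    """Hypothetical per-line work: count whitespace-separated tokens."""
    total = 0
    for line in iter_lines(path):
        total += len(line.split())
    return total


# Small demonstration on a temporary file (a stand-in for the real corpus).
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as tmp:
    tmp.write("hello world\nbonjour le monde\n")
    demo_path = tmp.name

demo_total = count_tokens(demo_path)
os.remove(demo_path)
```

Because iter_lines is a generator, memory use depends on the longest line rather than on the file size. By contrast, f.read().splitlines() first builds the full decoded text and then a list of every line.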
