Question

I need to process large files with millions of lines, and sometimes these files may not fit in memory.

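To gain performance and be able to process these files at all, I am trying to build a tf.data.Dataset pipeline that processes the lines in parallel.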

Consider the following two code blocks:

1) A single pipeline:

dataset = tf.data.TextLineDataset(files, compression_type=None, buffer_size=None, num_parallel_reads=None)
dataset = dataset.map(lambda *args: tf.py_function(py_unicode_to_ascii, args, tf.string))
dataset = dataset.map(tf_lower)

2) The same pipeline split into functions:

def getFileContent(files, compression_type=None, buffer_size=None, num_parallel_reads=None):

    return tf.data.TextLineDataset(files, compression_type=compression_type, buffer_size=buffer_size, num_parallel_reads=num_parallel_reads)

def preprocessData(dataset, preprocessors=()):

    for p in preprocessors:

        # Dispatch on the function-name prefix with a plain Python string check;
        # p.__name__ is a Python str, so no TensorFlow string op is needed here.
        if p.__name__.startswith('tf_'):
            dataset = dataset.map(p)

        elif p.__name__.startswith('py_'):
            dataset = dataset.map(lambda *args: tf.py_function(p, args, tf.string))

    return dataset

# Text files
files = list_files('/data/tmp/mytext2.txt', include_label=False)

dataset = getFileContent(files)
dataset = preprocessData(dataset, preprocessors=[py_unicode_to_ascii, tf_lower])

My questions are:

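a) Do approaches (1) and (2) behave the same? By that I mean: does the data keep flowing through the pipeline without any function-level blocking or waiting? Perhaps something like dataset = getFileContent(...).preprocessData(...)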
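For reference, the chained form I have in mind could probably be written with Dataset.apply. This is only a minimal sketch under assumptions: it uses an in-memory list instead of my files, tf_lower is a stand-in for my real preprocessor, and tf_strip and preprocess_data are hypothetical names made up for the illustration. In this sketch both variants produce the same elements, since the helper only returns a Dataset object and nothing is read until the dataset is iterated.

```python
import tensorflow as tf

def tf_lower(line):
    # Stand-in for the tf_lower preprocessor from the question (assumed behavior).
    return tf.strings.lower(line)

def tf_strip(line):
    # Hypothetical second preprocessor, used only for illustration.
    return tf.strings.strip(line)

def preprocess_data(dataset, preprocessors=()):
    # Each map() call only adds a transformation to the pipeline definition;
    # no element is read inside this function.
    for p in preprocessors:
        dataset = dataset.map(p)
    return dataset

lines = ["  Hello ", "WORLD"]

# (1) Single chained pipeline.
inline = tf.data.Dataset.from_tensor_slices(lines).map(tf_lower).map(tf_strip)

# (2) Same pipeline assembled through a helper and chained with Dataset.apply.
split = tf.data.Dataset.from_tensor_slices(lines).apply(
    lambda ds: preprocess_data(ds, [tf_lower, tf_strip])
)

# Elements are only produced here, when the datasets are iterated.
inline_out = [x.numpy() for x in inline]
split_out = [x.numpy() for x in split]
```

Comparing inline_out and split_out is a quick way to check that the two constructions are equivalent for a given set of preprocessors.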

b) Should I use yield instead of return?

How does a tf.data.Dataset pipeline behave when it is split across functions?
