Parallelism does not reduce the time taken by dataset map

Date: 2018-01-28 08:56:17

Tags: tensorflow tensorflow-datasets

The TF map function supports parallel calls, but I don't see any improvement from passing num_parallel_calls to map. With both num_parallel_calls=1 and num_parallel_calls=10, there is no improvement in runtime. Here is a simple piece of code:

import tensorflow as tf
import time
def test_two_custom_function_parallelism(num_parallel_calls=1, batch=False, 
    batch_size=1, repeat=1, num_iterations=10):
    tf.reset_default_graph()
    start = time.time()
    dataset_x = tf.data.Dataset.range(1000).map(lambda x: tf.py_func(
        squarer, [x], [tf.int64]), 
        num_parallel_calls=num_parallel_calls).repeat(repeat)
    if batch:
        dataset_x = dataset_x.batch(batch_size)
    dataset_y = tf.data.Dataset.range(1000).map(lambda x: tf.py_func(
       squarer, [x], [tf.int64]), num_parallel_calls=num_parallel_calls).repeat(repeat)
    if batch:
        dataset_y = dataset_x.batch(batch_size)
        X = dataset_x.make_one_shot_iterator().get_next()
        Y = dataset_x.make_one_shot_iterator().get_next()

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        i = 0
        while True:
            try:
                res = sess.run([X, Y])
                i += 1
                if i == num_iterations:
                    break
            except tf.errors.OutOfRangeError as e:
                pass

Here are the timings:

%timeit test_two_custom_function_parallelism(num_iterations=1000, 
 num_parallel_calls=2, batch_size=2, batch=True)
370ms

%timeit test_two_custom_function_parallelism(num_iterations=1000, 
 num_parallel_calls=5, batch_size=2, batch=True)
372ms

%timeit test_two_custom_function_parallelism(num_iterations=1000, 
 num_parallel_calls=10, batch_size=2, batch=True)
384ms

I used %timeit in a Jupyter notebook. What am I doing wrong?

3 Answers:

Answer 0 (score: 22):

The problem here is that the only operation in the Dataset.map() function is a tf.py_func() op. This op calls back into the local Python interpreter to run a function in the same process. Increasing num_parallel_calls will increase the number of TensorFlow threads that attempt to call back into Python concurrently. However, Python has something called the "Global Interpreter Lock" that prevents more than one thread from executing code at a time. As a result, all but one of these parallel calls will be blocked waiting to acquire the Global Interpreter Lock, and there will be almost no parallel speedup (and perhaps even a slight slowdown).

Your code example doesn't include the definition of the squarer() function, but it might be possible to replace tf.py_func() with pure TensorFlow ops, which are implemented in C++ and can execute in parallel. For example, and just guessing from the name, you could replace it with a call to tf.square(x), and you might then enjoy some parallel speedup.
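A minimal sketch of that substitution, keeping the rest of your pipeline unchanged, might look like this:

dataset_x = tf.data.Dataset.range(1000).map(
    tf.square,  # native op implemented in C++; it releases the GIL and can run in parallel
    num_parallel_calls=num_parallel_calls).repeat(repeat)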

Note, however, that if there is only a small amount of work to do in the function, such as squaring a single integer, the speedup might not be large. Parallel Dataset.map() is more useful for heavier operations, like parsing a TFRecord with tf.parse_single_example() or performing data distortions as part of a data augmentation pipeline.
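For instance, something along these lines (the filename and feature spec here are purely hypothetical) is the kind of map where num_parallel_calls usually pays off:

# Hypothetical TFRecord parsing pipeline (TF 1.x API, matching the question).
# The parsing work runs in C++ and releases the GIL, so the parallel calls overlap.
feature_spec = {
    'image_raw': tf.FixedLenFeature([], tf.string),
    'label': tf.FixedLenFeature([], tf.int64),
}
dataset = (tf.data.TFRecordDataset('train.tfrecords')
           .map(lambda record: tf.parse_single_example(record, feature_spec),
                num_parallel_calls=10))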

Answer 1 (score: 2):

The reason may be that squarer costs less time than the overhead. I modified the code so that the squarer function takes 2 seconds, and then the num_parallel_calls parameter works as expected. Here is the complete code:

import tensorflow as tf
import time
def squarer(x):
  t0 = time.time()
  while time.time() - t0 < 2:
    y = x ** 2
  return y

def test_two_custom_function_parallelism(num_parallel_calls=1,
                                         batch=False,
                                         batch_size=1,
                                         repeat=1,
                                         num_iterations=10):
  tf.reset_default_graph()
  start = time.time()
  dataset_x = tf.data.Dataset.range(1000).map(
      lambda x: tf.py_func(squarer, [x], [tf.int64]),
      num_parallel_calls=num_parallel_calls).repeat(repeat)
  # dataset_x = dataset_x.prefetch(4)
  if batch:
    dataset_x = dataset_x.batch(batch_size)
  dataset_y = tf.data.Dataset.range(1000).map(
      lambda x: tf.py_func(squarer, [x], [tf.int64]),
      num_parallel_calls=num_parallel_calls).repeat(repeat)
  # dataset_y = dataset_y.prefetch(4)
  if batch:
    dataset_y = dataset_x.batch(batch_size)
    X = dataset_x.make_one_shot_iterator().get_next()
    Y = dataset_x.make_one_shot_iterator().get_next()

  with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    i = 0
    while True:
      t0 = time.time()
      try:
        res = sess.run([X, Y])
        print(res)
        i += 1
        if i == num_iterations:
          break
      except tf.errors.OutOfRangeError as e:
        print(i)
        break
      print('step elapse: %.4f' % (time.time() - t0))
  print('total time: %.4f' % (time.time() - start))


test_two_custom_function_parallelism(
    num_iterations=4, num_parallel_calls=1, batch_size=2, batch=True, repeat=10)
test_two_custom_function_parallelism(
    num_iterations=4, num_parallel_calls=10, batch_size=2, batch=True, repeat=10)

The output is:

[(array([0, 1]),), (array([0, 1]),)]
step elapse: 4.0204
[(array([4, 9]),), (array([4, 9]),)]
step elapse: 4.0836
[(array([16, 25]),), (array([16, 25]),)]
step elapse: 4.1529
[(array([36, 49]),), (array([36, 49]),)]
total time: 16.3374
[(array([0, 1]),), (array([0, 1]),)]
step elapse: 2.2139
[(array([4, 9]),), (array([4, 9]),)]
step elapse: 0.0585
[(array([16, 25]),), (array([16, 25]),)]
step elapse: 0.0469
[(array([36, 49]),), (array([36, 49]),)]
total time: 2.5317

So I am confused about the effect of the "Global Interpreter Lock" mentioned by @mrry.

Answer 2 (score: 1):

I set up my own version of map to get something similar to TensorFlow's Dataset.map, but one that will use multiple CPUs for the py_function.

Usage

Instead of

mapped_dataset = my_dataset.map(lambda x: tf.py_function(my_function, [x], [tf.float64]), num_parallel_calls=16)

with the code below, you can get a CPU-parallel py_function version using
mapped_dataset = map_py_function_to_dataset(my_dataset, my_function, number_of_parallel_calls=16)

(You can also specify the output type(s) of the py_function if it returns something other than a single tf.float32.)
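For example, assuming a hypothetical my_function that returns a single int64 per element, the call might look like:

mapped_dataset = map_py_function_to_dataset(my_dataset, my_function,
                                            number_of_parallel_calls=16,
                                            output_types=tf.int64)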

Internally, this creates a pool of multiprocessing workers. It still uses a single, regular, GIL-limited TensorFlow map, which only passes the inputs to the workers and collects the outputs; the workers processing the data run in parallel on the CPU.

Caveats

The function being passed needs to be picklable to work with the multiprocessing pool. This should work in most cases, but some closures or other constructs may fail. Packages like dill might loosen this restriction, but I haven't looked into it.
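As a quick illustration of the picklability requirement (this snippet is only illustrative and is not part of the wrapper code below):

import pickle

def module_level_square(x):
    # Defined at module top level, so the default pickler can serialize it.
    return x ** 2

pickle.dumps(module_level_square)  # works
# pickle.dumps(lambda x: x ** 2)   # fails: the default pickler cannot serialize lambdas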

If you pass an object's method as the function, you also need to be careful about how the object is copied across processes (each process gets its own copy of the object, so you can't rely on shared attributes).

As long as these caveats are kept in mind, this code should work in many situations.

Code

"""
Code for TensorFlow's `Dataset` class which allows for multiprocessing in CPU map functions.
"""
import multiprocessing
from typing import Callable, Union, List
import signal
import tensorflow as tf


class PyMapper:
    """
    A class which allows for mapping a py_function to a TensorFlow dataset in parallel on CPU.
    """
    def __init__(self, map_function: Callable, number_of_parallel_calls: int):
        self.map_function = map_function
        self.number_of_parallel_calls = number_of_parallel_calls
        self.pool = multiprocessing.Pool(self.number_of_parallel_calls, self.pool_worker_initializer)

    @staticmethod
    def pool_worker_initializer():
        """
        Used to initialize each worker process.
        """
        # Corrects bug where worker instances catch and throw away keyboard interrupts.
        signal.signal(signal.SIGINT, signal.SIG_IGN)

    def send_to_map_pool(self, element_tensor):
        """
        Sends the tensor element to the pool for processing.

        :param element_tensor: The element to be processed by the pool.
        :return: The output of the map function on the element.
        """
        result = self.pool.apply_async(self.map_function, (element_tensor,))
        mapped_element = result.get()
        return mapped_element

    def map_to_dataset(self, dataset: tf.data.Dataset,
                       output_types: Union[List[tf.dtypes.DType], tf.dtypes.DType] = tf.float32):
        """
        Maps the map function to the passed dataset.

        :param dataset: The dataset to apply the map function to.
        :param output_types: The TensorFlow output types of the function to convert to.
        :return: The mapped dataset.
        """
        def map_py_function(*args):
            """A py_function wrapper for the map function."""
            return tf.py_function(self.send_to_map_pool, args, output_types)
        return dataset.map(map_py_function, self.number_of_parallel_calls)


def map_py_function_to_dataset(dataset: tf.data.Dataset, map_function: Callable, number_of_parallel_calls: int,
                               output_types: Union[List[tf.dtypes.DType], tf.dtypes.DType] = tf.float32
                               ) -> tf.data.Dataset:
    """
    A one line wrapper to allow mapping a parallel py function to a dataset.

    :param dataset: The dataset whose elements the mapping function will be applied to.
    :param map_function: The function to map to the dataset.
    :param number_of_parallel_calls: The number of parallel calls of the mapping function.
    :param output_types: The TensorFlow output types of the function to convert to.
    :return: The mapped dataset.
    """
    py_mapper = PyMapper(map_function=map_function, number_of_parallel_calls=number_of_parallel_calls)
    mapped_dataset = py_mapper.map_to_dataset(dataset=dataset, output_types=output_types)
    return mapped_dataset