Question

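I have a dataset that is generated on the fly on the CPU. The samples are computed in Python by a function make_sample, which is quite complex and cannot be translated into TensorFlow ops. Since sample generation is time-consuming, I want to call this function from several threads to fill an input queue.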

Starting from the example given in the documentation, I arrived at the following toy example:

import numpy as np
import tensorflow as tf
import time

def make_sample():
  # something that takes time and needs to be on CPU w/o tf ops
  p = 1
  for n in range(1000000):
    p = (p + np.random.random()) * np.random.random()
  return np.float32(p)

read_threads = 1

with tf.device('/cpu:0'):
  example_list = [tf.py_func(make_sample, [], [tf.float32]) for _ in range(read_threads)]
  for ex in example_list:
    ex[0].set_shape(())
  batch_size = 3
  capacity = 30
  batch = tf.train.batch_join(example_list, batch_size=batch_size, capacity=capacity)

with tf.Session().as_default() as sess:
  tf.global_variables_initializer().run()
  coord = tf.train.Coordinator()
  threads = tf.train.start_queue_runners(sess=sess, coord=coord)
  try:
    # dry run, left out of timing
    sess.run(batch)
    start_time = time.time()
    for it in range(5):
      print(sess.run(batch))
  finally:
    duration = time.time() - start_time
    print('duration: {0:4.2f}s'.format(duration))
    coord.request_stop()
  coord.join(threads)

To my surprise, CPU usage never goes above 50% when I increase read_threads. Worse, the run time gets much longer as threads are added. On my computer:

read_threads=1→duration: 12s
read_threads=2→duration: 46s
read_threads=4→duration: 68s
read_threads=8→duration: 112s

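Is there an explanation for this and, more importantly, is there a way to get efficient multithreaded data generation in TensorFlow with a custom Python function?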

Answer 1

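tf.py_func reuses the existing Python interpreter. Unfortunately, Python supports concurrency but not parallelism. In other words, you can have multiple Python threads, but only one of them can execute Python code at any given time (in CPython this is enforced by the global interpreter lock). The standard solutions are to move the generation pipeline into TensorFlow/C++, or to use multiple Python processes plus an additional layer that aggregates their results (for example, using ZMQ to collect results from several Python processes).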
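
As a rough illustration of the multi-process route, here is a minimal sketch that uses only the standard library instead of ZMQ. The make_sample below is a simplified stand-in for the function in the question (it takes a seed so that each call is reproducible), and generate_batches, num_workers and the batch layout are hypothetical names chosen for this example, not part of TensorFlow.

```python
import random
from concurrent.futures import ProcessPoolExecutor


def make_sample(seed):
    # Stand-in for the expensive, pure-Python generator from the question.
    # The seed argument is an assumption added here for reproducibility.
    rng = random.Random(seed)
    p = 1.0
    for _ in range(100000):
        p = (p + rng.random()) * rng.random()
    return p


def generate_batches(num_batches, batch_size, num_workers):
    """Compute samples in worker processes and group them into batches."""
    total = num_batches * batch_size
    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        # Each worker process runs its own interpreter with its own GIL,
        # so the calls to make_sample can run in parallel.
        samples = list(pool.map(make_sample, range(total)))
    return [samples[i:i + batch_size] for i in range(0, total, batch_size)]


if __name__ == "__main__":
    for batch in generate_batches(num_batches=5, batch_size=3, num_workers=4):
        print(batch)
```

Each resulting batch could then be handed to the graph through a tf.placeholder and feed_dict, rather than through tf.py_func. Whether this pays off depends on how expensive one sample is compared with the cost of pickling it between processes.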
