Question

I have a training function that trains a tf model end to end (simplified here for illustration):

def opt_fx(params, gpu):
    os.environ["CUDA_VISIBLE_DEVICES"] = gpu

    sess = tf.Session()
    # Run some training on a particular gpu...
    sess.run(...)

I would like to run hyperparameter optimization over 20 trials, with one model per GPU:

from threading import Thread
exp_trials = list(hyperparams.trials(num=20))
train_threads = []
for gpu_num, trial_params in zip(['0', '1', '2', '3']*5, exp_trials):
    t = Thread(target=opt_fx, args=(trial_params, gpu_num,))
    train_threads.append(t)

# Start the threads, and block on their completion.
for t in train_threads:
  t.start()

for t in train_threads:
  t.join()

However, this fails... Is this the right way to do it?

Answer 1

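I'm not sure this is the best way, but what I ended up doing is defining a graph per device and training each one in a separate session. This can be parallelized. I tried to reuse one graph across devices, but that did not work. Here is what my version looks like in code (complete example):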

import threading
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

# Get the data
mnist = input_data.read_data_sets("data/mnist", one_hot=True)
train_x_all = mnist.train.images
train_y_all = mnist.train.labels
test_x = mnist.test.images
test_y = mnist.test.labels

# Define the graphs per device
devices = ['/gpu', '/cpu']        # just one GPU on this machine...
learning_rates = [0.01, 0.03]
jobs = []
for device, learning_rate in zip(devices, learning_rates):
  with tf.Graph().as_default() as graph, tf.device(device):  # pin this graph's ops to its device
    x = tf.placeholder(tf.float32, [None, 784], name='x')
    y = tf.placeholder(tf.float32, [None, 10], name='y')
    W = tf.Variable(tf.zeros([784, 10]))
    b = tf.Variable(tf.zeros([10]))
    pred = tf.nn.softmax(tf.matmul(x, W) + b)
    accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1)), tf.float32))
    cost = tf.reduce_mean(-tf.reduce_sum(y*tf.log(pred), reduction_indices=1), name='cost')
    optimize = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost, name='optimize')
  jobs.append(graph)

# Train a graph on a device
def train(device, graph):
  print("Start training on %s" % device)
  with tf.Session(graph=graph) as session:
    x = graph.get_tensor_by_name('x:0')
    y = graph.get_tensor_by_name('y:0')
    cost = graph.get_tensor_by_name('cost:0')
    optimize = graph.get_operation_by_name('optimize')

    session.run(tf.global_variables_initializer())
    batch_size = 500
    for epoch in range(5):
      total_batch = int(train_x_all.shape[0] / batch_size)
      for i in range(total_batch):
        batch_x = train_x_all[i * batch_size:(i + 1) * batch_size]
        batch_y = train_y_all[i * batch_size:(i + 1) * batch_size]
        _, c = session.run([optimize, cost], feed_dict={x: batch_x, y: batch_y})
        if i % 20 == 0:
          print("Device %s: epoch #%d step=%d cost=%f" % (device, epoch, i, c))

# Start threads in parallel
train_threads = []
for i, graph in enumerate(jobs):
  train_threads.append(threading.Thread(target=train, args=(devices[i], graph)))
for t in train_threads:
  t.start()
for t in train_threads:
  t.join()

Note that the train function works with the tensors and operations of the graph it is given, i.e. each graph has its own cost and optimize.

This produces the following output, which shows that the two models are trained in parallel:

Start training on /gpu
Start training on /cpu
Device /cpu: epoch #0 step=0 cost=2.302585
Device /cpu: epoch #0 step=20 cost=1.788247
Device /cpu: epoch #0 step=40 cost=1.400490
Device /cpu: epoch #0 step=60 cost=1.271820
Device /gpu: epoch #0 step=0 cost=2.302585
Device /cpu: epoch #0 step=80 cost=1.128214
Device /gpu: epoch #0 step=20 cost=2.105802
Device /cpu: epoch #0 step=100 cost=0.927004
Device /cpu: epoch #1 step=0 cost=0.905336
Device /gpu: epoch #0 step=40 cost=1.908744
Device /cpu: epoch #1 step=20 cost=0.865687
Device /gpu: epoch #0 step=60 cost=1.808407
Device /cpu: epoch #1 step=40 cost=0.754765
Device /gpu: epoch #0 step=80 cost=1.676024
Device /cpu: epoch #1 step=60 cost=0.794201
Device /gpu: epoch #0 step=100 cost=1.513800
Device /gpu: epoch #1 step=0 cost=1.451422
Device /cpu: epoch #1 step=80 cost=0.786958
Device /gpu: epoch #1 step=20 cost=1.415125
Device /cpu: epoch #1 step=100 cost=0.643715
Device /cpu: epoch #2 step=0 cost=0.674683
Device /gpu: epoch #1 step=40 cost=1.273473
Device /cpu: epoch #2 step=20 cost=0.658424
Device /gpu: epoch #1 step=60 cost=1.300150
Device /cpu: epoch #2 step=40 cost=0.593681
Device /gpu: epoch #1 step=80 cost=1.242193
Device /cpu: epoch #2 step=60 cost=0.640543
Device /gpu: epoch #1 step=100 cost=1.105950
Device /gpu: epoch #2 step=0 cost=1.089900
Device /cpu: epoch #2 step=80 cost=0.664947
Device /gpu: epoch #2 step=20 cost=1.088389
Device /cpu: epoch #2 step=100 cost=0.535446
Device /cpu: epoch #3 step=0 cost=0.580295
Device /gpu: epoch #2 step=40 cost=0.983053
Device /cpu: epoch #3 step=20 cost=0.566510
Device /gpu: epoch #2 step=60 cost=1.044966
Device /cpu: epoch #3 step=40 cost=0.518787
Device /gpu: epoch #2 step=80 cost=1.025607
Device /cpu: epoch #3 step=60 cost=0.562461
Device /gpu: epoch #2 step=100 cost=0.897545
Device /gpu: epoch #3 step=0 cost=0.907381
Device /cpu: epoch #3 step=80 cost=0.600475
Device /gpu: epoch #3 step=20 cost=0.911914
Device /cpu: epoch #3 step=100 cost=0.477412
Device /cpu: epoch #4 step=0 cost=0.527233
Device /gpu: epoch #3 step=40 cost=0.827964
Device /cpu: epoch #4 step=20 cost=0.513356
Device /gpu: epoch #3 step=60 cost=0.897128
Device /cpu: epoch #4 step=40 cost=0.474257
Device /gpu: epoch #3 step=80 cost=0.898960
Device /cpu: epoch #4 step=60 cost=0.514083
Device /gpu: epoch #3 step=100 cost=0.774140
Device /gpu: epoch #4 step=0 cost=0.799004
Device /cpu: epoch #4 step=80 cost=0.559898
Device /gpu: epoch #4 step=20 cost=0.802869
Device /cpu: epoch #4 step=100 cost=0.440813
Device /gpu: epoch #4 step=40 cost=0.732562
Device /gpu: epoch #4 step=60 cost=0.801020
Device /gpu: epoch #4 step=80 cost=0.815830
Device /gpu: epoch #4 step=100 cost=0.692840

You can try it yourself with the standard MNIST data.

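This isn't ideal if you have many hyperparameters to tune, but you should be able to write an outer loop that iterates over the possible hyperparameter tuples, assigns each graph to a device, and runs them as shown above.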
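
One way such an outer loop could look is sketched below. It is only an illustration under stated assumptions: `run_trials` and `fake_train` are made-up names, and `fake_train` stands in for a function that would build a fresh graph for the given hyperparameters and train it in its own session. Each device gets one thread that keeps pulling the next untried tuple from a shared queue, so no device ever runs two trials at once.

```python
import queue
import threading


def run_trials(trials, devices, train_fn):
    """Run every trial, keeping each device busy with one trial at a time."""
    pending = queue.Queue()
    for trial in trials:
        pending.put(trial)

    results = []
    results_lock = threading.Lock()

    def device_worker(device):
        # Pull the next untried hyperparameter tuple until none are left.
        while True:
            try:
                params = pending.get_nowait()
            except queue.Empty:
                return
            outcome = train_fn(device, params)
            with results_lock:
                results.append((device, params, outcome))

    threads = [threading.Thread(target=device_worker, args=(d,)) for d in devices]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def fake_train(device, params):
    # Hypothetical stand-in: a real train_fn would build a new tf.Graph for
    # `params`, pin it to `device`, and train it in its own tf.Session.
    learning_rate, batch_size = params
    return learning_rate * batch_size


if __name__ == "__main__":
    grid = [(lr, bs) for lr in (0.01, 0.03, 0.1) for bs in (100, 500)]
    for device, params, outcome in run_trials(grid, ['/gpu', '/cpu'], fake_train):
        print(device, params, outcome)
```

Because the queue hands out work on demand, a faster device simply ends up running more trials than a slower one.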

Answer 2

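For a problem like this I would usually use the multiprocessing library rather than threading: the overhead of multiprocessing is small compared with training a network, and it eliminates any GIL issues. I think the main problem with your code is the environment variable. You set "CUDA_VISIBLE_DEVICES" in each thread, but all threads still share the same environment because they live in the same process.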
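
The shared-environment point can be checked without TensorFlow. The sketch below uses only the standard library; `set_and_read` and `demo` are names made up for this illustration. Every thread writes its own GPU id, waits until all threads have written, and then reads the variable back.

```python
import os
import threading


def set_and_read(gpu, barrier, seen):
    os.environ["CUDA_VISIBLE_DEVICES"] = gpu  # writes to the process-wide environment
    barrier.wait()                            # wait until every thread has written
    seen[gpu] = os.environ["CUDA_VISIBLE_DEVICES"]


def demo(gpus=("0", "1", "2", "3")):
    barrier = threading.Barrier(len(gpus))
    seen = {}
    threads = [threading.Thread(target=set_and_read, args=(g, barrier, seen))
               for g in gpus]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen  # maps "what the thread wrote" to "what it read back"


if __name__ == "__main__":
    print(demo())
```

All four threads read back the same value, whichever one happened to be written last, so the per-thread setting does not isolate the GPUs from each other.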

So what I usually do in Tensorflow == 2.1 is pass a GPU ID number to the worker process, which can then run the following code to set the visible GPU:

gpus = tf.config.experimental.list_physical_devices('GPU')
my_gpu = gpus[gpu_id]
tf.config.set_visible_devices(my_gpu, 'GPU')

TensorFlow in that process will now run only on that GPU.
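
The worker-process layout itself might be sketched as follows. This is a rough outline, not a definitive implementation: `split_trials`, `train_one`, `gpu_worker` and `run_all` are hypothetical names, `train_one` is a placeholder that does no real training, and the choice of the "spawn" start method is an assumption of this sketch. Each GPU gets exactly one job, so two trials never compete for the same GPU.

```python
import multiprocessing as mp
import os


def split_trials(trials, gpu_ids):
    """Deal the trials out round-robin: one (gpu_id, trials) job per GPU."""
    return [(gpu_id, trials[i::len(gpu_ids)]) for i, gpu_id in enumerate(gpu_ids)]


def train_one(params, gpu_id):
    # Hypothetical placeholder. A real version would build and fit a model
    # with `params` and return a validation metric.
    return {"params": params, "gpu_id": gpu_id, "pid": os.getpid()}


def gpu_worker(job):
    """Runs in its own process and trains that GPU's trials one after another."""
    gpu_id, trials = job
    # This is where the worker would import tensorflow and restrict it to
    # the GPU with index gpu_id, before any training starts.
    return [train_one(params, gpu_id) for params in trials]


def run_all(trials, gpu_ids):
    jobs = split_trials(trials, gpu_ids)
    # "spawn" starts each worker from a fresh interpreter (an assumption of
    # this sketch, so that workers inherit no state from the parent).
    with mp.get_context("spawn").Pool(processes=len(gpu_ids)) as pool:
        per_gpu = pool.map(gpu_worker, jobs, chunksize=1)
    return [result for results in per_gpu for result in results]


if __name__ == "__main__":
    trials = [{"learning_rate": 0.001 * (i + 1)} for i in range(20)]
    for result in run_all(trials, gpu_ids=[0, 1, 2, 3]):
        print(result)
```

With 20 trials and 4 GPU ids, each worker trains 5 trials in sequence. The `if __name__ == "__main__":` guard matters here, because spawned workers re-import the main module.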

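Sometimes the networks you are training are small enough that you can actually run several of them on one GPU at the same time. To make sure they all fit in GPU memory, you can set a memory limit for each worker you start.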

tf.config.set_logical_device_configuration(
    my_gpu,
    [tf.config.LogicalDeviceConfiguration(memory_limit=6000)]
)

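However, if you set a memory limit, keep in mind that TensorFlow uses some extra memory outside of that limit, for cuDNN and other things, so you also need to leave some buffer memory for each session you run. I usually just use trial and error to see what fits, so sorry, I don't have better numbers.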
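
As a starting point for that trial and error, the arithmetic can be written down as below. The function name `workers_per_gpu`, the 16000 MB card size and the 1000 MB buffer are assumptions made up for this example; only the 6000 MB limit comes from the snippet above, and the real overhead has to be measured.

```python
def workers_per_gpu(total_mb, memory_limit_mb, buffer_mb):
    """Estimate how many workers fit on one GPU.

    Each worker is budgeted its memory limit plus a buffer for memory
    that is allocated outside of that limit.
    """
    per_worker_mb = memory_limit_mb + buffer_mb
    if per_worker_mb <= 0:
        raise ValueError("memory_limit_mb + buffer_mb must be positive")
    return total_mb // per_worker_mb


if __name__ == "__main__":
    # Hypothetical numbers: a 16000 MB card, 6000 MB limit, 1000 MB buffer.
    print(workers_per_gpu(16000, 6000, 1000))  # prints 2
```

If the workers still run out of memory, increase the buffer and recompute.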

Running hyperparameter optimization on parallel GPUs using TensorFlow
