How can I run TensorFlow with Horovod on a local machine without a GPU for debugging?

Time: 2019-05-14 23:22:38

Tags: tensorflow horovod

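Before filing a bug report, I thought asking here would be a good start.

Like most of us, I debug code on my local machine. I would like Horovod to run on a local machine (with all dependencies installed) the same way plain TensorFlow does, so that I can debug the model. So far, I have not found a way to do that.

I have been trying the basic script from the README page with a few modifications: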

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin GPU to be used to process local rank (one GPU per process)
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Build model...
params = tf.get_variable("params", [1, 2, 3, 4])
loss = tf.math.reduce_mean(params)
opt = tf.train.AdagradOptimizer(0.01 * hvd.size())

# Add Horovod Distributed Optimizer
opt = hvd.DistributedOptimizer(opt)

# Add hook to broadcast variables from rank 0 to all other processes during
# initialization.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

# Make training operation
train_op = opt.minimize(loss)

# Save checkpoints only on worker 0 to prevent other workers from corrupting them.
checkpoint_dir = '/tmp/train_logs' if hvd.rank() == 0 else None

# The MonitoredTrainingSession takes care of session initialization,
# restoring from a checkpoint, saving to a checkpoint, and closing when done
# or an error occurs.
with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                       config=config,
                                       hooks=hooks) as mon_sess:
  while not mon_sess.should_stop():
    # Perform synchronous training.
    mon_sess.run(train_op)

This results in:

ValueError: Variable params already exists, disallowed. Did you mean to set reuse=True or reuse=tf.AUTO_REUSE in VarScope? Originally defined at:

  File "<ipython-input-5-057c766a5c1e>", line 6, in <module>
    params = tf.get_variable("params", [1, 2, 3, 4])
  File "/Users/yauheni/env/dev/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3296, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "/Users/yauheni/env/dev/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3214, in run_ast_nodes
    if (yield from self.run_code(code, result)):
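A note on the traceback: the `<ipython-input-5-...>` frame suggests the code was executed more than once in the same IPython session. If so, the `ValueError` may come from TensorFlow's variable store rather than from Horovod, because `tf.get_variable` refuses to create a name that already exists in the current graph. The sketch below reproduces that behaviour without Horovod. It is only an illustration under that assumption: the helper names `build_params`, `defined_twice_raises`, and `fresh_graph_each_time` are hypothetical, and `tf.compat.v1` is used so that the same code runs on TensorFlow 1.13+ and 2.x.

```python
import tensorflow as tf

# tf.compat.v1 exposes the TF 1.x graph-mode API on both 1.13+ and 2.x.
tf1 = tf.compat.v1


def build_params():
    # Same variable definition as in the script above (hypothetical wrapper).
    return tf1.get_variable("params", [1, 2, 3, 4])


def defined_twice_raises():
    """Return True if defining "params" twice in one graph raises ValueError."""
    with tf.Graph().as_default():
        build_params()
        try:
            build_params()  # second definition in the same graph
        except ValueError:
            return True
    return False


def fresh_graph_each_time():
    """Define "params" twice, each time in its own new graph."""
    names = []
    for _ in range(2):
        with tf.Graph().as_default():
            names.append(build_params().name)
    return names


if __name__ == "__main__":
    print(defined_twice_raises())
    print(fresh_graph_each_time())
```

In an interactive TF 1.x session, calling `tf.reset_default_graph()` before re-running the cell has a similar effect to building a new graph each time. Whether this is the only obstacle to running the script on a CPU-only machine is not established by the question.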

0 answers:

No answers