Unable to run Ray Tune with TensorFlow and a GPU

Time: 2019-04-05 08:32:04

Tags: tensorflow deep-learning hyperparameters ray

  • OS platform and distribution: Linux Ubuntu 16.04
  • Ray installed from (source or binary): binary
  • Ray version: 0.6.5
  • Python version: 3.6

I am trying to use Ray with TensorFlow by following the tutorial (link), but the run fails with a Tune error.

Error log


Result logdir: ray_results/tune_gan_test
Number of trials: 2 ({'ERROR': 2})
ERROR trials:
 - train_gan_0_partition=0:     ERROR, 1 failures: ray_results/tune_gan_test/train_gan_0_partition=0_2019-04-05_16-25-5536of9abi/error_2019-04-05_16-26-02.txt
 - train_gan_1_partition=1:     ERROR, 1 failures: ray_results/tune_gan_test/train_gan_1_partition=1_2019-04-05_16-26-1038hprt_a/error_2019-04-05_16-26-12.txt

== Status ==
Using FIFO scheduling algorithm.
Resources requested: 0/16 CPUs, 0/1 GPUs
Memory usage on this node: 53.0/67.5 GB
Result logdir: ray_results/tune_gan_test
Number of trials: 2 ({'ERROR': 2})
ERROR trials:
 - train_gan_0_partition=0:     ERROR, 1 failures: ray_results/tune_gan_test/train_gan_0_partition=0_2019-04-05_16-25-5536of9abi/error_2019-04-05_16-26-02.txt
 - train_gan_1_partition=1:     ERROR, 1 failures: ray_results/tune_gan_test/train_gan_1_partition=1_2019-04-05_16-26-1038hprt_a/error_2019-04-05_16-26-12.txt

Traceback (most recent call last):
  File "train.py", line 142, in <module>
    **gan_spec)
  File "/lib/python3.6/site-packages/ray/tune/tune.py", line 253, in run
    raise TuneError("Trials did not complete", errored_trials)
ray.tune.error.TuneError: ('Trials did not complete', [train_gan_0_partition=0, train_gan_1_partition=1])
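
The TuneError above only lists which trials failed. The exception raised inside each trial is written to the error_*.txt file named next to that trial in the log. The following is a minimal sketch for printing those files, assuming the layout shown in the log (one subdirectory per trial under the result logdir). `collect_trial_errors` is a hypothetical helper name, not part of Ray.

```python
import glob
import os


def collect_trial_errors(logdir):
    """Return {error_file_path: contents} for every error_*.txt one level below logdir."""
    pattern = os.path.join(logdir, "*", "error_*.txt")
    errors = {}
    for path in sorted(glob.glob(pattern)):
        # errors="replace" keeps the helper from failing on odd bytes in a log.
        with open(path, "r", errors="replace") as fh:
            errors[path] = fh.read()
    return errors


if __name__ == "__main__":
    # "ray_results/tune_gan_test" is the result logdir printed in the log above.
    for path, text in collect_trial_errors("ray_results/tune_gan_test").items():
        print("=" * 20, path)
        print(text)
```

Including the contents of one of these files in the question would show the actual failure rather than the summary.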

Source code / logs

Code related to the Ray usage:

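import os

import ray
import tensorflow as tf
from ray import tune
from ray.tune import grid_search, register_trainable

# !!! Entrypoint for ray.tune !!!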
def train(config={'partition': 0}, reporter=None):
    global status_reporter, partition_fn
    status_reporter = reporter
    partition_fn = config['partition']
    tf.app.run(main=main)


# !!! Example of using the ray.tune Python API !!!
if __name__ == "__main__":
    try:
        register_trainable('train_gan', train)
        gan_spec = {
            'stop': {
                'time_total_s': 600,
            },
            'config': {
                'partition': grid_search([0, 1]),
            },
        }

        ray.init()

        tune.run('train_gan',
                 name='tune_gan_test',
                 resources_per_trial={"gpu":1},
                 raise_on_failed_trial=True,
                 queue_trials=True,
                 with_server=False,
                 **gan_spec)

    except KeyboardInterrupt:
        os._exit(1)  # exit immediately on Ctrl-C (was os._exists, which does not exit)
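
One thing that may be worth checking, offered as an assumption because the per-trial error file is not shown: in TensorFlow 1.x, `tf.app.run` parses flags from `sys.argv` and then finishes by calling `sys.exit` with the return value of `main`, so a trial entrypoint that goes through it never returns normally. The sketch below reproduces that behavior in plain Python, without TensorFlow or Ray. `main`, `run_then_exit`, `train_direct`, and `train_via_exit` are hypothetical stand-ins, not the asker's code.

```python
import sys


def main(argv):
    """Hypothetical stand-in for the script's real main(argv)."""
    return None


def run_then_exit(main_fn, argv):
    """Mimics how TF 1.x tf.app.run finishes: it exits with main's return value."""
    sys.exit(main_fn(argv))


def train_direct(config, reporter=None):
    """Trial entrypoint that calls main directly, so it returns normally."""
    main([sys.argv[0]])
    return "finished"


def train_via_exit(config, reporter=None):
    """Same entrypoint routed through the exit-on-return wrapper."""
    try:
        run_then_exit(main, [sys.argv[0]])
    except SystemExit:
        return "SystemExit raised before the trial could return"
    return "finished"


if __name__ == "__main__":
    print(train_direct({"partition": 0}))
    print(train_via_exit({"partition": 0}))
```

If the error file does point at `tf.app.run`, calling `main` directly from the trial function is one way to test this hypothesis.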

How can I fix this? Thanks for your help :)

0 answers:

No answers