Question

I am trying to run the GitHub repository "Face-Aging-CAAE" (https://github.com/ZZUTK/Face-Aging-CAAE). The code runs on my CPU (taking about 3 days), but on the GPU it terminates while executing session.run(), with no error output.

Here the code is run on the GPU, and the run ends at the "init model" step:

In [1]: runfile('/media/.../face-aging-caae/Face-Aging-CAAE-master/main.py', wdir='/media/.../face-aging-caae/Face-Aging-CAAE-master')
Namespace(dataset='UTKFace', epoch=50, is_train=True, savedir='save', testdir='None', use_init_model=True, use_trained_model=True)

        Building graph ...
WARNING:tensorflow:From /home/.../anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/datasets/base.py:198: retry (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Use the retry module or similar alternatives.

        Training Mode

        Loading pre-trained model ...
        FAILED >_<!

        Loading init model ...
INFO:tensorflow:Restoring parameters from init_model/model-init

In [1]:

The code exits while executing this block in "FaceAging.py":

                # update
                _, _, _, EG_err, Ez_err, Dz_err, Dzp_err, Gi_err, DiG_err, Di_err, TV = self.session.run(
                    fetches = [
                        self.EG_optimizer,
                        self.D_z_optimizer,
                        self.D_img_optimizer,
                        self.EG_loss,
                        self.E_z_loss,
                        self.D_z_loss_z,
                        self.D_z_loss_prior,
                        self.G_img_loss,
                        self.D_img_loss_G,
                        self.D_img_loss_input,
                        self.tv_loss
                    ],
                    feed_dict={
                        self.input_image: batch_images,
                        self.age: batch_label_age,
                        self.gender: batch_label_gender,
                        self.z_prior: batch_z_prior
                    }
                )

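System:

Ubuntu 18.04.2 LTS
CPU: Intel Xeon E5 (16 GB RAM)
GPU: Nvidia GeForce GTX 1050 Ti (4 GB)
conda, Python 2.7, tensorflow-gpu 1.7.0, scipy 1.0.0 (the code's prerequisites)

In this environment the GPU works with the other simple code I have tested.

I tried to run the code on the GPU explicitly: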

with tf.device('/gpu:0'):
    tf.app.run()

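But it gives the error below (the error goes away after enabling "allow soft placement", and the code then returns to the previous behavior):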

InvalidArgumentError: Cannot assign a device for operation 'global_step': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
Colocation Debug Info:
Colocation group had the following types and devices: 
AssignAdd: CPU 
Const: GPU CPU 
Assign: CPU 
VariableV2: CPU 
Identity: CPU 

Colocation members and user-requested devices:
  global_step (VariableV2) /device:GPU:0
  global_step/read (Identity) /device:GPU:0
  global_step/Assign (Assign) /device:GPU:0
  opt/Adam/value (Const) /device:GPU:0
  opt/Adam (AssignAdd) /device:GPU:0

Registered kernels:
  device='CPU'
  device='GPU'; dtype in [DT_INT64]
  device='GPU'; dtype in [DT_DOUBLE]
  device='GPU'; dtype in [DT_FLOAT]
  device='GPU'; dtype in [DT_HALF]

     [[Node: global_step = VariableV2[container="", dtype=DT_INT32, shape=[], shared_name="", _device="/device:GPU:0"]()]]

Caused by op u'global_step', defined at:
  File "/home/.../anaconda3/envs/py27/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/home/.../anaconda3/envs/py27/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/.../anaconda3/envs/py27/lib/python2.7/site-packages/spyder_kernels/console/__main__.py", line 11, in <module>
    start.main()
  File "/home/.../anaconda3/envs/py27/lib/python2.7/site-packages/spyder_kernels/console/start.py", line 310, in main
    kernel.start()
  File "/home/.../anaconda3/envs/py27/lib/python2.7/site-packages/ipykernel/kernelapp.py", line 499, in start
    self.io_loop.start()
  File "/home/.../anaconda3/envs/py27/lib/python2.7/site-packages/tornado/ioloop.py", line 1073, in start
    handler_func(fd_obj, events)
  File "/home/.../anaconda3/envs/py27/lib/python2.7/site-packages/tornado/stack_context.py", line 300, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/.../anaconda3/envs/py27/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 456, in _handle_events
    self._handle_recv()
  File "/home/.../anaconda3/envs/py27/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 486, in _handle_recv
    self._run_callback(callback, msg)
  File "/home/.../anaconda3/envs/py27/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 438, in _run_callback
    callback(*args, **kwargs)
  File "/home/.../anaconda3/envs/py27/lib/python2.7/site-packages/tornado/stack_context.py", line 300, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/.../anaconda3/envs/py27/lib/python2.7/site-packages/ipykernel/kernelbase.py", line 283, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "/home/.../anaconda3/envs/py27/lib/python2.7/site-packages/ipykernel/kernelbase.py", line 233, in dispatch_shell
    handler(stream, idents, msg)
  File "/home/.../anaconda3/envs/py27/lib/python2.7/site-packages/ipykernel/kernelbase.py", line 399, in execute_request
    user_expressions, allow_stdin)
  File "/home/.../anaconda3/envs/py27/lib/python2.7/site-packages/ipykernel/ipkernel.py", line 208, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "/home/.../anaconda3/envs/py27/lib/python2.7/site-packages/ipykernel/zmqshell.py", line 537, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "/home/.../anaconda3/envs/py27/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2714, in run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "/home/.../anaconda3/envs/py27/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2824, in run_ast_nodes
    if self.run_code(code, result):
  File "/home/.../anaconda3/envs/py27/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2878, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-1-83c713e248d3>", line 1, in <module>
    runfile('/home/.../face-aging-caae/Face-Aging-CAAE-master/main.py', wdir='/home/.../face-aging-caae/Face-Aging-CAAE-master')
  File "/home/.../anaconda3/envs/py27/lib/python2.7/site-packages/spyder_kernels/customize/spydercustomize.py", line 786, in runfile
    execfile(filename, namespace)
  File "/home/.../anaconda3/envs/py27/lib/python2.7/site-packages/spyder_kernels/customize/spydercustomize.py", line 102, in execfile
    builtins.execfile(filename, *where)
  File "/home/.../face-aging-caae/Face-Aging-CAAE-master/main.py", line 70, in <module>
    tf.app.run()
  File "/home/.../anaconda3/envs/py27/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/home/.../face-aging-caae/Face-Aging-CAAE-master/main.py", line 59, in main
    use_init_model=FLAGS.use_init_model
  File "FaceAging.py", line 208, in train
    self.EG_global_step = tf.Variable(0, trainable=False, name='global_step')
  File "/home/.../anaconda3/envs/py27/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 235, in __init__
    constraint=constraint)
  File "/home/.../anaconda3/envs/py27/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 365, in _init_from_args
    name=name)
  File "/home/.../anaconda3/envs/py27/lib/python2.7/site-packages/tensorflow/python/ops/state_ops.py", line 135, in variable_op_v2
    shared_name=shared_name)
  File "/home/.../anaconda3/envs/py27/lib/python2.7/site-packages/tensorflow/python/ops/gen_state_ops.py", line 1131, in variable_v2
    shared_name=shared_name, name=name)
  File "/home/.../anaconda3/envs/py27/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/.../anaconda3/envs/py27/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3290, in create_op
    op_def=op_def)
  File "/home/.../anaconda3/envs/py27/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1654, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'global_step': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
Colocation Debug Info:
Colocation group had the following types and devices: 
AssignAdd: CPU 
Const: GPU CPU 
Assign: CPU 
VariableV2: CPU 
Identity: CPU 

Colocation members and user-requested devices:
  global_step (VariableV2) /device:GPU:0
  global_step/read (Identity) /device:GPU:0
  global_step/Assign (Assign) /device:GPU:0
  opt/Adam/value (Const) /device:GPU:0
  opt/Adam (AssignAdd) /device:GPU:0

Registered kernels:
  device='CPU'
  device='GPU'; dtype in [DT_INT64]
  device='GPU'; dtype in [DT_DOUBLE]
  device='GPU'; dtype in [DT_FLOAT]
  device='GPU'; dtype in [DT_HALF]

     [[Node: global_step = VariableV2[container="", dtype=DT_INT32, shape=[], shared_name="", _device="/device:GPU:0"]()]]

I am a TensorFlow beginner. Also, if there is anything to consider when running this kind of code on a weak GPU, please let me know.

Thanks.

Answer 1

Running the code from VSCode, this message appeared:

Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo<T>(), &algorithms) Aborted (core dumped)
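
A failed native check like the one above aborts the process, so an interactive console may show nothing at all. One way to make such a crash visible is to launch the script as a child process and inspect its exit status and stderr. The sketch below uses only the standard library; run_and_report is a hypothetical helper name, and the negative return code convention applies to POSIX systems such as the Ubuntu setup in the question.

```python
import signal
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd as a child process and describe how it ended.

    Returns (return_code, status_text, captured_stderr).
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    _, err = proc.communicate()
    code = proc.returncode
    if code == -signal.SIGABRT:
        # "Aborted (core dumped)" in a shell corresponds to SIGABRT.
        status = "aborted (SIGABRT)"
    elif code < 0:
        # On POSIX, a negative return code means "killed by that signal".
        status = "killed by signal %d" % -code
    else:
        status = "exited with code %d" % code
    return code, status, err.decode("utf-8", "replace")


# Hypothetical usage from the repository directory:
#   code, status, err = run_and_report([sys.executable, "main.py"])
#   print(status)
#   print(err)
```

Running the script directly in a terminal gives the same information; the wrapper is only convenient when the launch has to happen from inside another Python session.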

I checked compatibility and found that this version of TF requires cuDNN 7.3, according to this table: https://www.tensorflow.org/install/source#tested_build_configurations

I downgraded cuDNN to 7.0.5 and the code ran without problems (7h and 7m, respectively).
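
To see which cuDNN version is actually installed before and after such a change, the version macros in the cuDNN header can be read. This is a minimal sketch, assuming a cuDNN 7.x install whose cudnn.h defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL; the helper name is hypothetical, and the header location varies by install, so the function takes the header text rather than a path.

```python
import re


def cudnn_version_from_header(header_text):
    """Return (major, minor, patch) parsed from the text of cudnn.h, or None."""
    parts = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(r"^\s*#define\s+%s\s+(\d+)" % name,
                          header_text, re.MULTILINE)
        if match is None:
            return None  # not a cuDNN header, or a different header layout
        parts.append(int(match.group(1)))
    return tuple(parts)


# Sample text imitating the relevant lines of a cuDNN 7.0.5 header.
sample = (
    "#define CUDNN_MAJOR 7\n"
    "#define CUDNN_MINOR 0\n"
    "#define CUDNN_PATCHLEVEL 5\n"
)
print(cudnn_version_from_header(sample))  # (7, 0, 5)
```

For a real install, read the header file's contents (for example with open(path).read(), where path is wherever cudnn.h lives on that system) and pass the text to the function.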

TensorFlow code ends during "session.run()" with no error output on a 1050 Ti GPU