ResourceException when using the GPU but not the CPU on Azure

Asked: 2018-03-07 13:24:17

Tags: azure gpu

My code builds the graph and runs it successfully on Azure ML in CPU mode, but in GPU mode it reports a ResourceException during the graph-building stage.

I switch between CPU and GPU mode simply by removing the device command:

with tf.device('/cpu:0'), tf.name_scope('embedding'):  # CPU mode runs fine

with tf.name_scope('embedding'):  # GPU mode throws the exception

I tried loading less data, but that did not help either.

I suspect I missed a step when setting up the GPU. Any ideas?

Azure error message:

tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[78298,300]
  [[Node: embedding_matrix/Assign = Assign[T=DT_FLOAT, _class=["loc:@embedding_matrix"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](embedding_matrix, embedding_matrix/Initializer/Const)]]

Full error message:

Traceback (most recent call last):
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
    return fn(*args)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
    status, run_metadata)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[78298,300]
  [[Node: embedding_matrix/Assign = Assign[T=DT_FLOAT, _class=["loc:@embedding_matrix"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](embedding_matrix, embedding_matrix/Initializer/Const)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "NN.py", line 130, in <module>
    sess.run(init)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[78298,300]
  [[Node: embedding_matrix/Assign = Assign[T=DT_FLOAT, _class=["loc:@embedding_matrix"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](embedding_matrix, embedding_matrix/Initializer/Const)]]

Caused by op 'embedding_matrix/Assign', defined at:
  File "NN.py", line 120, in <module>
    , trainable=False)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 1203, in get_variable
    constraint=constraint)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 1092, in get_variable
    constraint=constraint)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 425, in get_variable
    constraint=constraint)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 394, in _true_getter
    use_resource=use_resource, constraint=constraint)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 805, in _get_single_variable
    constraint=constraint)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/variables.py", line 213, in __init__
    constraint=constraint)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/variables.py", line 346, in _init_from_args
    validate_shape=validate_shape).op
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/state_ops.py", line 276, in assign
    validate_shape=validate_shape)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/gen_state_ops.py", line 57, in assign
    use_locking=use_locking, name=name)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[78298,300]
  [[Node: embedding_matrix/Assign = Assign[T=DT_FLOAT, _class=["loc:@embedding_matrix"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](embedding_matrix, embedding_matrix/Initializer/Const)]]
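
For scale, it may help to work out how large the tensor named in the error actually is. The sketch below is plain arithmetic on the shape from the message; it assumes 4 bytes per element because the node reports T=DT_FLOAT, and the helper name `tensor_bytes` is made up for illustration.

```python
# Rough size of the tensor named in the OOM message.
# Assumption: 4 bytes per element, since the node reports T=DT_FLOAT (float32).

def tensor_bytes(shape, bytes_per_element=4):
    """Return the number of bytes needed to store a dense tensor of `shape`."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element

embedding_shape = (78298, 300)  # shape taken from the error message
size_bytes = tensor_bytes(embedding_shape)
size_mib = size_bytes / (1024 ** 2)

print("{} bytes (~{:.1f} MiB)".format(size_bytes, size_mib))
# prints: 93957600 bytes (~89.6 MiB)
```

Roughly 90 MiB is small next to the memory of a typical data-center GPU, so the failure probably does not come from this matrix alone. It seems more likely that most of the device memory was already taken when the allocation was attempted. Note also that the `Initializer/Const` input suggests the values are stored in the graph as a constant, so a second copy of the matrix may exist alongside the variable.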

1 answer:

Answer 0: (score: 0)

On N-series machines, host memory is much larger than device memory. Are you sure you are not simply exceeding the device's capacity?
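
One way to check this is to look at how much device memory is free before the session starts. Below is a minimal sketch, assuming `nvidia-smi` is installed and on the PATH of the VM. The helper names and the sample output are hypothetical and not taken from the original post.

```python
import subprocess

# nvidia-smi query that prints one CSV line per GPU, with values in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]

def parse_gpu_memory(csv_text):
    """Parse the CSV text produced by QUERY into a list of dicts (MiB values)."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, used, total = [field.strip() for field in line.split(",")]
        gpus.append({
            "index": int(index),
            "used_mib": int(used),
            "total_mib": int(total),
            "free_mib": int(total) - int(used),
        })
    return gpus

def query_gpu_memory():
    """Run nvidia-smi and return the parsed memory usage for every GPU."""
    output = subprocess.check_output(QUERY, universal_newlines=True)
    return parse_gpu_memory(output)

# Hypothetical sample output, for illustration only.
sample = "0, 11000, 11441\n"
for gpu in parse_gpu_memory(sample):
    print("GPU {index}: {free_mib} MiB free of {total_mib} MiB".format(**gpu))
```

On the actual VM, calling `query_gpu_memory()` just before creating the session would show whether another process, or an earlier session that was never closed, is already holding most of the memory.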