TensorFlow error: "Container localhost does not exist. (Could not find resource: localhost/_AnonymousVar0)"

Date: 2020-05-29 05:52:49

Tags: python tensorflow keras deep-learning

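I am trying to run the code from TensorFlow's Transformer tutorial on a different dataset, locally on my laptop. Unfortunately, it fails with the error: Container localhost does not exist. (Could not find resource: localhost/_AnonymousVar0). (I believe this is the main error, but I could be wrong.)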

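What makes this unusual is that the error only appears after the model has trained for several epochs (in the log below, at step 174 of epoch 9, after eight complete epochs).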

Here is the full log. (I trimmed the top part, which shows TensorFlow's initialization; it contained no errors or warnings.)

Train for 7290 steps
Epoch 1/15
2020-05-28 22:57:18.046206: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
7290/7290 [==============================] - 1986s 272ms/step - loss: 2.0052 - accuracy: 0.0939
Epoch 2/15
7290/7290 [==============================] - 1971s 270ms/step - loss: 1.6234 - accuracy: 0.1223
Epoch 3/15
7290/7290 [==============================] - 1968s 270ms/step - loss: 1.5535 - accuracy: 0.1291
Epoch 4/15
7290/7290 [==============================] - 1968s 270ms/step - loss: 1.5192 - accuracy: 0.1325
Epoch 5/15
7290/7290 [==============================] - 1968s 270ms/step - loss: 1.4978 - accuracy: 0.1348
Epoch 6/15
7290/7290 [==============================] - 1967s 270ms/step - loss: 1.4825 - accuracy: 0.1364
Epoch 7/15
7290/7290 [==============================] - 1967s 270ms/step - loss: 1.4711 - accuracy: 0.1376
Epoch 8/15
7290/7290 [==============================] - 1966s 270ms/step - loss: 1.4621 - accuracy: 0.1386
Epoch 9/15
 174/7290 [..............................] - ETA: 32:11 - loss: 1.4382 - accuracy: 0.13312020-05-29 03:20:43.528885: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at resource_variable_ops.cc:540 : Not found: Container localhost does not exist. (Could not find resource: localhost/_AnonymousVar0)
2020-05-29 03:20:43.528953: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Not found: Container localhost does not exist. (Could not find resource: localhost/_AnonymousVar0)
     [[{{node Adam/Adam/update/AssignSubVariableOp}}]]
     [[GroupCrossDeviceControlEdges_0/Adam/Adam/Const/_301]]
2020-05-29 03:20:43.529025: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Not found: Container localhost does not exist. (Could not find resource: localhost/_AnonymousVar0)
     [[{{node Adam/Adam/update/AssignSubVariableOp}}]]
 175/7290 [..............................] - ETA: 32:14 - loss: 1.4382 - accuracy: 0.1331Traceback (most recent call last):
  File "model.py", line 114, in <module>
    model.fit(dataset, epochs=EPOCHS)
  File "/home/atulu/anaconda3/envs/tf2-gpu/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training.py", line 819, in fit
    use_multiprocessing=use_multiprocessing)
  File "/home/atulu/anaconda3/envs/tf2-gpu/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 342, in fit
    total_epochs=epochs)
  File "/home/atulu/anaconda3/envs/tf2-gpu/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 128, in run_one_epoch
    batch_outs = execution_function(iterator)
  File "/home/atulu/anaconda3/envs/tf2-gpu/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py", line 98, in execution_function
    distributed_function(input_fn))
  File "/home/atulu/anaconda3/envs/tf2-gpu/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 568, in __call__
    result = self._call(*args, **kwds)
  File "/home/atulu/anaconda3/envs/tf2-gpu/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 599, in _call
    return self._stateless_fn(*args, **kwds)  # pylint: disable=not-callable
  File "/home/atulu/anaconda3/envs/tf2-gpu/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 2363, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/home/atulu/anaconda3/envs/tf2-gpu/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 1611, in _filtered_call
    self.captured_inputs)
  File "/home/atulu/anaconda3/envs/tf2-gpu/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 1692, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/home/atulu/anaconda3/envs/tf2-gpu/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 545, in call
    ctx=ctx)
  File "/home/atulu/anaconda3/envs/tf2-gpu/lib/python3.7/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
  (0) Not found:  Container localhost does not exist. (Could not find resource: localhost/_AnonymousVar0)
     [[node Adam/Adam/update/AssignSubVariableOp (defined at model.py:114) ]]
  (1) Not found:  Container localhost does not exist. (Could not find resource: localhost/_AnonymousVar0)
     [[node Adam/Adam/update/AssignSubVariableOp (defined at model.py:114) ]]
     [[GroupCrossDeviceControlEdges_0/Adam/Adam/Const/_301]]
0 successful operations.
0 derived errors ignored. [Op:__inference_distributed_function_15977]

Errors may have originated from an input operation.
Input Source operations connected to node Adam/Adam/update/AssignSubVariableOp:
 transformer/encoder/embedding/embedding_lookup/11773 (defined at /home/atulu/anaconda3/envs/tf2-gpu/lib/python3.7/contextlib.py:112)

Input Source operations connected to node Adam/Adam/update/AssignSubVariableOp:
 transformer/encoder/embedding/embedding_lookup/11773 (defined at /home/atulu/anaconda3/envs/tf2-gpu/lib/python3.7/contextlib.py:112)

Function call stack:
distributed_function -> distributed_function
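
For readers skimming the log above, the following is a minimal sketch of how the relevant details (the missing resource and the graph node that asked for it) could be pulled out of such a log with the standard library. The function name summarize_not_found and both regular expressions are assumptions made for illustration, written against the message format shown in this log only; they are not part of TensorFlow.

```python
import re

# Pattern for the "Not found" message as it appears in the log above
# (an assumption: other TensorFlow versions may word it differently).
RESOURCE_RE = re.compile(
    r"Container (?P<container>\S+) does not exist\. "
    r"\(Could not find resource: (?P<resource>[^)]+)\)"
)
# Pattern for node references such as "[[{{node X}}]]" or "[[node X (defined at ...) ]]".
NODE_RE = re.compile(r"\[\[(?:\{\{)?node (?P<node>[^\s}]+)")


def summarize_not_found(log_text):
    """Return the distinct (container, resource) pairs and node names found in log_text."""
    resources = sorted(
        {(m.group("container"), m.group("resource"))
         for m in RESOURCE_RE.finditer(log_text)}
    )
    nodes = sorted({m.group("node") for m in NODE_RE.finditer(log_text)})
    return {"resources": resources, "nodes": nodes}


# A few lines copied from the log above.
sample = """\
  (0) Not found:  Container localhost does not exist. (Could not find resource: localhost/_AnonymousVar0)
     [[node Adam/Adam/update/AssignSubVariableOp (defined at model.py:114) ]]
     [[{{node Adam/Adam/update/AssignSubVariableOp}}]]
     [[GroupCrossDeviceControlEdges_0/Adam/Adam/Const/_301]]
"""
summary = summarize_not_found(sample)
print(summary)
```

On this log, the sketch reports a single missing resource and a single node, the Adam optimizer's variable update, which is consistent with the traceback pointing at model.fit.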

The code I am using:
model.py (https://pastebin.com/FVaj1V5W): this is the file that runs the training.

The model is defined in another script in the same directory, model_definition.py (https://pastebin.com/HyV2RMY2).

Environment:
TensorFlow version: 2.1.0 (TensorFlow GPU)
Python version: 3.7.7
GPU: Nvidia GTX 1660 Ti, 6GB
OS: Ubuntu 20.04 LTS
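
For anyone trying to reproduce this, the environment details above could be collected with a short script along these lines. This is only a sketch: the helper name describe_environment is hypothetical, and it assumes TensorFlow 2.1 or later, where tf.config.list_physical_devices is available. If TensorFlow is not installed, it reports that instead of failing.

```python
import platform


def describe_environment():
    """Collect version details that are useful when reporting a TensorFlow error."""
    info = {
        "python": platform.python_version(),
        "os": platform.platform(),
    }
    try:
        import tensorflow as tf  # optional: only reported when installed
    except ImportError:
        info["tensorflow"] = None
        info["gpus"] = []
    else:
        info["tensorflow"] = tf.__version__
        # Assumes TensorFlow >= 2.1, where this call is part of the public API.
        info["gpus"] = [d.name for d in tf.config.list_physical_devices("GPU")]
    return info


env = describe_environment()
for key, value in env.items():
    print(f"{key}: {value}")
```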

0 answers:

No answers yet.