Question

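I am testing distributed training with TensorFlow estimators. In my example I fit a simple sine function with a custom estimator using tf.estimator.train_and_evaluate. After training and evaluation I want to export the model so that it is ready for TensorFlow Serving. However, evaluation and export are only triggered when the estimator is executed in a non-distributed way.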

The model and estimator are defined as follows:

{{1}}

When executing this code in a single process, I get an output folder containing the model checkpoints, the evaluation data and the model export:

$ ls sin_model/
checkpoint                                  model.ckpt-0.index
eval                                        model.ckpt-0.meta
events.out.tfevents.1532426226.simon        model.ckpt-1000.data-00000-of-00001
export                                      model.ckpt-1000.index
graph.pbtxt                                 model.ckpt-1000.meta
model.ckpt-0.data-00000-of-00001

However, when distributing the training process (in this test setup only on the local machine), the eval and export folders are missing.

I start the individual nodes using the following cluster config:

{"cluster": {
    "ps": ["localhost:2222"],
    "chief": ["localhost:2223"],
    "worker": ["localhost:2224"]
}}

The ps server is started as follows:

$ TF_CONFIG='{"cluster": {"chief": ["localhost:2223"], "worker": ["localhost:2224"], "ps": ["localhost:2222"]}, "task": {"type": "ps", "index": 0}}' CUDA_VISIBLE_DEVICES= python custom_estimator.py
2018-07-24 12:09:04.913967: E tensorflow/stream_executor/cuda/cuda_driver.cc:397] failed call to cuInit: CUDA_ERROR_NO_DEVICE
2018-07-24 12:09:04.914008: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:132] retrieving CUDA diagnostic information for host: simon
2018-07-24 12:09:04.914013: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:139] hostname: simon
2018-07-24 12:09:04.914035: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] libcuda reported version is: 384.130.0
2018-07-24 12:09:04.914059: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:167] kernel reported version is: 384.130.0
2018-07-24 12:09:04.914079: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:249] kernel version seems to match DSO: 384.130.0
2018-07-24 12:09:04.914961: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job chief -> {0 -> localhost:2223}
2018-07-24 12:09:04.914971: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:2222}
2018-07-24 12:09:04.914976: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:2224}
2018-07-24 12:09:04.915658: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:369] Started server with target: grpc://localhost:2222

(I appended CUDA_VISIBLE_DEVICES= to the command line to prevent the worker and the chief from allocating GPU memory. This causes the failed call to cuInit: CUDA_ERROR_NO_DEVICE error, which is not critical.)

The chief is then started as follows:

$ TF_CONFIG='{"cluster": {"chief": ["localhost:2223"], "worker": ["localhost:2224"], "ps": ["localhost:2222"]}, "task": {"type": "chief", "index": 0}}' CUDA_VISIBLE_DEVICES= python custom_estimator.py
2018-07-24 12:09:10.532171: E tensorflow/stream_executor/cuda/cuda_driver.cc:397] failed call to cuInit: CUDA_ERROR_NO_DEVICE
2018-07-24 12:09:10.532234: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:132] retrieving CUDA diagnostic information for host: simon
2018-07-24 12:09:10.532241: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:139] hostname: simon
2018-07-24 12:09:10.532298: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] libcuda reported version is: 384.130.0
2018-07-24 12:09:10.532353: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:167] kernel reported version is: 384.130.0
2018-07-24 12:09:10.532359: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:249] kernel version seems to match DSO: 384.130.0
2018-07-24 12:09:10.533195: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job chief -> {0 -> localhost:2223}
2018-07-24 12:09:10.533207: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:2222}
2018-07-24 12:09:10.533211: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:2224}
2018-07-24 12:09:10.533835: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:369] Started server with target: grpc://localhost:2223
2018-07-24 12:09:14.038636: I tensorflow/core/distributed_runtime/master_session.cc:1165] Start master session 71a2748ad69725ae with config: allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: ONE } }

The worker is then started as follows:

$ TF_CONFIG='{"cluster": {"chief": ["localhost:2223"], "worker": ["localhost:2224"], "ps": ["localhost:2222"]}, "task": {"type": "worker", "index": 0}}' CUDA_VISIBLE_DEVICES= python custom_estimator.py
2018-07-24 12:09:13.172260: E tensorflow/stream_executor/cuda/cuda_driver.cc:397] failed call to cuInit: CUDA_ERROR_NO_DEVICE
2018-07-24 12:09:13.172320: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:132] retrieving CUDA diagnostic information for host: simon
2018-07-24 12:09:13.172327: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:139] hostname: simon
2018-07-24 12:09:13.172362: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] libcuda reported version is: 384.130.0
2018-07-24 12:09:13.172399: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:167] kernel reported version is: 384.130.0
2018-07-24 12:09:13.172405: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:249] kernel version seems to match DSO: 384.130.0
2018-07-24 12:09:13.173230: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job chief -> {0 -> localhost:2223}
2018-07-24 12:09:13.173242: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:2222}
2018-07-24 12:09:13.173247: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:2224}
2018-07-24 12:09:13.173783: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:369] Started server with target: grpc://localhost:2224
2018-07-24 12:09:18.774264: I tensorflow/core/distributed_runtime/master_session.cc:1165] Start master session 1d13ac84816fdc80 with config: allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: ONE } }

After a while the chief process stops and the sin_model folder exists with the model checkpoints, but without export or eval:

$ ls sin_model/
checkpoint                                  model.ckpt-0.meta
events.out.tfevents.1532426950.simon        model.ckpt-1001.data-00000-of-00001
graph.pbtxt                                 model.ckpt-1001.index
model.ckpt-0.data-00000-of-00001            model.ckpt-1001.meta
model.ckpt-0.index

Is any further configuration needed to evaluate or export in the distributed setup?

I am using Python 3.5 and TensorFlow 1.8.

Answer 1

In distributed mode, you can run evaluation in parallel with training by setting the task type to evaluator:

{
   "cluster": {
     "ps": ["localhost:2222"],
     "chief": ["localhost:2223"], 
     "worker": ["localhost:2224"]
   },
   "task": {
     "type": "evaluator", "index": 0
   },
   "environment": "cloud"
}

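You do not need to define evaluator in the cluster definition. Also, I am not sure whether this applies to your case, but setting environment: 'cloud' in the cluster configuration might help.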
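As a rough illustration of the evaluator setup, the sketch below builds the TF_CONFIG string for each process, including a fourth evaluator process that is not listed under "cluster". The helper names make_tf_config and launch_command are hypothetical and only serve this example. The ports and the script name custom_estimator.py are taken from the question. The optional "environment" key is left out here.

```python
import json

# Cluster layout copied from the question. The evaluator is deliberately
# absent: it is started as a separate process but is not a cluster member.
CLUSTER = {
    "ps": ["localhost:2222"],
    "chief": ["localhost:2223"],
    "worker": ["localhost:2224"],
}


def make_tf_config(task_type, index=0):
    """Return the TF_CONFIG JSON string for one process (hypothetical helper)."""
    return json.dumps({
        "cluster": CLUSTER,
        "task": {"type": task_type, "index": index},
    })


def launch_command(task_type, script="custom_estimator.py"):
    """Return a shell command that starts one task with GPUs hidden."""
    return "TF_CONFIG='{}' CUDA_VISIBLE_DEVICES= python {}".format(
        make_tf_config(task_type), script)


if __name__ == "__main__":
    # Print one command per process; each would run in its own terminal.
    for task_type in ("ps", "chief", "worker", "evaluator"):
        print(launch_command(task_type))
```

Running the script prints four command lines. The first three match the ones in the question, and the last one starts the additional evaluator process, which is the one expected to produce the eval and export output.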

Distributed TensorFlow Estimator execution does not trigger evaluation or export