TensorFlow distributed training w/ Estimator + Experiment framework

Time: 2017-03-27 05:05:53

Tags: tensorflow

Hi, I'm running into a strange state when trying distributed training using the Estimator + Experiment classes.

以下是一个例子:https://gist.github.com/protoget/2cf2b530bc300f209473374cf02ad829

This is a simple case that uses:

  1. DNNClassifier from the official TF tutorial
  2. the Experiment framework
  3. 1 worker and 1 ps on the same host, on different ports

Here is what happens:

    1) When I start the ps job, it looks fine:

    W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
    W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
    W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
    W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
    I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job ps -> {0 -> localhost:9000}
    I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job worker -> {0 -> 127.0.0.1:9001}
    I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:221] Started server with target: grpc://localhost:9000
    

    2) When I start the worker job, it exits silently, leaving no log output at all.

    Any help would be greatly appreciated.

2 answers:

Answer 0 (score: 1)

I had the same problem, and I finally found a solution.

The problem lies in config._environment:

import json
import os

from tensorflow.contrib.learn.python.learn.estimators import run_config

config = {"cluster": {'ps':     ['127.0.0.1:9000'],
                      'worker': ['127.0.0.1:9001']}}

# args.type comes from this script's own command-line flags.
if args.type == "worker":
    config["task"] = {'type': 'worker', 'index': 0}
else:
    config["task"] = {'type': 'ps', 'index': 0}

os.environ['TF_CONFIG'] = json.dumps(config)

config = run_config.RunConfig()

config._environment = run_config.Environment.CLOUD

Set config._environment to Environment.CLOUD.

Then you get a working distributed training setup.
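The TF_CONFIG construction above can be condensed into a small helper. The function name set_tf_config is mine, not part of the original code; it only builds and exports the environment variable, which is the portion of the fix that is plain Python:

```python
import json
import os


def set_tf_config(job_type, cluster, index=0):
    """Build the TF_CONFIG value for this process and export it.

    job_type is 'ps' or 'worker'; cluster maps job names to address
    lists. Returns the JSON string placed in the environment.
    """
    config = {
        "cluster": cluster,
        "task": {"type": job_type, "index": index},
    }
    value = json.dumps(config)
    os.environ["TF_CONFIG"] = value
    return value


# Each process calls this once with its own role before building RunConfig.
tf_config = set_tf_config("worker", {"ps": ["127.0.0.1:9000"],
                                     "worker": ["127.0.0.1:9001"]})
```

Both the ps and the worker process run the same code; only the job_type argument differs between them.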

I hope this helps :)

Answer 1 (score: 1)

I had the same problem. It is due to some internal TensorFlow code, I guess; I have opened a question about it on SO: TensorFlow: minimalist program fails on distributed mode.

I have also opened a pull request: https://github.com/tensorflow/tensorflow/issues/8796

There are two ways to deal with your problem. Since it is caused by your ClusterSpec having an implicit local environment, you can try setting a different one (google or cloud), but I cannot assure you that the rest of your work won't be affected. So I preferred to take a look at the code and fix local mode myself, which is what I explain below.
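For the first route (setting a different environment), a sketch that avoids touching the private attribute, assuming the contrib RunConfig of that era also reads an "environment" key from TF_CONFIG (treat that key as an assumption; verify against your TF version before relying on it):

```python
import json
import os

# Assumption: contrib's RunConfig picks up an "environment" key from
# TF_CONFIG, so config._environment would not need to be set manually.
config = {
    "cluster": {"ps": ["127.0.0.1:9000"],
                "worker": ["127.0.0.1:9001"]},
    "task": {"type": "worker", "index": 0},
    "environment": "cloud",
}
os.environ["TF_CONFIG"] = json.dumps(config)
```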

You will find a more precise explanation of why it fails in those posts. The fact is that Google has been silent so far, so what I did was patch their source code (in tensorflow/contrib/learn/python/learn/experiment.py):

# Start the server, if needed. It's important to start the server before
# we (optionally) sleep for the case where no device_filters are set.
# Otherwise, the servers will wait to connect to each other before starting
# to train. We might as well start as soon as we can.
config = self._estimator.config
if (config.environment != run_config.Environment.LOCAL and
    config.environment != run_config.Environment.GOOGLE and
    config.cluster_spec and config.master):
  self._start_server()

(This part prevents the server from starting in local mode, which is the mode you get when you set no environment in your cluster spec. So you should simply comment out config.environment != run_config.Environment.LOCAL and, and that should work.)
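To make the effect of the patch concrete, the guard can be modeled as a plain function (a simplified model of the condition, not the actual Experiment code):

```python
def should_start_server(environment, cluster_spec, master,
                        allow_local=True):
    """Simplified model of the guard around self._start_server().

    allow_local=True mirrors the patch (the LOCAL check commented out);
    allow_local=False mirrors the original upstream condition.
    """
    if not allow_local and environment == "local":
        return False
    if environment == "google":
        return False
    return bool(cluster_spec) and bool(master)


# Upstream: a cluster in the default local environment never starts
# its in-process server, so the worker has nothing to connect to.
assert not should_start_server("local", {"ps": ["127.0.0.1:9000"]},
                               "grpc://127.0.0.1:9001", allow_local=False)
# Patched: the same configuration does start the server.
assert should_start_server("local", {"ps": ["127.0.0.1:9000"]},
                           "grpc://127.0.0.1:9001", allow_local=True)
```

This is why both workarounds in this answer target the environment value: either change the environment so the LOCAL branch never fires, or remove the LOCAL check itself.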