I am trying to distribute training and evaluation across two machines. To do this I am trying to adapt the tf.contrib.learn.Experiment framework, but I can't seem to get the cluster spec right. Here is a simplified version of my code:
def get_schedule(run_config):
    if run_config.task_type == 'ps':
        return 'run_std_server'
    if run_config.task_type == 'worker':
        return 'train'
    if run_config.task_type == 'evaluator':
        return 'continuous_eval'
    if run_config.task_type == 'master':
        return 'train'
    raise ValueError('Unknown task type "{}"'.format(run_config.task_type))
def deeplpr_model_fn(features, labels, mode, cluster_spec={}):
    with tf.device(tf.train.replica_device_setter(cluster=cluster_spec)):
        logits = build_model(features['images'], mode)
    # [...] Standard Estimator setup for training & evaluation
    return tf.estimator.EstimatorSpec(mode=mode,
                                      predictions=predictions,
                                      export_outputs=export_outputs,
                                      loss=loss,
                                      train_op=train_op,
                                      eval_metric_ops=metrics)
def get_experiment(run_config=None, hparams=None):
    # Create the Estimator
    estimator = tf.estimator.Estimator(
        model_fn=lambda features, labels, mode: deeplpr_model_fn(
            features, labels, mode, run_config.cluster_spec),
        model_dir=FLAGS.model_dir,
        config=run_config)
    # Set up input functions for training and evaluation
    train_input_fn = lambda: input_fn(tf.estimator.ModeKeys.TRAIN, FLAGS.batch_size)
    eval_input_fn = lambda: input_fn(tf.estimator.ModeKeys.EVAL, FLAGS.batch_size)
    # Set up the experiment
    experiment = tf.contrib.learn.Experiment(
        estimator=estimator,
        train_input_fn=train_input_fn,
        eval_input_fn=eval_input_fn,
        train_steps=FLAGS.steps,
        eval_steps=None,
        eval_delay_secs=20,  # time to wait before running the first evaluation
        train_steps_per_iteration=2000)
    return experiment
The main function is as follows:
def distributed_main(unused_argv):
    import json
    # Set up environment variables according to the parameters passed to the process
    TF_CONFIG = {
        'cluster': {
            "worker": [
                "pc1:2222",
            ],
            "ps": [
                "pc1:2223",
            ],
            "evaluator": [
                "pc2:2224",
            ]
        },
        'environment': 'cluster',
        'task': {
            'type': unused_argv[1].strip(),
            'index': unused_argv[2].strip() if len(unused_argv) > 2 else 0
        }
    }
    os.environ['TF_CONFIG'] = json.dumps(TF_CONFIG)
    device_filters = ["/job:ps", "/job:worker"]
    if unused_argv[1] != 'worker':
        device_filters += ['/job:evaluator']
    session_config = tf.ConfigProto(device_filters=device_filters,
                                    allow_soft_placement=True)
    config = tf.contrib.learn.RunConfig(model_dir=FLAGS.model_dir,
                                        session_config=session_config)
    schedule = get_schedule(config)
    tf.logging.info('Beginning task {}:{}'.format(config.task_type, config.task_id))
    # Run the experiment
    tf.contrib.learn.learn_runner.run(get_experiment, schedule=schedule, run_config=config)
where unused_argv contains the job name and an optional index (defaulting to 0).
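For clarity, the per-process TF_CONFIG construction in distributed_main above can be isolated into a small standalone helper. This is only a sketch: the name make_tf_config is mine, while the cluster addresses mirror the snippet (workers and ps on pc1, evaluator on pc2):

```python
# Sketch of the per-process TF_CONFIG setup from distributed_main above.
# make_tf_config is a hypothetical helper name; the addresses are the
# ones used in the question.
import json
import os

CLUSTER = {
    "worker": ["pc1:2222"],
    "ps": ["pc1:2223"],
    "evaluator": ["pc2:2224"],
}

def make_tf_config(task_type, task_index=0):
    """Build the TF_CONFIG dict that one process exports before starting."""
    if task_type not in CLUSTER:
        raise ValueError('Unknown task type "{}"'.format(task_type))
    return {
        "cluster": CLUSTER,
        "environment": "cluster",
        "task": {"type": task_type, "index": task_index},
    }

# Each process exports its own copy before building a RunConfig, e.g.:
os.environ["TF_CONFIG"] = json.dumps(make_tf_config("evaluator"))
```

RunConfig picks TF_CONFIG up from the environment at construction time, so the three processes differ only in the command-line arguments they are launched with.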
Running three processes with the appropriate job names and task IDs, I cannot get the worker past the session-initialization step: it expects the evaluator to communicate with the chief worker, which it obviously never does, since the evaluator only runs continuous_eval.
Researching the problem, I found this answer, where they suggest adding a device_filter, so I tried adding:
device_filters = ["/job:ps", "/job:worker"]
if unused_argv[1] != 'worker':
    device_filters += ['/job:evaluator']
session_config = tf.ConfigProto(device_filters=device_filters)
config = tf.contrib.learn.RunConfig(model_dir=FLAGS.model_dir,
                                    session_config=session_config)
This effectively unblocks the worker and the ps, but now the evaluator crashes when trying to restore the latest checkpoint:
Traceback (most recent call last):
File "deeplpr.py", line 431, in <module>
tf.app.run(main=distributed_main, argv=[sys.argv[0]] + unparsed)
File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\platform\app.py", line 126, in run
_sys.exit(main(argv))
File "deeplpr.py", line 420, in distributed_main
tf.contrib.learn.learn_runner.run(get_experiment, schedule=schedule, run_config=config)
File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\contrib\learn\python\learn\learn_runner.py", line 218, in run
return _execute_schedule(experiment, schedule)
File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\contrib\learn\python\learn\learn_runner.py", line 46, in _execute_schedule
return task()
File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\contrib\learn\python\learn\experiment.py", line 573, in continuous_eval
continuous_eval_predicate_fn=continuous_eval_predicate_fn)
File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\contrib\learn\python\learn\experiment.py", line 533, in _continuous_eval
hooks=self._eval_hooks)
File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\contrib\learn\python\learn\experiment.py", line 894, in _call_evaluate
hooks=hooks)
File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\estimator\estimator.py", line 414, in evaluate
name=name)
File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\estimator\estimator.py", line 949, in _evaluate_model
config=self._session_config)
File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\evaluation.py", line 209, in _evaluate_once
session_creator=session_creator, hooks=hooks) as session:
File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\monitored_session.py", line 795, in __init__
stop_grace_period_secs=stop_grace_period_secs)
File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\monitored_session.py", line 518, in __init__
self._sess = _RecoverableSession(self._coordinated_creator)
File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\monitored_session.py", line 981, in __init__
_WrappedSession.__init__(self, self._create_session())
File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\monitored_session.py", line 986, in _create_session
return self._sess_creator.create_session()
File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\monitored_session.py", line 675, in create_session
self.tf_sess = self._session_creator.create_session()
File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\monitored_session.py", line 446, in create_session
init_fn=self._scaffold.init_fn)
File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\session_manager.py", line 275, in prepare_session
config=config)
File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\session_manager.py", line 191, in _restore_checkpoint
saver.restore(sess, checkpoint_filename_with_path)
File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\training\saver.py", line 1760, in restore
{self.saver_def.filename_tensor_name: save_path})
File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 905, in run
run_metadata_ptr)
File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1137, in _run
feed_dict_tensor, options, run_metadata)
File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1355, in _do_run
options, run_metadata)
File "C:\Users\1\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1374, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation 'save/RestoreV2_1': Operation was explicitly assigned to /job:ps/task:0/device:CPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0 ]. Make sure the device specification refers to a valid device.
[[Node: save/RestoreV2_1 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:ps/task:0/device:CPU:0"](save/Const, save/RestoreV2_1/tensor_names, save/RestoreV2_1/shape_and_slices)]]
What is the correct way to specify workers dedicated to evaluation only?
In TensorFlow's logs I can see that the RunConfig being used has a parameter '_evaluation_master': '', but I cannot find any documentation about it. Is it related? Are there any working examples showing how to split an Experiment between training and evaluation?
Following the suggestions, I added log_device_placement=True when defining the session_config. However, the log output seems to show that the crash happens before any device placement is logged:
INFO:tensorflow:Waiting 20.000000 secs before starting eval.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-02-19-15:35:04
INFO:tensorflow:Graph was finalized.
2018-02-19 16:35:04.888165: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
INFO:tensorflow:Restoring parameters from models\test_distributed\model.ckpt-2373
# Here starts the traceback of the error, same as above
This is somewhat confusing: shouldn't I see the placement log when the session is created? And doesn't the restore op need a session to run in?
Setting allow_soft_placement=True did not change anything in the logs or in the error either.
log_device_placement=True only logs placements on worker:0 (i.e. the chief machine), which I believe is the expected behavior.
I have updated the code above to reflect how I set allow_soft_placement=True (only the main function changed).
Answer (score: 1)
Add allow_soft_placement=True to the session_config on PC2:
session_config = tf.ConfigProto(allow_soft_placement=True)
Read this section for more information about this parameter:

If you would like TensorFlow to automatically choose an existing and supported device to run the operations in case the specified one doesn't exist, you can set allow_soft_placement to True in the configuration option when creating the session.
After taking a closer look at the error log, I found that this line triggers the error:
tf.contrib.learn.learn_runner.run(
    get_experiment,
    schedule=schedule,
    run_config=config
)
So allow_soft_placement=True should be set via the run_config argument, as follows:
config = tf.contrib.learn.RunConfig(
    model_dir=FLAGS.model_dir,
    session_config=tf.ConfigProto(allow_soft_placement=True)
)
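Combining the two fixes, the per-task device filtering from the question can be expressed as a small helper (a sketch with a name of my own choosing, not from the original post); its result would be passed to tf.ConfigProto together with allow_soft_placement=True:

```python
def filters_for(task_type):
    """Device filters for one task type: every task sees ps and workers,
    but only non-worker tasks also see the evaluator, so workers never
    block waiting for it during session initialization."""
    filters = ["/job:ps", "/job:worker"]
    if task_type != "worker":
        filters.append("/job:evaluator")
    return filters

# Each process would then build its session config roughly as:
# session_config = tf.ConfigProto(device_filters=filters_for(task_type),
#                                 allow_soft_placement=True)
```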