I am reading the pages of the TensorFlow tutorials on distributed training, in particular the Distributed TensorFlow page and the example given in Multi-worker Training Using Distribution Strategies, but the explanations there are somewhat incomplete or confusing. I want to run a simple but complete distributed training example using the processors of different machines, and I think the Estimators described in those links are the best way to achieve this.
From my reading I understand that I should define a ps_host and some worker_hosts as described in the first link. My script would be similar, but I would run it on each machine with different command-line arguments. I think I can reuse the entire code shown in the "Putting it all together" section of that page, except for the part under the with tf.device(... line that follows elif FLAGS.job_name == "worker":, because that part does not use an Estimator and contains no code for reading any input. To replace it, I looked at the script keras_model_to_estimator.py from the second link and pasted in the code of its main() function (changing MirroredStrategy to MultiWorkerMirroredStrategy, since I have no GPUs and want to distribute the work across several CPUs). In addition, I added an input_fn() function to the script.
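For concreteness, this is roughly how I plan to launch the same script on each machine, following the pattern in the first link (the script name trainer.py and the hostnames/ports below are only placeholders for my actual machines):

# On the parameter server machine:
python trainer.py --ps_hosts=10.0.0.1:2222 --worker_hosts=10.0.0.2:2222,10.0.0.3:2222 --job_name=ps --task_index=0
# On the first worker machine:
python trainer.py --ps_hosts=10.0.0.1:2222 --worker_hosts=10.0.0.2:2222,10.0.0.3:2222 --job_name=worker --task_index=0
# On the second worker machine:
python trainer.py --ps_hosts=10.0.0.1:2222 --worker_hosts=10.0.0.2:2222,10.0.0.3:2222 --job_name=worker --task_index=1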
My first question is: is this approach correct, or am I missing or misunderstanding something?
The second is: when I run the code below, it says module 'tensorflow.contrib.distribute' has no attribute 'MultiWorkerMirroredStrategy'. How can I use this strategy? I have no GPUs and I am running TensorFlow 1.13.
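For reference, if I am reading keras_model_to_estimator.py correctly, the corresponding lines in that script look roughly like this before my change; the only thing I modified is the eval_distribute strategy (MirroredStrategy to MultiWorkerMirroredStrategy):

config = tf.estimator.RunConfig(
    experimental_distribute=tf.contrib.distribute.DistributeConfig(
        train_distribute=tf.contrib.distribute.CollectiveAllReduceStrategy(
            num_gpus_per_worker=2),
        eval_distribute=tf.contrib.distribute.MirroredStrategy(
            num_gpus_per_worker=2)))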
My code is as follows:
import argparse
import sys

import tensorflow as tf
import numpy as np

FLAGS = None


def input_fn():
  x = np.random.random((1024, 10))
  y = np.random.randint(2, size=(1024, 1))
  x = tf.cast(x, tf.float32)
  dataset = tf.data.Dataset.from_tensor_slices((x, y))
  dataset = dataset.repeat(100)
  dataset = dataset.batch(32)
  return dataset


def main(_):
  ps_hosts = FLAGS.ps_hosts.split(",")
  worker_hosts = FLAGS.worker_hosts.split(",")

  # Create a cluster from the parameter server and worker hosts.
  cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})

  # Create and start a server for the local task.
  server = tf.train.Server(cluster,
                           job_name=FLAGS.job_name,
                           task_index=FLAGS.task_index)

  if FLAGS.job_name == "ps":
    server.join()
  elif FLAGS.job_name == "worker":
    # Assigns ops to the local worker by default.
    with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % FLAGS.task_index,
        cluster=cluster)):

      model_dir = 'C:/Temp'
      print('Using %s to store checkpoints.' % model_dir)

      # Define a Keras Model.
      model = tf.keras.Sequential()
      model.add(tf.keras.layers.Dense(16, activation='relu', input_shape=(10,)))
      model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

      # Compile the model.
      optimizer = tf.train.GradientDescentOptimizer(0.2)
      model.compile(loss='binary_crossentropy', optimizer=optimizer)
      model.summary()
      tf.keras.backend.set_learning_phase(True)
      # Define DistributionStrategies and convert the Keras Model to an
      # Estimator that utilizes these DistributionStrategies.
      # The evaluator is a single worker, so the tutorial used MirroredStrategy here.
      config = tf.estimator.RunConfig(
          experimental_distribute=tf.contrib.distribute.DistributeConfig(
              train_distribute=tf.contrib.distribute.CollectiveAllReduceStrategy(
                  num_gpus_per_worker=2),
              eval_distribute=tf.contrib.distribute.MultiWorkerMirroredStrategy(
                  num_gpus_per_worker=2)))

      keras_estimator = tf.keras.estimator.model_to_estimator(
          keras_model=model, config=config, model_dir=model_dir)

      # Train and evaluate the model. Evaluation will be skipped if there is
      # not an "evaluator" job in the cluster.
      tf.estimator.train_and_evaluate(
          keras_estimator,
          train_spec=tf.estimator.TrainSpec(input_fn=input_fn),
          eval_spec=tf.estimator.EvalSpec(input_fn=input_fn))

if __name__ == "__main__":
  parser = argparse.ArgumentParser()
  parser.register("type", "bool", lambda v: v.lower() == "true")
  # Flags for defining the tf.train.ClusterSpec
  parser.add_argument(
      "--ps_hosts",
      type=str,
      default="",
      help="Comma-separated list of hostname:port pairs"
  )
  parser.add_argument(
      "--worker_hosts",
      type=str,
      default="",
      help="Comma-separated list of hostname:port pairs"
  )
  parser.add_argument(
      "--job_name",
      type=str,
      default="",
      help="One of 'ps', 'worker'"
  )
  # Flags for defining the tf.train.Server
  parser.add_argument(
      "--task_index",
      type=int,
      default=0,
      help="Index of task within the job"
  )
  FLAGS, unparsed = parser.parse_known_args()
  tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)