I am using GANEstimator together with MirroredStrategy to work on multiple GPUs of a single instance. In my case input_fn is a tf.data.Dataset with the following settings:
dataset = dataset.repeat()
dataset = dataset.shuffle(buffer_size=100)
dataset = dataset.batch(self.batch_size, drop_remainder=True)
dataset = dataset.prefetch(100)
The reason I am asking is: do I need to specify something like dataset.shard() manually in order to have different data passed to the workers? I am digging through the code of Estimator and MirroredStrategy, but it is not clear to me what is going on. Additional confusion comes from the description of distributed strategies:
MirroredStrategy: This does in-graph replication with synchronous
training on many GPUs on one machine. Essentially, we create copies of all
variables in the model's layers on each device. We then use all-reduce
to combine gradients across the devices before applying them
to the variables to keep them in sync.
CollectiveAllReduceStrategy: This is a version of MirroredStrategy
for multi-worker training.
So does MirroredStrategy use only one worker? What I also don't understand is that I need to specify a batch size equal to the capacity of one tower, otherwise I get OOM. Can someone please point me to the code and explain how such a simple setup works with batching:
def create_dataset():
    ...
    dataset = dataset.repeat()
    dataset = dataset.shuffle(buffer_size=100)
    dataset = dataset.batch(self.batch_size, drop_remainder=True)
    dataset = dataset.prefetch(100)
    return dataset
NUM_GPUS = 4
strategy = tf.contrib.distribute.MirroredStrategy(num_gpus=NUM_GPUS)
optimizer = tf.train.RMSPropOptimizer(learning_rate=0.01, use_locking=True)
optimizer_d = tf.train.RMSPropOptimizer(learning_rate=0.01, use_locking=True)
config = tf.estimator.RunConfig(save_checkpoints_steps=100,
                                save_summary_steps=1,
                                keep_checkpoint_max=50,
                                train_distribute=strategy)
# I have more hooks here, just simplified to show
def get_hooks_fn(GANTrainOps):
    disjoint_train_hook_func = tfgan.get_sequential_train_hooks(
        train_steps=tfgan.GANTrainSteps(10, 1)  # g steps, d steps
    )
    disjoint_train_hooks = disjoint_train_hook_func(GANTrainOps)
    return [update_hook, summary_hook] + disjoint_train_hooks
# Create GAN estimator.
gan_estimator = tfgan.estimator.GANEstimator(
    model_dir='/data/checkpoints/estimator_model',
    generator_fn=generator_fn,
    discriminator_fn=discriminator_fn,
    generator_loss_fn=generator_loss_fn,
    discriminator_loss_fn=discriminator_loss_fn,
    generator_optimizer=optimizer,
    discriminator_optimizer=optimizer_d,
    use_loss_summaries=True,
    config=config,
    get_hooks_fn=get_hooks_fn)
gan_estimator.train(input_fn=create_dataset, steps=10000)
Thanks!
The code of MirroredStrategy contains:
1) Weird wording:
The multi-worker version of this class maps one replica to one device on a worker. It mirrors all model variables on all the replicas. For example, if you have two workers and each worker has 4 GPUs, it will create 8 copies of the model variables on these 8 GPUs. Then like in MirroredStrategy(???), each replica performs their computation with their own copy of variables unless in cross-replica model where variable or tensor reduction happens.
2)
auto_shard_dataset: whether to auto-shard the dataset when there are multiple workers.
This parameter defaults to False. (A sketch of what manual sharding would look like follows right after this point.)
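To make my original sharding question concrete, here is a minimal sketch of mine (not code from TensorFlow; the source dataset, worker count and batch size are placeholders) of what manual per-worker sharding with dataset.shard() would look like, which is roughly what auto-sharding would do if it were enabled:

import tensorflow as tf

def create_sharded_dataset(num_workers, worker_index, batch_size):
    # Each worker keeps only every num_workers-th record of the source,
    # so different workers see different data.
    dataset = tf.data.Dataset.range(1000)  # placeholder source
    dataset = dataset.shard(num_workers, worker_index)
    dataset = dataset.repeat()
    dataset = dataset.shuffle(buffer_size=100)
    dataset = dataset.batch(batch_size, drop_remainder=True)
    return dataset.prefetch(100)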
EDIT:
So far I have found that tf.estimator.train() after a while points to what seems to be strategy.make_input_fn_iterator():
def _get_iterator_from_input_fn(self, input_fn, mode, distribution=None):
  if distribution is not None:
    iterator = distribution.make_input_fn_iterator(
        lambda _: self._call_input_fn(input_fn, mode))
    input_hooks = [
        estimator_util.DistributedIteratorInitializerHook(iterator)]
  else:
    result = self._call_input_fn(input_fn, mode)
    iterator = result.make_initializable_iterator()
    input_hooks = [estimator_util._DatasetInitializerHook(iterator)]
  return iterator, input_hooks
but make_input_fn_iterator() has been removed from the code of MirroredStrategy and is no longer there! I don't understand how it works and where the dataset is actually split.
EDIT 2: I could not find the line make_input_fn_iterator in my tensorflow 1.12.0 distribution using grep. It seems to be completely absent from the code.
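As a quick sanity check (my own snippet, assuming TF 1.12 with contrib available), one can list which input/dataset-related methods the installed MirroredStrategy actually exposes:

import tensorflow as tf

# Print the input/dataset-related attributes of the installed MirroredStrategy
# to see whether make_input_fn_iterator or distribute_dataset is present.
strategy_cls = tf.contrib.distribute.MirroredStrategy
print([name for name in dir(strategy_cls)
       if 'input' in name or 'dataset' in name])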
Answer 0 (score: 1)
Ok, after spending some time investigating github, I found that it is already different from my tf 1.12.0. So, going down into the local files of 1.12.0 I get:
GANEstimator inherits from tf.python.estimator.Estimator, and Estimator.__init__() contains:
# The distribute field contains an instance of DistributionStrategy.
self._train_distribution = self._config.train_distribute
Then the path down is:
tf.contrib.gan.GANEstimator -> tf.python.estimator.Estimator.train() -->
tf.python.estimator.Estimator._train_model(input_fn, hooks, saving_listeners) -->
._train_model_distributed(input_fn, hooks, saving_listeners) -->
._get_iterator_from_input_fn(input_fn, model_fn_lib.ModeKeys.TRAIN, self._train_distribution) -->
distribution.distribute_dataset(lambda: self._call_input_fn(input_fn, mode))
which in my case calls MirroredStrategy.distribute_dataset():
def distribute_dataset(self, dataset_fn):
  if self._cluster_spec:
    return values.MultiWorkerDataset(
        partial(self._call_dataset_fn, dataset_fn), self._worker_device_map,
        self._prefetch_on_device, self._auto_shard_dataset)
  else:
    return values.PerDeviceDataset(
        self._call_dataset_fn(dataset_fn), self._devices,
        self._prefetch_on_device)
tensorflow/python/training/distribute.py:
def _call_dataset_fn(self, dataset_fn):
  result = dataset_fn()
  if not isinstance(result, dataset_ops.Dataset):
    raise ValueError(
        "dataset_fn() must return a tf.data.Dataset when using a "
        "DistributionStrategy.")
  return result
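Judging from this check, the input_fn handed to the estimator has to return the tf.data.Dataset itself (not an iterator or a tuple of tensors). A minimal input_fn that satisfies this (my own illustration, shapes and sizes are placeholders):

import tensorflow as tf

def input_fn():
    # Must return a tf.data.Dataset, otherwise _call_dataset_fn raises
    # the ValueError shown above.
    dataset = tf.data.Dataset.from_tensor_slices(tf.zeros([1000, 28, 28, 1]))
    dataset = dataset.repeat().shuffle(100).batch(32, drop_remainder=True)
    return dataset.prefetch(100)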
I assume that PerDeviceDataset is used, so finally I found these two classes in values.py:
class PerDeviceDataset(object):
  """Like `tf.data.Dataset` split devices, producing `PerDevice` data."""

  def __init__(self, dataset, devices, prefetch_on_device=None):
    self._devices = devices

    # Default to using prefetching in graph mode, unless specified.
    # TODO(priyag): Enable prefetching in eager mode.
    self._prefetch_on_device = prefetch_on_device
    if self._prefetch_on_device is None:
      self._prefetch_on_device = not context.executing_eagerly()
    assert not (self._prefetch_on_device and context.executing_eagerly()), (
        "Prefetching is only supported in graph mode currently")

    if self._prefetch_on_device:
      self._dataset = dataset.apply(
          prefetching_ops_v2.prefetch_to_devices(self._devices))
    else:
      # TODO(priyag): If dropping remainder is not appropriate, find another
      # approach to distributing the dataset when not possible to divide
      # evenly. Possibly not an issue when we start using PartitionedDataset.
      self._dataset = dataset.batch(len(devices), drop_remainder=True)

  def make_one_shot_iterator(self):
    """Get a one time use iterator for the distributed PerDeviceDataset."""
    dataset_iterator = self._dataset.make_one_shot_iterator()
    return PerDeviceDataIterator(dataset_iterator, self._devices,
                                 self._prefetch_on_device)

  def make_initializable_iterator(self):
    """Get an initializable iterator for the distributed PerDeviceDataset."""
    dataset_iterator = self._dataset.make_initializable_iterator()
    return PerDeviceDataIterator(dataset_iterator, self._devices,
                                 self._prefetch_on_device)


class PerDeviceDataIterator(object):
  """An iterator (like `tf.data.Iterator`) into a `PerDeviceDataset`."""

  def __init__(self, iterator, devices, prefetch_on_device=None):
    self._iterator = iterator
    self._devices = devices
    self._prefetch_on_device = prefetch_on_device

  @property
  def initializer(self):
    return self._iterator.initializer

  def get_next(self, name=None):
    """Scatter the input across devices."""
    if self._prefetch_on_device:
      data_list = self._iterator.get_next(name=name)
      index = dict(zip(self._devices, data_list))
    else:
      batch = self._iterator.get_next(name=name)
      index = {}
      def get_ith(i):
        return lambda x: x[i]
      for i, d in enumerate(self._devices):
        index[d] = nest.map_structure(get_ith(i), batch)
        if context.executing_eagerly():
          with ops.device(d):
            index[d] = nest.map_structure(array_ops.identity, index[d])

    return regroup(index)
As far as I understand, first my dataset_fn() function is called to obtain the dataset object, and then a batch of size equal to the number of GPUs is applied on top of it. The elements of that batch, which must be the actual batches I defined in the dataset setup inside dataset_fn(), are assigned to the different devices.
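Here is a plain-Python sketch (my own toy illustration, device names and numbers are made up) of that two-level batching and per-device scatter as I understand it from PerDeviceDataset and PerDeviceDataIterator.get_next():

devices = ['/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3']
per_device_batch = 8            # what my dataset_fn() batches to
records = list(range(64))       # stand-in for the raw examples

# Step 1: the batching done inside dataset_fn()
per_device_batches = [records[i:i + per_device_batch]
                      for i in range(0, len(records), per_device_batch)]

# Step 2: the extra .batch(len(devices), drop_remainder=True) applied
# by PerDeviceDataset on top of my batches
super_batches = [per_device_batches[i:i + len(devices)]
                 for i in range(0, len(per_device_batches), len(devices))]

# Step 3: get_next() hands element i of the super-batch to device i
first_step = dict(zip(devices, super_batches[0]))
for device, batch in first_step.items():
    print(device, batch)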
Answer 1 (score: 0)
I will make a few clarifications in case it helps, though I am really not sure whether this is what you meant.
Does MirroredStrategy use only one worker?
Yes. MirroredStrategy is intended to work on a single worker only (a.k.a. one node, one machine, etc.).
I need to specify a batch size equal to the capacity of one tower
No. You need to specify the total batch size across towers, i.e. the per-tower batch size multiplied by the number of towers.
Note: FYI, a tower is a copy of the model; the number of towers equals the number of GPUs. Towers are also called replicas.
From this Keras tutorial, here is how to compute the batch size simply:
BATCH_SIZE_PER_REPLICA = 64
BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync
train_dataset = mnist_train.map(scale).cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
eval_dataset = mnist_test.map(scale).batch(BATCH_SIZE)
In this case, the batch size per GPU is 64, which is then multiplied by the number of GPUs. Why multiply by the number of GPUs? Because the loss and the gradients are computed against the total batch size, i.e. they are divided by the global batch size, not by the per-GPU batch size.
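In other words (a toy calculation of mine, reusing the numbers from the tutorial above):

BATCH_SIZE_PER_REPLICA = 64
NUM_REPLICAS = 4                                            # e.g. 4 GPUs
GLOBAL_BATCH_SIZE = BATCH_SIZE_PER_REPLICA * NUM_REPLICAS   # 256

# With synchronous training the loss/gradients are averaged over the
# global batch, i.e. effectively divided by GLOBAL_BATCH_SIZE, not by 64.
print(GLOBAL_BATCH_SIZE)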
- Weird wording:
This is comparing MirroredStrategy with the multi-worker strategy. In a cluster, your towers are replicated onto every worker (e.g. the 2 nodes in that example). Each worker is then responsible for distributing the model across its GPUs (e.g. 4 GPUs in that case). In that example you end up with 8 replicas of the model.
[...] Then like in MirroredStrategy(???), each replica performs their computation with their own copy of variables [...]
Whether you use multiple workers or a single worker, each GPU (or replica) computes on its own copy of the model independently and synchronizes afterwards. I guess they mention "copies of variables" because there is another distributed-computing topology with a parameter server (ps), where the ps gathers the weights of all replicas, sums them, and redistributes them to all replicas for the next round.
无论您使用多工作人员还是单个工作人员,每个GPU(或副本)都将独立计算其模型并随后进行同步。 我猜他们会提到“变量的副本”,因为存在另一种带有参数服务器(ps)的分布式计算拓扑,其中ps将收集所有副本的权重,求和,然后将其重新分配给所有副本,以进行下一轮。