Basic TPU cross-shard optimizer does not work

Asked: 2019-03-14 16:23:16

Tags: tensorflow tpu

In general, there are some good examples that use TF optimizers to solve general (non deep learning) problems. Given:

https://databricks.com/tensorflow/training-and-convergence
https://colab.research.google.com/notebooks/tpu.ipynb#scrollTo=a_rjVo-RAoYd

we would like to combine the two and make use of TPU-based optimization to solve a high-dimensional problem.

To that end, I have some simple colab code that merges the two examples above:

import tensorflow as tf
import numpy as np
from tensorflow.contrib.tpu.python.tpu import tpu_function
import os
import pprint

if 'COLAB_TPU_ADDR' not in os.environ:
  print('ERROR: Not connected to a TPU runtime; please see the first cell in this notebook for instructions!')
else:
  tpu_address = 'grpc://' + os.environ['COLAB_TPU_ADDR']
  print ('TPU address is', tpu_address)

  with tf.Session(tpu_address) as session:
    devices = session.list_devices()

  print('TPU devices:')
  pprint.pprint(devices)

# Add this somewhere at the top
tpu_function.get_tpu_context().set_number_of_shards(8)

# x and y are placeholders for our training data
x = tf.placeholder("float")
y = tf.placeholder("float")
# w is the variable storing our values. It is initialised with starting "guesses"
# w[0] is the "a" in our equation, w[1] is the "b"
w = tf.Variable([1.0, 2.0,3.0, 4.0], name="w")
# Our model of y = a*x + b
y_model = tf.multiply(x, w[0]) + w[1] + w[2] +3

# Our error is defined as the square of the differences
error = tf.square(y - y_model)
# The Gradient Descent Optimizer does the heavy lifting
train_op = tf.train.AdamOptimizer(0.01)
optimizer = tf.contrib.tpu.CrossShardOptimizer(train_op).minimize(error) # TPU change 1

# Normal TensorFlow - initialize values, create a session and run the model
model = tf.global_variables_initializer()

with tf.Session(tpu_address) as session:
    session.run(tf.contrib.tpu.initialize_system())
    print('init')
    session.run(model)
    for i in range(10000):
        print(i)
        x_value = np.random.rand()
        y_value = x_value * 2 + 6 + 5 + 3
        session.run(optimizer, feed_dict={x: x_value, y: y_value})

    w_value = session.run(w)
    print("Predicted model: {a:.3f}x + {b:.3f}+{c:.3f}x + {d:.3f}".format(a=w_value[0], b=w_value[1], c=w_value[2], d=w_value[3]))
    session.run(tf.contrib.tpu.shutdown_system())

When I run this in colab, it only gets through the first iteration of the loop, printing:

init
0

and then does nothing more; colab just keeps spinning.

If I do not use

optimizer = tf.contrib.tpu.CrossShardOptimizer(train_op).minimize(error) 

and the other TPU-specific pieces, then the w variable is estimated just fine (a minimal sketch of that non-TPU variant is shown below).
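
For reference, this is roughly the non-TPU variant that converges (a minimal sketch of the same graph with the TPU-specific lines removed, run in a local session):

import tensorflow as tf
import numpy as np

x = tf.placeholder("float")
y = tf.placeholder("float")
w = tf.Variable([1.0, 2.0, 3.0, 4.0], name="w")
y_model = tf.multiply(x, w[0]) + w[1] + w[2] + 3
error = tf.square(y - y_model)

# Plain Adam, no CrossShardOptimizer wrapper and no TPU session
optimizer = tf.train.AdamOptimizer(0.01).minimize(error)

with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    for i in range(10000):
        x_value = np.random.rand()
        y_value = x_value * 2 + 6 + 5 + 3
        session.run(optimizer, feed_dict={x: x_value, y: y_value})
    print(session.run(w))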

The questions are:

  1. Why doesn't this work, and how can we get the cross-shard replicator to optimize this simple function?
  2. How should I shape the variable w so that it makes use of parallel batches/shards on the TPU?
  3. How can we make this more efficient by using an equivalent dataset prefetch operation or infeed queues?

The goal is to use the lower-level TPU API without TPUEstimator, for example to help solve custom problems by leveraging the power of a TPU using just tensors, queues, and shards.

1 answer:

Answer 0: (score: 1)

  1. It doesn't work because you are overriding the number of shards without actually splitting the computation across the shards. When I run your code, I get the following error:

    InternalError: From /job:tpu_worker/replica:0/task:0:
    RET_CHECK failure (platforms/xla/service/jellyfish/lowering/all_reduce_emitter.cc:832) replica_id < target.ReplicaCount() Unexpected replica id in all-reduce, replica_id is 1, target has 1 replicas.
    
    
    Error encountered while compiling %all-reduce.7 = f32[4]{0:T(256)} all-reduce(f32[4]{0:T(256)} %arg0.1), replica_groups={{0,1,2,3,4,5,6,7}}, to_apply=%sum.3, metadata={op_type="CrossReplicaSum" op_name="CrossReplicaSum_21"}, backend_config="{barrier_type:3}".
    

    It is trying to perform the computation on 8 shards and combine the results, but it only has one shard to work with. Take a look at tf.contrib.tpu.shard. It creates a shard context using the given number of shards and distributes the computation across them. So, instead of setting the number of shards manually, you can define your variables as usual and then wrap all of your computation in a function to be sharded:

    # REMOVE THIS
    # tpu_function.get_tpu_context().set_number_of_shards(8)
    
    # x and y are placeholders for our training data
    x_placeholder = tf.placeholder("float")
    y_placeholder = tf.placeholder("float")
    
    # w is the variable storing our values. It is initialised with starting "guesses"
    # w[0] is the "a" in our equation, w[1] is the "b"
    w = tf.Variable([1.0, 2.0,3.0, 4.0], name="w")
    
    # Wrap all of our tensorflow operations in a function we can shard
    def calculations(x, y):
      # Our model of y = a*x + b
      y_model = tf.multiply(x, w[0]) + w[1] + w[2] +3
    
      # Our error is defined as the square of the differences
      # Average across the entire batch
      error = tf.reduce_mean(tf.square(y - y_model))
      # The Gradient Descent Optimizer does the heavy lifting
      train_op = tf.train.AdamOptimizer(0.01)
    
      return tf.contrib.tpu.CrossShardOptimizer(train_op).minimize(error)
    
    # Shard the function so that its calculation is distributed
    optimizer = tf.contrib.tpu.shard(calculations, inputs=[x_placeholder, y_placeholder], num_shards=8)
    
  2. You do not need to shape w to use shards, because sharding happens over the batch dimension and you only have one set of weights for all of the inputs. You will want to add a batch dimension to your inputs so that each batch can be distributed across the cores. shard assumes that the first dimension is the batch dimension, but it includes an argument you can change if your data is shaped differently. According to the TPU troubleshooting page, the ideal batch size is 1024 so that there are 128 samples per TPU core. If that is too big for your model, you can go smaller, as long as it is a multiple of 128. Check out the link above and the performance guide for more tips on improving performance.

    for i in range(1000):
        print(i)
        x_value = np.random.rand(1024) # Generate a batch of 1024 values
        y_value = x_value * 2 + 6 + 5 + 3
        session.run(optimizer, feed_dict={x_placeholder: x_value, y_placeholder: y_value})
    

    Everything else should remain the same. I was able to train the model for all 10000 iterations. Keep in mind that for this simple model it will probably be slower than using CPU/GPU, but you should expect performance improvements for more complex problems with larger datasets.

  3. I am not familiar enough with datasets or infeed queues to comment on this, but shard includes an argument for infeed queues, so it likely has support for them. You may have to play around with it to see how it gets data to the computation function. In the meantime, a rough sketch of a tf.data-based alternative is shown below.
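
    As a starting point, here is a rough, untested sketch of the tf.data route (my own guess, reusing the calculations function, w, and tpu_address defined above; whether it matches the performance of the native infeed path is something you would need to measure):

    # Untested sketch: feed batches through tf.data with prefetch instead of feed_dict.
    # Assumes the imports, calculations(x, y), w, and tpu_address from the code above.
    batch_size = 1024  # 128 samples per core across 8 cores

    # Synthetic data for the toy y = 2*x + 14 problem
    x_data = np.random.rand(102400).astype(np.float32)
    y_data = x_data * 2 + 6 + 5 + 3

    dataset = (tf.data.Dataset.from_tensor_slices((x_data, y_data))
               .repeat()
               .batch(batch_size, drop_remainder=True)  # static batch shape for the TPU
               .prefetch(2))                            # keep a couple of batches ready
    x_batch, y_batch = dataset.make_one_shot_iterator().get_next()

    # shard still splits the batch dimension of its inputs across the cores
    optimizer = tf.contrib.tpu.shard(calculations,
                                     inputs=[x_batch, y_batch],
                                     num_shards=8)

    with tf.Session(tpu_address) as session:
        session.run(tf.contrib.tpu.initialize_system())
        session.run(tf.global_variables_initializer())
        for i in range(1000):
            session.run(optimizer)  # no feed_dict needed
        print(session.run(w))
        session.run(tf.contrib.tpu.shutdown_system())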