Basic TPU cross-shard optimizer does not work

Asked: 2019-03-14 16:23:16

Tags: tensorflow tpu

In general, there are some good examples that use TF optimizers to solve general (non deep learning) problems. Given:

https://databricks.com/tensorflow/training-and-convergence
https://colab.research.google.com/notebooks/tpu.ipynb#scrollTo=a_rjVo-RAoYd

we would like to combine the two and make use of TPU-based optimization to solve a high-dimensional problem.

To that end, I have some simple colab code that merges the two examples above:

import tensorflow as tf
import numpy as np
from tensorflow.contrib.tpu.python.tpu import tpu_function
import os
import pprint

if 'COLAB_TPU_ADDR' not in os.environ:
  print('ERROR: Not connected to a TPU runtime; please see the first cell in this notebook for instructions!')
else:
  tpu_address = 'grpc://' + os.environ['COLAB_TPU_ADDR']
  print ('TPU address is', tpu_address)

  with tf.Session(tpu_address) as session:
    devices = session.list_devices()

  print('TPU devices:')
  pprint.pprint(devices)

# Add this somewhere at the top
tpu_function.get_tpu_context().set_number_of_shards(8)

# x and y are placeholders for our training data
x = tf.placeholder("float")
y = tf.placeholder("float")
# w is the variable storing our values. It is initialised with starting "guesses"
# w[0] is the "a" in our equation, w[1] is the "b"
w = tf.Variable([1.0, 2.0,3.0, 4.0], name="w")
# Our model of y = a*x + b
y_model = tf.multiply(x, w[0]) + w[1] + w[2] +3

# Our error is defined as the square of the differences
error = tf.square(y - y_model)
# The Gradient Descent Optimizer does the heavy lifting
train_op = tf.train.AdamOptimizer(0.01)
optimizer = tf.contrib.tpu.CrossShardOptimizer(train_op).minimize(error) # TPU change 1

# Normal TensorFlow - initialize values, create a session and run the model
model = tf.global_variables_initializer()

with tf.Session(tpu_address) as session:
    session.run(tf.contrib.tpu.initialize_system())
    print('init')
    session.run(model)
    for i in range(10000):
        print(i)
        x_value = np.random.rand()
        y_value = x_value * 2 + 6 + 5 + 3
        session.run(optimizer, feed_dict={x: x_value, y: y_value})

    w_value = session.run(w)
    print("Predicted model: {a:.3f}x + {b:.3f}+{c:.3f}x + {d:.3f}".format(a=w_value[0], b=w_value[1], c=w_value[2], d=w_value[3]))
    session.run(tf.contrib.tpu.shutdown_system())

When I run this in colab, it only gets through the first iteration of the loop, printing:

init
0

and then does nothing more; colab just keeps spinning.

If I do not use

optimizer = tf.contrib.tpu.CrossShardOptimizer(train_op).minimize(error) 

and the other TPU-specific pieces, then the w variable is estimated just fine (a minimal sketch of that non-TPU variant is shown below).
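
For reference, this is roughly the non-TPU variant that converges (a minimal sketch of the same graph with the TPU-specific lines removed, run in a local session):

import tensorflow as tf
import numpy as np

x = tf.placeholder("float")
y = tf.placeholder("float")
w = tf.Variable([1.0, 2.0, 3.0, 4.0], name="w")
y_model = tf.multiply(x, w[0]) + w[1] + w[2] + 3
error = tf.square(y - y_model)

# Plain Adam, no CrossShardOptimizer wrapper and no TPU session
optimizer = tf.train.AdamOptimizer(0.01).minimize(error)

with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    for i in range(10000):
        x_value = np.random.rand()
        y_value = x_value * 2 + 6 + 5 + 3
        session.run(optimizer, feed_dict={x: x_value, y: y_value})
    print(session.run(w))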

The questions are:

  1. Why doesn't this work, and how can we get the cross-shard replicator to optimize this simple function?
  2. How should I shape the variable w so that it makes use of parallel batches/shards on the TPU?
  3. How can we make this more efficient by using an equivalent dataset prefetch operation or infeed queues?

The goal is to use the lower-level TPU API without TPUEstimator, for example to help solve custom problems by leveraging the power of a TPU using just tensors, queues, and shards.

1 answer:

Answer 0: (score: 1)

  1. It doesn't work because you are overriding the number of shards without actually splitting the computation across the shards. When I run your code, I get the following error:

    InternalError: From /job:tpu_worker/replica:0/task:0:
    RET_CHECK failure (platforms/xla/service/jellyfish/lowering/all_reduce_emitter.cc:832) replica_id < target.ReplicaCount() Unexpected replica id in all-reduce, replica_id is 1, target has 1 replicas.
    
    
    Error encountered while compiling %all-reduce.7 = f32[4]{0:T(256)} all-reduce(f32[4]{0:T(256)} %arg0.1), replica_groups={{0,1,2,3,4,5,6,7}}, to_apply=%sum.3, metadata={op_type="CrossReplicaSum" op_name="CrossReplicaSum_21"}, backend_config="{barrier_type:3}".
    

    It is trying to perform the computation on 8 shards and combine the results, but it only has one shard to work with. Take a look at tf.contrib.tpu.shard. It creates a shard context using the given number of shards and distributes the computation across them. So, instead of setting the number of shards manually, you can define your variables as usual and then wrap all of your computation in a function to be sharded:

    # REMOVE THIS
    # tpu_function.get_tpu_context().set_number_of_shards(8)
    
    # x and y are placeholders for our training data
    x_placeholder = tf.placeholder("float")
    y_placeholder = tf.placeholder("float")
    
    # w is the variable storing our values. It is initialised with starting "guesses"
    # w[0] is the "a" in our equation, w[1] is the "b"
    w = tf.Variable([1.0, 2.0,3.0, 4.0], name="w")
    
    # Wrap all of our tensorflow operations in a function we can shard
    def calculations(x, y):
      # Our model of y = a*x + b
      y_model = tf.multiply(x, w[0]) + w[1] + w[2] +3
    
      # Our error is defined as the square of the differences
      # Average across the entire batch
      error = tf.reduce_mean(tf.square(y - y_model))
      # The Gradient Descent Optimizer does the heavy lifting
      train_op = tf.train.AdamOptimizer(0.01)
    
      return tf.contrib.tpu.CrossShardOptimizer(train_op).minimize(error)
    
    # Shard the function so that its calculation is distributed
    optimizer = tf.contrib.tpu.shard(calculations, inputs=[x_placeholder, y_placeholder], num_shards=8)
    
  2. You do not need to shape w to use shards, because sharding happens over the batch dimension and you only have one set of weights for all of the inputs. You will want to add a batch dimension to your inputs so that each batch can be distributed across the cores. shard assumes that the first dimension is the batch dimension, but it includes an argument you can change if your data is shaped differently. According to the TPU troubleshooting page, the ideal batch size is 1024 so that there are 128 samples per TPU core. If that is too big for your model, you can go smaller, as long as it is a multiple of 128. Check out the link above and the performance guide for more tips on improving performance.

    for i in range(1000):
        print(i)
        x_value = np.random.rand(1024) # Generate a batch of 1024 values
        y_value = x_value * 2 + 6 + 5 + 3
        session.run(optimizer, feed_dict={x_placeholder: x_value, y_placeholder: y_value})
    

    Everything else should remain the same. I was able to train the model for all 10000 iterations. Keep in mind that for this simple model it will probably be slower than using CPU/GPU, but you should expect performance improvements for more complex problems with larger datasets.

  3. I am not familiar enough with datasets or infeed queues to comment on this, but shard includes an argument for infeed queues, so it likely has support for them. You may have to play around with it to see how it gets data to the computation function. In the meantime, a rough sketch of a tf.data-based alternative is shown below.
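
    As a starting point, here is a rough, untested sketch of the tf.data route (my own guess, reusing the calculations function, w, and tpu_address defined above; whether it matches the performance of the native infeed path is something you would need to measure):

    # Untested sketch: feed batches through tf.data with prefetch instead of feed_dict.
    # Assumes the imports, calculations(x, y), w, and tpu_address from the code above.
    batch_size = 1024  # 128 samples per core across 8 cores

    # Synthetic data for the toy y = 2*x + 14 problem
    x_data = np.random.rand(102400).astype(np.float32)
    y_data = x_data * 2 + 6 + 5 + 3

    dataset = (tf.data.Dataset.from_tensor_slices((x_data, y_data))
               .repeat()
               .batch(batch_size, drop_remainder=True)  # static batch shape for the TPU
               .prefetch(2))                            # keep a couple of batches ready
    x_batch, y_batch = dataset.make_one_shot_iterator().get_next()

    # shard still splits the batch dimension of its inputs across the cores
    optimizer = tf.contrib.tpu.shard(calculations,
                                     inputs=[x_batch, y_batch],
                                     num_shards=8)

    with tf.Session(tpu_address) as session:
        session.run(tf.contrib.tpu.initialize_system())
        session.run(tf.global_variables_initializer())
        for i in range(1000):
            session.run(optimizer)  # no feed_dict needed
        print(session.run(w))
        session.run(tf.contrib.tpu.shutdown_system())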