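There are generally some good examples of using TF optimizers to solve general (non-deep-learning) problems. Given:
https://databricks.com/tensorflow/training-and-convergence
https://colab.research.google.com/notebooks/tpu.ipynb#scrollTo=a_rjVo-RAoYd
we would like to combine the two and use TPU-based optimization to solve high-dimensional problems.
To that end, I have a simple Colab notebook that merges the two examples above: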
    import os
    import pprint

    import numpy as np
    import tensorflow as tf
    from tensorflow.contrib.tpu.python.tpu import tpu_function

    if 'COLAB_TPU_ADDR' not in os.environ:
        print('ERROR: Not connected to a TPU runtime; please see the first cell in this notebook for instructions!')
    else:
        tpu_address = 'grpc://' + os.environ['COLAB_TPU_ADDR']
        print('TPU address is', tpu_address)

        with tf.Session(tpu_address) as session:
            devices = session.list_devices()

        print('TPU devices:')
        pprint.pprint(devices)

    # Add this somewhere at the top
    tpu_function.get_tpu_context().set_number_of_shards(8)

    # x and y are placeholders for our training data
    x = tf.placeholder("float")
    y = tf.placeholder("float")

    # w is the variable storing our values. It is initialised with starting "guesses"
    # w[0] is the "a" in our equation, w[1] is the "b"
    w = tf.Variable([1.0, 2.0, 3.0, 4.0], name="w")

    # Our model of y = a*x + b
    y_model = tf.multiply(x, w[0]) + w[1] + w[2] + 3

    # Our error is defined as the square of the differences
    error = tf.square(y - y_model)

    # The optimizer (Adam here) does the heavy lifting
    train_op = tf.train.AdamOptimizer(0.01)
    optimizer = tf.contrib.tpu.CrossShardOptimizer(train_op).minimize(error)  # TPU change 1

    # Normal TensorFlow - initialize values, create a session and run the model
    model = tf.global_variables_initializer()

    with tf.Session(tpu_address) as session:
        session.run(tf.contrib.tpu.initialize_system())
        print('init')
        session.run(model)
        for i in range(10000):
            print(i)
            x_value = np.random.rand()
            y_value = x_value * 2 + 6 + 5 + 3
            session.run(optimizer, feed_dict={x: x_value, y: y_value})

        w_value = session.run(w)
        print("Predicted model: {a:.3f}x + {b:.3f}+{c:.3f}x + {d:.3f}".format(a=w_value[0], b=w_value[1], c=w_value[2], d=w_value[3]))
        session.run(tf.contrib.tpu.shutdown_system())
When I run it in Colab, it only runs the first loop iteration, printing:
    init
    0
and then does nothing; Colab just keeps spinning.
If I don't use
    optimizer = tf.contrib.tpu.CrossShardOptimizer(train_op).minimize(error)
and the other TPU functions, it estimates the w variable just fine.
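The questions are:
How should w be shaped to use parallel batches/shards on the TPU?
How can the prefetch operation or infeed queues be used to make this more efficient?
The goal is to use the lower-level TPU API, for example without TPUEstimator, to help solve custom problems by leveraging the power of the TPU with just tensors, queues and shards.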
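Answer 0 (score: 1)
It doesn't work because you are overriding the number of shards without actually splitting the computation into shards. When I run your code, I get the following error: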
InternalError: From /job:tpu_worker/replica:0/task:0:
RET_CHECK failure (platforms/xla/service/jellyfish/lowering/all_reduce_emitter.cc:832) replica_id < target.ReplicaCount() Unexpected replica id in all-reduce, replica_id is 1, target has 1 replicas.
Error encountered while compiling %all-reduce.7 = f32[4]{0:T(256)} all-reduce(f32[4]{0:T(256)} %arg0.1), replica_groups={{0,1,2,3,4,5,6,7}}, to_apply=%sum.3, metadata={op_type="CrossReplicaSum" op_name="CrossReplicaSum_21"}, backend_config="{barrier_type:3}".
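It is trying to run the computation on 8 shards and combine the results, but only one shard is available. Take a look at tf.contrib.tpu.shard. It creates a sharding context with the given number of shards and distributes the computation across them. So instead of setting the number of shards manually, you can define your variables as usual and then wrap all of the computation in a function to be sharded: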
    # REMOVE THIS
    # tpu_function.get_tpu_context().set_number_of_shards(8)

    # x and y are placeholders for our training data
    x_placeholder = tf.placeholder("float")
    y_placeholder = tf.placeholder("float")

    # w is the variable storing our values. It is initialised with starting "guesses"
    # w[0] is the "a" in our equation, w[1] is the "b"
    w = tf.Variable([1.0, 2.0, 3.0, 4.0], name="w")

    # Wrap all of our tensorflow operations in a function we can shard
    def calculations(x, y):
        # Our model of y = a*x + b
        y_model = tf.multiply(x, w[0]) + w[1] + w[2] + 3

        # Our error is defined as the square of the differences
        # Average across the entire batch
        error = tf.reduce_mean(tf.square(y - y_model))

        # The optimizer (Adam here) does the heavy lifting
        train_op = tf.train.AdamOptimizer(0.01)
        return tf.contrib.tpu.CrossShardOptimizer(train_op).minimize(error)

    # Shard the function so that its calculation is distributed
    optimizer = tf.contrib.tpu.shard(calculations, inputs=[x_placeholder, y_placeholder], num_shards=8)
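You don't need to shape w to use sharding, because sharding happens along the batch dimension and all inputs share a single set of weights. You do need to add a batch dimension to the inputs so that each batch can be distributed across the cores. shard assumes the first dimension is the batch dimension, but it has a parameter to change that if your data is shaped differently. According to the TPU troubleshooting page, the ideal batch size is 1024, which gives each TPU core 128 samples. If that is too large for your model, you can reduce it, as long as it remains a multiple of 128. See the link above and the performance guide for more tips on improving performance.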
    for i in range(1000):
        print(i)
        x_value = np.random.rand(1024)  # Generate a batch of 1024 values
        y_value = x_value * 2 + 6 + 5 + 3
        session.run(optimizer, feed_dict={x_placeholder: x_value, y_placeholder: y_value})
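Everything else should stay the same. I was able to train the model for all 10000 iterations. Keep in mind that for a model this simple it will probably be slower than a CPU/GPU, but you should expect a performance gain on more complex problems with larger datasets.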
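To illustrate why w needs no reshaping, here is a NumPy-only sketch of sharding along the batch dimension. It is not the TPU API: np.split stands in for what shard does to the inputs, and the averaging step is my assumption about how the cross-shard reduction combines gradients (I believe averaging is the CrossShardOptimizer default, but I have not verified it for every version). The names grad, NUM_SHARDS and BATCH_SIZE are made up for the example.

```python
import numpy as np

NUM_SHARDS = 8     # one shard per TPU core, as in the answer above
BATCH_SIZE = 1024  # 1024 / 8 = 128 samples per shard


def grad(w, x, y):
    """Gradient of mean((y - y_model)**2) with respect to w."""
    residual = y - (x * w[0] + w[1] + w[2] + 3.0)
    d_w0 = -2.0 * np.mean(residual * x)
    d_bias = -2.0 * np.mean(residual)
    # w[1] and w[2] enter the model identically; w[3] is not used at all.
    return np.array([d_w0, d_bias, d_bias, 0.0])


rng = np.random.default_rng(0)
w = np.array([1.0, 2.0, 3.0, 4.0])  # a single set of weights shared by all shards
x = rng.random(BATCH_SIZE)
y = x * 2 + 6 + 5 + 3

# Split the inputs along the first (batch) dimension only.
x_shards = np.split(x, NUM_SHARDS)
y_shards = np.split(y, NUM_SHARDS)
shard_grads = [grad(w, xs, ys) for xs, ys in zip(x_shards, y_shards)]

# ASSUMPTION: the cross-shard reduction averages the per-shard gradients.
combined = np.mean(shard_grads, axis=0)
full_batch = grad(w, x, y)

print("per-shard batch shape:", x_shards[0].shape)
print("sharded gradient matches full batch:", np.allclose(combined, full_batch))
```

Because the shards are equal in size, the average of the per-shard gradients is the same as the gradient over the whole batch, which is why only the inputs need a batch dimension and w keeps its shape.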
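I am not familiar enough with datasets or infeed queues to comment on them, but shard includes a parameter for an infeed queue, so it most likely supports them. You may need to experiment a bit to work out how to get the data into the computation function.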
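If you want to check what values a run should converge to independently of the TPU, a CPU-only NumPy sketch like the one below may help. It is not the code above: it swaps AdamOptimizer for plain full-batch gradient descent, reuses one fixed batch instead of drawing a new one each step, and the function name fit and its step count and learning rate are my own choices for the example.

```python
import numpy as np


def fit(steps=2000, learning_rate=0.2, batch_size=1024, seed=0):
    """Fit y_model = x*w[0] + w[1] + w[2] + 3 by full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    x = rng.random(batch_size)
    y = x * 2 + 6 + 5 + 3  # same target as in the question: y = 2x + 14
    w = np.array([1.0, 2.0, 3.0, 4.0])
    for _ in range(steps):
        residual = y - (x * w[0] + w[1] + w[2] + 3.0)
        d_w0 = -2.0 * np.mean(residual * x)
        d_bias = -2.0 * np.mean(residual)
        # w[1] and w[2] get identical gradients; w[3] is unused and never moves.
        w -= learning_rate * np.array([d_w0, d_bias, d_bias, 0.0])
    return w


w = fit()
print("slope w[0]:", round(w[0], 3))
print("w[1] + w[2]:", round(w[1] + w[2], 3))
print("w[3] (unused):", w[3])
```

With this model only the slope w[0] and the sum w[1] + w[2] are determined by the data (they should approach 2 and 11, since 14 minus the constant 3 is 11). w[3] never appears in y_model, so it stays at its initial guess regardless of the optimizer.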