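I am new to Google TPUs. From what I have researched so far, they are optimized specifically for training machine learning models written in TensorFlow.
Currently, I am trying to see how the TPU performs on other kinds of functions that are unrelated to machine learning.
I have been trying to adapt my code so that it runs on the TPU in Google Colab, but I am not sure whether it is working or whether this is the best approach.
Here is the code for an O(n^3) matrix multiplication algorithm: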
import os
import numpy as np
from random import seed
from random import random
import tensorflow as tf
import time

#check that this is running on the TPU
try:
    tpu = tf.contrib.cluster_resolver.TPUClusterResolver() # TPU detection
    print('Running on TPU ', tpu.cluster_spec().as_dict()['worker'])
except ValueError:
    print("Running on GPU or CPU")
    tpu = None

#TPU details
if 'COLAB_TPU_ADDR' not in os.environ:
    print('ERROR: Not connected to a TPU runtime; please see the first cell in this notebook for instructions!')
else:
    tpu_address = 'grpc://' + os.environ['COLAB_TPU_ADDR']
    print ('TPU address is', tpu_address)

def multiplicationComputation():
    #size of matrix
    row_size = 128
    col_size = 128
    N = row_size*col_size

    #class for matrix
    class MatrixMultiplication:
        matrix1 = np.empty(N) #DO NOT USE np.arange(N)
        matrix2 = np.empty(N)
        product = np.empty(N) #product size is the matrix1.columns x matrix2.rows

    #create MatrixMultiplication object
    m = MatrixMultiplication()

    #fill objects's data structures
    #seed for matrix 1
    seed(1)
    for x in range(N):
        value = random()
        m.matrix1[x] = value

    #seed for matrix 2
    seed(7)
    for x in range(N):
        value = random()
        m.matrix2[x] = value

    #multiply matrix1 and matrix2
    start = time.time()
    qtySaves = 0
    for i in range(row_size):
        for j in range(col_size):
            i_col = i * col_size
            sum = 0
            for k in range(row_size):
                k_col = k * col_size
                multiplication = m.matrix1[i_col + k] * m.matrix2[k_col + j]
                sum = sum + multiplication
            m.product[i_col + j] = sum #The result of the multiplication is saved on the product matrix
            qtySaves = qtySaves + 1
    end = time.time()

    #print result
    print()
    print("Result O(n^3): ")
    for i in range(N):
        if i % row_size == 0 and i > 0:
            print()
        print(str(m.product[i]), end =" ")
    print()
    print("For n = " + str(N) + ", time is " + str(end - start))

#rewrite computation so it can be executed on the TPU
#tpuOperation = tf.contrib.tpu.rewrite(multiplicationComputation)
tpuOperation = tf.contrib.tpu.batch_parallel(multiplicationComputation, [], num_shards=8)

#run
session = tf.Session(tpu_address, config=tf.ConfigProto(isolate_session_state=True, log_device_placement=True)) #isolate session state = True for distributed runtime
try:
    session.run(tf.contrib.tpu.initialize_system()) #initializes a distributed TPU system
    session.run(tpuOperation)
finally:
    #TPU sessions must be shutdown separately from closing the session
    session.run(tf.contrib.tpu.shutdown_system())
    session.close()
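I am worried that this is not running on the TPU. When I call session.list_devices(), I see a CPU listed, and I am afraid my code may actually be running on the CPU rather than the TPU. Here is the output of that command: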
TPU devices:
[_DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:CPU:0, CPU, -1, 10448234186946304259),
_DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 2088593175391423031),
_DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:0, TPU, 17179869184, 1681908406791603718),
_DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:1, TPU, 17179869184, 2618396797726491975),
_DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:2, TPU, 17179869184, 14243051360425930068),
_DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:3, TPU, 17179869184, 15491507241115490455),
_DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:4, TPU, 17179869184, 9239156557030772892),
_DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:5, TPU, 17179869184, 16970377907446102335),
_DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:6, TPU, 17179869184, 6145936732121669294),
_DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:7, TPU, 17179869184, 11372860691871753999),
_DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 17179869184, 12653526146081894211)]
For now, I am not looking for advice on which accelerator to use. I want to test the TPU and make sure my code is running on it. Please help!
Answer 0 (score: 0)
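I'm afraid the presence or absence of TensorFlow has no effect on how np operations are executed.
In the example above, when you specify
tpuOperation = tf.contrib.tpu.batch_parallel(multiplicationComputation, [], num_shards=8)
where multiplicationComputation contains no TPU-specific code to parallelize, it runs exactly as it would if you called multiplicationComputation normally on the CPU.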
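You will have to rewrite your code using TF operations so that it can run on the TPU. TensorFlow will then translate your operations into TPU-specific code.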
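As a rough illustration of what "rewriting with TF operations" could look like, here is a minimal sketch that replaces the triple loop with a single tf.matmul call. It assumes TensorFlow 2.x with eager execution, not the tf.contrib API used in the question, and the helper names make_flat_matrix and matmul_flat are made up for this example. It does not set up a TPU connection; it only shows the computation expressed as TensorFlow ops on tensors, which is the form TensorFlow can place on an accelerator.

```python
import random

import numpy as np
import tensorflow as tf


def make_flat_matrix(n, seed_value):
    """Fill a flat, row-major n*n array with seeded random values,
    mirroring how the question fills matrix1 and matrix2."""
    random.seed(seed_value)
    return np.array([random.random() for _ in range(n * n)], dtype=np.float32)


def matmul_flat(a_flat, b_flat, n):
    """Multiply two flat, row-major n x n matrices using TensorFlow ops.

    The reshape and the matmul are TensorFlow ops, unlike the
    Python loops over NumPy arrays in the question.
    """
    a = tf.reshape(tf.constant(a_flat), (n, n))
    b = tf.reshape(tf.constant(b_flat), (n, n))
    product = tf.matmul(a, b)
    return tf.reshape(product, (-1,))  # back to the flat layout


if __name__ == "__main__":
    size = 128  # same row_size / col_size as in the question
    m1 = make_flat_matrix(size, 1)
    m2 = make_flat_matrix(size, 7)
    result = matmul_flat(m1, m2, size)
    print("product has", int(result.shape[0]), "elements")
```

The flat layout is kept only so the result can be compared element by element with the product array from the question; in practice you would probably work with 2-D tensors directly.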
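Answer 1 (score: 0)
If you want an easy way to compare the TPU against other hardware, I suggest using the Estimator API.
TPUs are optimized for fitting and running inference on ML models, so they can perform matrix multiplication quickly, but any code that tries to evaluate this with nested Python loops is unlikely to give you a good sense of the chip's performance.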
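To make the point about loops concrete, here is a small sketch that times the same multiplication twice: once as nested Python loops and once as a single library call. It uses NumPy on the CPU as a stand-in for "one big op", so it says nothing about actual TPU speed; the helper names loop_matmul and best_time are invented for this example. The idea is only that the loop version mostly measures interpreter overhead, whereas a benchmark of any accelerator should time whole operations.

```python
import time

import numpy as np


def loop_matmul(a, b):
    """Naive triple loop over 2-D arrays, in the spirit of the question."""
    n, m = a.shape
    p = b.shape[1]
    out = np.zeros((n, p))
    for i in range(n):
        for j in range(p):
            total = 0.0
            for k in range(m):
                total += a[i, k] * b[k, j]
            out[i, j] = total
    return out


def best_time(fn, *args, repeats=3):
    """Run fn(*args) `repeats` times; return (best seconds, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args)
        best = min(best, time.perf_counter() - start)
    return best, result


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.random((64, 64))
    b = rng.random((64, 64))
    loop_seconds, loop_result = best_time(loop_matmul, a, b)
    op_seconds, op_result = best_time(np.matmul, a, b)
    print("nested loops:", loop_seconds, "s")
    print("single op:   ", op_seconds, "s")
    print("results agree:", np.allclose(loop_result, op_result))
```

Taking the best of several runs is a simple way to reduce noise from one-off warm-up costs; a real accelerator benchmark would also need to account for compilation and data-transfer time.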