As far as I understand, TF will run multiple operators in parallel as long as they are independent. If the operators run on the CPU, inter_op_parallelism_threads
and intra_op_parallelism_threads
control the degree of parallelism. However, these parameters have no effect on GPU operators at all. How can I control parallelism on the GPU? (For example, run independent operators serially.)
Edit:
import tensorflow as tf  # TF 1.x API

N = 1024  # N was not defined in the original snippet; any size works

a = tf.random_normal([N, N])
b = tf.random_normal([N, N])
c = tf.random_normal([N, N])
d = tf.random_normal([N, N])

x = tf.matmul(a, b)
y = tf.matmul(c, d)
z = tf.matmul(x, y)
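One way to get the serial behavior asked about above is tf.control_dependencies, which makes an op wait for another op even when the two are data-independent. A minimal sketch (TF 1.x graph mode; the names a..z mirror the snippet above, and N = 1024 is an arbitrary choice):

```python
import tensorflow.compat.v1 as tf  # TF 1.x API, via the compat shim
tf.disable_eager_execution()

N = 1024  # assumption: any matrix size works

a = tf.random_normal([N, N])
b = tf.random_normal([N, N])
c = tf.random_normal([N, N])
d = tf.random_normal([N, N])

x = tf.matmul(a, b)
# Force y to start only after x has finished, even though the two
# matmuls are independent and would otherwise be free to overlap.
with tf.control_dependencies([x]):
    y = tf.matmul(c, d)
z = tf.matmul(x, y)
```

This does not throttle intra-op parallelism inside a single GPU kernel; it only sequences the ops relative to each other.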
Answer 0 (score: 2)
Here is a way to benchmark the execution that avoids common pitfalls:
import time
import tensorflow as tf  # TF 1.x API

# Turn off graph-rewriting optimizations
config = tf.ConfigProto(
    graph_options=tf.GraphOptions(
        optimizer_options=tf.OptimizerOptions(opt_level=tf.OptimizerOptions.L0)))
# Throw an error if explicit device placement can't be satisfied
config.allow_soft_placement = False

N = 8192
with tf.device("/gpu:0"):
    input1 = tf.Variable(tf.random_normal([N, N]))
    input2 = tf.Variable(tf.random_normal([N, N]))
    result = tf.matmul(input1, input2)
    result_no_output = result.op  # to avoid transferring data back to Python

sess = tf.Session(config=config)

# Load values onto the GPU
sess.run(tf.global_variables_initializer())

# Pre-warming
sess.run(result_no_output)

num_ops = N**3 + N**2*(N-1)  # N^3 muls, N^2*(N-1) adds
elapsed = []
for i in range(10):
    start = time.time()
    sess.run(result_no_output)
    elapsed.append(time.time() - start)
print("%d x %d matmul, %.2f elapsed, %.2f G ops/sec" %
      (N, N, min(elapsed), num_ops/min(elapsed)/10**9))
On a Titan X Pascal this shows 9.5 T ops/sec, close to the theoretical maximum of 11 T ops/sec:
8192 x 8192 matmul, 0.12 elapsed, 9527.10 G ops/sec
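The ops/sec arithmetic can be reproduced in plain Python (no TensorFlow needed); the elapsed time below is an assumed value near the ~0.12 s printed above:

```python
# An N x N matmul does N**3 multiplies and N**2 * (N - 1) adds.
N = 8192
num_ops = N**3 + N**2 * (N - 1)  # equivalently 2*N**3 - N**2
elapsed = 0.115  # assumption: roughly the measured time from the run above
print("%.2f G ops/sec" % (num_ops / elapsed / 10**9))
```

Note that the benchmark divides by min(elapsed), i.e. the fastest of the ten runs, so a printed time rounded to 0.12 s is consistent with a slightly higher ops/sec figure.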