Performance of vectorized vs. non-vectorized vs. multithreaded Julia code

Time: 2018-04-13 06:33:04

Tags: multithreading performance julia benchmarking

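I have a large array of floating-point numbers, and I multiply the array by a scalar. What is the best (fastest) way to do this in Julia?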

decay1

decay2 is the vectorized implementation, decay3 is the devectorized one, and the multithreaded version runs with 4 threads. I see the following timings:

Vectorized: 2.291 ms (4 allocations: 112 bytes)
Devectorized: 2.221 ms (0 allocations: 0 bytes)
Multithreaded: 1.963 ms (1 allocation: 32 bytes)
Multithreaded array multiplication: 87.418 ms (2 allocations: 30.52 MiB)
Scale: 2.042 ms (0 allocations: 0 bytes)
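For reference, the three kinds of implementation compared above (vectorized, devectorized, multithreaded) might look roughly like the following. This is only a minimal sketch: the function names and bodies are hypothetical and are not the code from the question.

```julia
using Base.Threads  # provides the @threads macro

# NOTE: hypothetical names and bodies, written for illustration only.

# Vectorized: in-place broadcast, which fuses into one loop without a temporary array.
function scale_vectorized!(a::Vector{Float64}, k::Float64)
    a .*= k
    return a
end

# Devectorized: explicit loop; @inbounds skips bounds checks, @simd permits SIMD.
function scale_devectorized!(a::Vector{Float64}, k::Float64)
    @inbounds @simd for i in 1:length(a)
        a[i] *= k
    end
    return a
end

# Multithreaded: iterations are split across the threads Julia was started with.
function scale_threaded!(a::Vector{Float64}, k::Float64)
    @threads for i in 1:length(a)
        @inbounds a[i] *= k
    end
    return a
end

a = ones(1_000)
scale_vectorized!(a, 2.0)
scale_devectorized!(a, 2.0)
scale_threaded!(a, 2.0)
```

The number of threads is fixed when Julia starts, for example by setting the JULIA_NUM_THREADS environment variable to 4 before launching it. Each function could then be timed with a benchmarking package such as BenchmarkTools (an assumption here; the question does not say which tool produced the timings).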

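import numpy as np
import tensorflow as tf  # TensorFlow 1.x API (tf.placeholder)

K = 10
myarray1 = tf.placeholder(tf.float32, shape=[None, 5, 5])  # shape = [None, 5, 5]
myarray2 = tf.Variable(np.zeros([K, 5, 5]), dtype=tf.float32)
vals = []
for k in range(K):
    # Elementwise product with the k-th 5x5 slice, summed over the last two axes
    tmp = tf.reduce_sum(myarray1 * myarray2[k], axis=(1, 2))
    vals.append(tmp)

# Minimum over the K stacked sums, one value per batch entry
result = tf.reduce_min(tf.stack(vals, axis=-1), axis=-1)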

The speedup is obviously too low. What exactly am I doing wrong? How can I do better?

0 answers:

No answers