Counters on the GPU are incredibly slow compared with the default?

Date: 2015-11-27 11:59:32

Tags: python cuda tensorflow

EDIT - see the edit at the bottom: with a large vector of counters, TensorFlow on the GPU becomes much faster.

I'm trying to see whether using the GPU gives me any speed advantage. The following program simply counts to 200,000, once using TensorFlow on the GPU and once using plain old Python. The TensorFlow loop takes 14 seconds to run, while the plain Python loop takes only 0.013 seconds. What am I doing wrong? The code follows:


#!/usr/bin/env python
import tensorflow as tf
import sys, time
# Create a Variable, that will be initialized to the scalar value 0.
state = tf.Variable(0, name="counter")                                                                                            

MAX=10000

# Create an Op to add one to `state`.
one = tf.constant(1)
new_value = tf.add(state, one)
update = tf.assign(state, new_value)

# Variables must be initialized by running an `init` Op after having
# launched the graph.  We first have to add the `init` Op to the graph.
init_op = tf.initialize_all_variables()

if __name__ == '__main__' :
    # Launch the graph and run the ops.
    with tf.Session() as sess:
        # Run the 'init' op
        sess.run(init_op)
        # Print the initial value of 'state'
        print sess.run(state)
        # Run the op that updates 'state' and print 'state'.
        print "starting ..."
        t0 = time.time()
        for _ in range(int(sys.argv[1]) if len(sys.argv) > 1 else MAX):
            sess.run(update)

        print str(sess.run(state)) + str(time.time() - t0) 

        count = 0 
        print "starting ..."
        t0 = time.time()
        for _ in range(int(sys.argv[1]) if len(sys.argv) > 1 else MAX):
            count+=1

        print str(count) + str(time.time() - t0) 

Output (the program prints the final count and the elapsed time with no separator, so the two numbers run together in the lines below):

$ ./helloworld.py 200000
I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 8
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:888] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:88] Found device 0 with properties: 
name: GeForce GTX 970
major: 5 minor: 2 memoryClockRate (GHz) 1.3165
pciBusID 0000:01:00.0
Total memory: 4.00GiB
Free memory: 3.69GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:112] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:122] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:643] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 970, pci bus id: 0000:01:00.0)
I tensorflow/core/common_runtime/gpu/gpu_region_allocator.cc:47] Setting region size to 3649540096
I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 8
0
starting ...
20000014.444382906
starting ...
2000000.0131969451904

EDIT - after the suggestion to switch to a vector of counters, TensorFlow on the GPU is dramatically faster. With 10,000 counters per vector:

#!/usr/bin/env python
import tensorflow as tf
import sys, time

CSIZE=10000
# Create a Variable, that will be initialized to the scalar value 0.
state = tf.Variable([0 for x in range(CSIZE)], name="counter")

MAX=1000

# Create an Op to add one to `state`.
one = tf.constant([1 for x in range(CSIZE)])
new_value = tf.add(state, one)
update = tf.assign(state, new_value)

# Variables must be initialized by running an `init` Op after having
# launched the graph.  We first have to add the `init` Op to the graph.
init_op = tf.initialize_all_variables()

if __name__ == '__main__' :
    # Launch the graph and run the ops.
    with tf.Session() as sess:
        # Run the 'init' op
        sess.run(init_op)
        # Print the initial value of 'state'
        print sess.run(state)
        # Run the op that updates 'state' and print 'state'.
        print "starting ..."
        t0 = time.time()
        for _ in range(int(sys.argv[1]) if len(sys.argv) > 1 else MAX):
            sess.run(update)

        print str(sess.run(state)) + str(time.time() - t0) 

        counters = [0 for x in range(CSIZE)]                                                                                      
        print "starting ..."
        t0 = time.time()
        for _ in range(int(sys.argv[1]) if len(sys.argv) > 1 else MAX):
            for x in range(0, len(counters)) :
                counters[x]+=1

        print str(counters[0]) + ", " +  str(time.time() - t0) 

The plain old Python version ran for upwards of a minute before I gave up.
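
The pure-Python version increments each of the 10,000 counters one at a time inside the interpreter loop, which is why it crawls. A more comparable CPU baseline would do one vectorized add per step; a minimal sketch using NumPy (added here for illustration, not part of the original post, and assuming NumPy is available):

#!/usr/bin/env python
import numpy as np
import sys, time

CSIZE = 10000
MAX = 10000

counters = np.zeros(CSIZE, dtype=np.int64)
t0 = time.time()
for _ in range(int(sys.argv[1]) if len(sys.argv) > 1 else MAX):
    counters += 1  # one vectorized add per step, like the TensorFlow version
print str(counters[0]) + ", " + str(time.time() - t0)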

1 Answer:

Answer 0 (score: 3)

In a sense, both programs are "incredibly slow" relative to the number of instructions they have to execute. The single-element counter performs 200,000 increments in 14.4 seconds, using 200,000 calls to sess.run(). The vector counter performs 100,000,000 increments in 0.99 seconds, using 10,000 calls to Session.run(). If you wrote these programs in C, you would find that each counter increment takes at most a few nanoseconds, so where is the time being spent?
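
Dividing the quoted wall-clock times by the number of sess.run() calls gives the per-call cost that the rest of the answer breaks down; a quick back-of-the-envelope calculation (added here for illustration, not from the original answer):

scalar_time = 14.4   # seconds for 200,000 sess.run() calls
vector_time = 0.99   # seconds for 10,000 sess.run() calls
print "%.0f us per call (scalar counter)" % (scalar_time / 200000 * 1e6)   # ~72 us
print "%.0f us per call (vector counter)" % (vector_time / 10000 * 1e6)    # ~99 us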

TensorFlow imposes some per-step overhead, of a few microseconds per Session.run() call. This is a known issue that the team is working to reduce, but it is rarely a problem for typical neural network algorithms, which do a large amount of work in each step. The overhead breaks down as follows:

  • Per-step dispatch overhead: the TensorFlow Session API is string-based, so some string manipulation and hashing has to be done to identify the correct subgraph to run in each step. This involves some Python and some C++ code.
  • Per-op dispatch overhead: this is implemented in C++ and involves setting up a context and dispatching the TensorFlow kernel. There are three ops in your counter benchmark (VariableOp, Add, and Assign).
  • GPU kernel dispatch overhead: dispatching a kernel onto the GPU involves a call into the GPU driver.
  • GPU copy overhead: perhaps surprisingly, sess.run(update) copies the result back from the GPU, because update is a Tensor object (corresponding to the result of the assignment) whose value is returned from the call (see the short sketch after this list).
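
To make the last point concrete, here is a small sketch continuing from the session and ops defined in the question's code (added for illustration, not from the original answer): fetching a Tensor returns its value to the client, while running only the underlying Operation returns nothing.

value = sess.run(update)      # fetches the updated value, copying it back from the GPU
result = sess.run(update.op)  # runs the same computation but returns None; nothing is fetched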

There are a couple of things you can try to speed up both versions of the code:

  • Using state.assign_add(one) instead of separate tf.add and tf.assign ops will reduce the per-op dispatch overhead (and it also performs a more efficient in-place addition).

  • Calling sess.run(update.op) instead of sess.run(update) will avoid copying the updated value back to the client on each step; a sketch combining both suggestions follows below.
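
Putting the two suggestions together, a minimal sketch of the revised single counter (a reconstruction assuming the same TensorFlow 0.x-era API used in the question, not code from the original answer):

#!/usr/bin/env python
import tensorflow as tf
import sys, time

MAX = 10000

state = tf.Variable(0, name="counter")
one = tf.constant(1)
# One fused op that adds and assigns in place, instead of separate tf.add/tf.assign.
update = state.assign_add(one)

init_op = tf.initialize_all_variables()

if __name__ == '__main__':
    with tf.Session() as sess:
        sess.run(init_op)
        t0 = time.time()
        for _ in range(int(sys.argv[1]) if len(sys.argv) > 1 else MAX):
            # Run the op only; nothing is fetched, so no copy back from the GPU each step.
            sess.run(update.op)
        # Fetch the final value once, after the loop.
        print str(sess.run(state)) + ", " + str(time.time() - t0)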