I compared processing times in Python for Theano (CPU), Theano (GPU) and scikit-learn (CPU). However, I got a strange result. Here is the plot I drew.

[Figure: processing-time comparison]

As you can see, scikit-learn is faster than Theano (GPU). The program whose elapsed time I measured computes a Euclidean distance matrix from a matrix with n * 40 elements.

Here is part of the code.
import numpy as np
import theano
import theano.tensor as T
import sklearn.metrics as sk   # `sk` presumably refers to sklearn.metrics

# symbolic input matrix (n x 40 points)
points = T.fmatrix("points")
edm = T.zeros_like(points)

# distances from one point to all points (used as the scan body)
def get_point_to_points_euclidean_distances(point_id):
    euclideans = T.sqrt((T.sqr(points - points[point_id, :])).sum(axis=1))
    return euclideans

# NumPy reference: fills the EDM one row per loop iteration
def get_EDM_CPU(points):
    EDM = np.zeros((points.shape[0], points.shape[0])).astype(np.float32)
    for row in range(points.shape[0]):
        EDM[row, :] = np.sqrt(np.sum((points - points[row, :])**2, axis=1))
    return EDM

# scikit-learn version (the original snippet passed an undefined name `a` here)
def get_sk(points):
    EDM = sk.pairwise_distances(points, metric='l2')
    return EDM

# build the GPU function: scan over all row indices
seq = T.arange(T.shape(points)[0])
(result, _) = theano.scan(fn=get_point_to_points_euclidean_distances,
                          outputs_info=None,
                          sequences=seq)
get_EDM_GPU = theano.function(inputs=[points], outputs=result, allow_input_downcast=True)
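The timing script itself is not shown above; a minimal sketch of such a harness, assuming random float32 test data with the (10000, 40) shape seen in the profiler log below and the functions defined above, could look like this:

import time

n = 10000
test_points = np.random.rand(n, 40).astype(np.float32)   # assumed test data, shape matches the log

start = time.time()
edm_gpu = get_EDM_GPU(test_points)
print("get_EDM_GPU elapsed time : %s (s)" % (time.time() - start))

start = time.time()
edm_cpu = get_EDM_CPU(test_points)
print("get_EDM_CPU elapsed time : %s (s)" % (time.time() - start))

start = time.time()
edm_sk = get_sk(test_points)
print("get_EDM_sk elapsed time : %s (s)" % (time.time() - start))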
I thought the reason the GPU is slower than scikit-learn might be the transfer time, so I profiled the GPU run with the nvprof command. This is what I got.
==27105== NVPROF is profiling process 27105, command: python ./EDM_test.py
Using gpu device 0: GeForce GTX 580 (CNMeM is disabled, cuDNN not available)
data shape : (10000, 40)
get_EDM_GPU elapsed time : 1.84863090515 (s)
get_EDM_CPU elapsed time : 8.09937691689 (s)
get_EDM_sk elapsed time : 1.10968112946 (s)
ratio : 4.38128395145
==27105== Profiling application: python ./EDM_test.py
==27105== Warning: Found 9 invalid records in the result.
==27105== Warning: This could be because device ran out of memory when profiling.
==27105== Profiling result:
Time(%) Time Calls Avg Min Max Name
71.34% 1.28028s 9998 128.05us 127.65us 128.78us kernel_reduce_01_node_316e2e1cbfbe8cfb8e4a101f329ffeec_0(int, int, float const *, int, int, float*, int)
19.95% 357.97ms 9997 35.807us 35.068us 36.948us kernel_Sub_node_bc41b3f8f12c93d29f2c4360ad445d80_0_2(unsigned int, int, int, float const *, int, int, float const *, int, int, float*, int, int)
7.32% 131.38ms 2 65.690ms 1.2480us 131.38ms [CUDA memcpy DtoH]
1.25% 22.456ms 9996 2.2460us 2.1140us 2.8420us kernel_Sqrt_node_23508f8f49d12f3e8369d543f5620c15_0_Ccontiguous(unsigned int, float const *, float*)
0.12% 2.1847ms 1 2.1847ms 2.1847ms 2.1847ms [CUDA memset]
0.01% 259.73us 5 51.946us 640ns 250.36us [CUDA memcpy HtoD]
0.00% 17.086us 1 17.086us 17.086us 17.086us kernel_reduce_ccontig_node_97496c4d3cf9a06dc4082cc141f918d2_0(unsigned int, float const *, float*)
0.00% 2.0090us 1 2.0090us 2.0090us 2.0090us void copy_kernel<float, int=0>(cublasCopyParams<float>)
The transfer [CUDA memcpy DtoH] was executed twice: { 1.248 [us], 131.38 [ms] }.

The transfer [CUDA memcpy HtoD] was executed 5 times: { min: 640 [ns], max: 250.36 [us] }.

So the total transfer time is about 131.64 ms (131.38 ms + 259.73 us). But the gap between the GPU and scikit-learn is about 700 ms (1.85 s - 1.11 s), so the gap is larger than the transfer time alone can explain.
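As a quick sanity check, using the numbers taken straight from the log above:

# host<->device copy time vs. the GPU / scikit-learn gap, from the nvprof log
transfer_s = 131.38e-3 + 259.73e-6        # total DtoH + total HtoD copy time
gap_s = 1.84863090515 - 1.10968112946     # get_EDM_GPU minus get_EDM_sk elapsed time
print("transfer : %.3f s, gap : %.3f s" % (transfer_s, gap_s))   # ~0.132 s vs ~0.739 s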
Does it perhaps only compute the upper triangle of the symmetric matrix?

What makes scikit-learn so fast?

Answer 0 (score: 2):
My initial candidates would be:

- numpy's smart vectorised execution, which is friendly to CPU cache-lines (a sketch of such a vectorised distance computation follows this answer)
- avoiding numpy .astype() conversions (test it), which may enjoy even better timings

GPUs show their best performance when there are enough densely convolved re-processing operations -- as in large-scale / high-resolution image processing -- on convolution-kernel matrices so small that all of those [m,n,o] constant values can stay local to the SM, inside the set of available SMX SM_registers, and when the GPU kernel launcher is optimally tuned with a 3D tblock/grid processing-layout geometry, so that global-memory access latencies are masked as well as possible, with all GPU threads kept within the hardware WARP-aligned SMx WarpScheduler Round-Robin thread-scheduling capabilities (the first swap from Round-Robin into Greedy-WarpSchedule mode loses the whole battle as soon as execution paths diverge inside the GPU-kernel code).
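To illustrate that first candidate: scikit-learn's exact code path is not shown here, but pairwise Euclidean distances are commonly vectorised through the identity ||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2, which turns the whole EDM into a single BLAS matrix product plus two broadcast additions instead of a Python-level loop (or a scan) over the 10000 rows. A minimal numpy sketch of that idea, not scikit-learn's actual implementation:

import numpy as np

def edm_vectorised(points):
    # squared norm of every row, shape (n, 1)
    sq = np.sum(points ** 2, axis=1, keepdims=True)
    # ||x||^2 - 2 x.y + ||y||^2 for every pair, via one matrix product
    d2 = sq - 2.0 * np.dot(points, points.T) + sq.T
    # clip tiny negatives caused by floating-point round-off before the sqrt
    return np.sqrt(np.maximum(d2, 0.0))

test_points = np.random.rand(10000, 40).astype(np.float32)
EDM = edm_vectorised(test_points)   # matches get_EDM_CPU(test_points) up to round-off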