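I'm having trouble interpreting the results of a comparison.
The machines are a laptop with an i7 / Intel HD4000 and a server with an 8-core Xeon 5400 / Radeon HD 7970.
I am multiplying two matrices: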
int M =1024*2, N = 1024*6, P = 1024*2;
//        N                 P
//  |-----------|     |-----------|
//  |           |     |           |
//  |M          |  *  |N          |
//  |           |     |           |
//  |-----------|     |-----------|
Here is the kernel:
/*
* Copyright 1993-2010 NVIDIA Corporation. All rights reserved.
*
* Please refer to the NVIDIA end user license agreement (EULA) associated
* with this source code for terms and conditions that govern your use of
* this software. Any use, reproduction, disclosure, or distribution of
* this software and related documentation outside the terms of the EULA
* is strictly prohibited.
*
*/
/* Matrix multiplication: C = A * B.
* Device code.
*/
#ifndef BLOCK_SIZE
#define BLOCK_SIZE 16
#endif
#define AS(i, j) As[j + i * BLOCK_SIZE]
#define BS(i, j) Bs[j + i * BLOCK_SIZE]
///////////////////////////////////////////////////////////////////////////////
//! Matrix multiplication on the device: C = A * B
//! uiWA is A's width and uiWB is B's width
////////////////////////////////////////////////////////////////////////////////
__kernel void
m_m_mul( __global float* A, __global float* B, __global float* C,
         /*__local float* As, __local float* Bs,*/ int uiWA, int uiWB, int trueLocalSize1)
{
    __local float As[BLOCK_SIZE*BLOCK_SIZE];
    __local float Bs[BLOCK_SIZE*BLOCK_SIZE];

    // Block index
    int bx = get_group_id(0);
    int by = get_group_id(1);

    // Thread index
    int tx = get_local_id(0);
    int ty = get_local_id(1);

    // Index of the first sub-matrix of A processed by the block
    int aBegin = uiWA * BLOCK_SIZE * by;

    // Index of the last sub-matrix of A processed by the block
    int aEnd = aBegin + uiWA - 1;

    // Step size used to iterate through the sub-matrices of A
    int aStep = BLOCK_SIZE;

    // Index of the first sub-matrix of B processed by the block
    int bBegin = BLOCK_SIZE * bx;

    // Step size used to iterate through the sub-matrices of B
    int bStep = BLOCK_SIZE * uiWB;

    // Csub is used to store the element of the block sub-matrix
    // that is computed by the thread
    float Csub = 0.0f;

    // Loop over all the sub-matrices of A and B
    // required to compute the block sub-matrix
    for (int a = aBegin, b = bBegin;
         a <= aEnd;
         a += aStep, b += bStep) {

        // Load the matrices from device memory
        // to shared memory; each thread loads
        // one element of each matrix
        AS(ty, tx) = A[a + uiWA * ty + tx];
        BS(ty, tx) = B[b + uiWB * ty + tx];

        // Synchronize to make sure the matrices are loaded
        barrier(CLK_LOCAL_MEM_FENCE);

        // Multiply the two matrices together;
        // each thread computes one element
        // of the block sub-matrix
        #pragma unroll
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += AS(ty, k) * BS(k, tx);

        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (get_global_id(1) < trueLocalSize1) {
        // Write the block sub-matrix to device memory;
        // each thread writes one element
        C[get_global_id(1) * get_global_size(0) + get_global_id(0)] = Csub;
    }
}
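The `trueLocalSize1` guard on the final write suggests that the host rounds the global work size in dimension 1 up to a multiple of `BLOCK_SIZE`, so that some work-items fall past the last real row of C. The host code is not shown, so the following is only a sketch of that rounding under that assumption. `roundUpToMultiple`, `LaunchSize` and `makeLaunchSize` are hypothetical names, not part of OpenCL.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not an OpenCL API): smallest multiple of `block`
// that is greater than or equal to `n`.
std::size_t roundUpToMultiple(std::size_t n, std::size_t block) {
    assert(block > 0);
    return (n + block - 1) / block * block;
}

// Hypothetical description of a 2D launch for the kernel above.
struct LaunchSize {
    std::size_t global[2];  // total work-items per dimension
    std::size_t local[2];   // work-group size per dimension
};

// Assumed mapping: dimension 0 walks the columns of C, dimension 1 the rows.
// The kernel uses get_global_size(0) as the row stride of C and has no guard
// in dimension 0, so this sketch requires colsC to be a multiple of blockSize
// and pads only dimension 1.
LaunchSize makeLaunchSize(std::size_t rowsC, std::size_t colsC,
                          std::size_t blockSize) {
    assert(blockSize > 0 && colsC % blockSize == 0);
    LaunchSize s;
    s.global[0] = colsC;
    s.global[1] = roundUpToMultiple(rowsC, blockSize);
    s.local[0] = blockSize;
    s.local[1] = blockSize;
    return s;
}
```

For the sizes in this post (M = 2048, P = 2048, `BLOCK_SIZE` 16) the padding is a no-op, since both are already multiples of 16. Note that the loads from A and B in the kernel are not guarded, so if rows were padded, the buffer for A would presumably have to cover the padded rows as well.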
I compare it against Eigen::Matrix<float,-1,-1,Eigen::RowMajor> m4 = m1 * m2;
On the server:
Creating matrices on GPU....... Done [0ms]
Creating matrices on CPU....... Done [0ms]
Filling GPU with random numbers....... Done [19ms]
M3 = M1 * M2... on GPU (Loading Kernels)... Done [240ms]
M3 = M1 * M2... on GPU (3 times)... Done [211ms]
Loading M1, M2 on GPU... Done [93ms]
M4 = M1 * M2 on CPU... Done [7775ms] Error:3.78049e-008
Press any key to continue . . .
Matlab: Elapsed time is 3.010626 seconds.
On the laptop:
Creating matrices on GPU....... Done [22ms]
Creating matrices on CPU....... Done [0ms]
Filling GPU with random numbers....... Done [35ms]
M3 = M1 * M2... on GPU (Loading Kernels)... Done [2975ms]
M3 = M1 * M2... on GPU (3 times)... Done [6891ms]
Loading M1, M2 on GPU... Done [80ms]
M4 = M1 * M2 on CPU... Done [5966ms] Error:3.78049e-008
Press any key to continue . . .
Matlab: Elapsed time is 2.310626 seconds.
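My questions: 1) Why is Eigen faster on the laptop than on the 8-core Xeon? Could it be that Eigen uses only one core on both systems and the i7 has the higher clock speed (2.0 vs 2.4)?
2) On the laptop, the Intel HD4000 is nearly 3x faster than Eigen, yet Matlab needs only 2.3 seconds for the same multiplication, which is about the same as the kernel on the HD4000. (Is there anything I can do to make Eigen run as fast as Matlab?)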
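To compare these timings on one scale, it can help to convert them to throughput. A dense (M x N) by (N x P) product performs about 2*M*N*P floating-point operations (one multiply and one add per inner-loop step), which is roughly 51.5 GFLOP for the sizes above. The sketch below does that conversion. `matmulFlops` and `gflopsPerSecond` are hypothetical helper names, and the GPU figures assume that the "3 times" timing covers three complete multiplications.

```cpp
#include <cstdint>

// Floating-point operations in an (m x n) * (n x p) product:
// one multiply and one add for each of the m*n*p inner-loop steps.
std::uint64_t matmulFlops(std::uint64_t m, std::uint64_t n, std::uint64_t p) {
    return 2ULL * m * n * p;
}

// Throughput in GFLOP/s for `runs` multiplications that took `ms`
// milliseconds in total.
double gflopsPerSecond(std::uint64_t flops, double ms, int runs = 1) {
    return static_cast<double>(flops) * runs / (ms * 1e-3) / 1e9;
}
```

With M = 2048, N = 6144, P = 2048 and the timings from the logs, this gives roughly 6.6 GFLOP/s for Eigen on the server (7775 ms) and 8.6 GFLOP/s on the laptop (5966 ms), against about 17.1 and 22.3 GFLOP/s for Matlab. Under the three-runs assumption, the HD4000 reaches about 22.4 GFLOP/s (6891 ms) and the Radeon about 733 GFLOP/s (211 ms).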
Answer 0 (score: 1)
http://eigen.tuxfamily.org/dox/TopicMultiThreading.html
Enabling OpenMP in Visual Studio made my code run on 8 cores and cut the run time significantly. It now runs at about 80% of Matlab's speed.
Cores: 8
M: 4096 N:12288 P:4096
Creating matrices on GPU....... Done [0ms]
Creating matrices on CPU....... Done [0ms]
Filling GPU with random numbers....... Done [44ms]
M3 = M1 * M2... on GPU (Loading Kernels)... Done [850ms]
M3 = M1 * M2... on GPU (3 times)... Done [2063ms]
Loading M1, M2 on GPU... Done [355ms]
M4 = M1 * M2 on CPU... Done [22263ms] Error:5.70124e-007
Press any key to continue . . .
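As a rough illustration of what the OpenMP switch does (`/openmp` in Visual Studio, `-fopenmp` in GCC or Clang), the sketch below spreads the rows of a row-major product across threads with a single pragma. It is a naive triple loop written for this note, not Eigen's internal algorithm, and `multiplyRowMajor` is a hypothetical name. Compiled without the OpenMP switch, the pragma is ignored and the loop runs on one core.

```cpp
#include <cstddef>
#include <vector>

// Naive row-major product C = A * B, with A of size m x n and B of size n x p.
// Illustration only: this is not how Eigen implements its product.
std::vector<float> multiplyRowMajor(const std::vector<float>& A,
                                    const std::vector<float>& B,
                                    int m, int n, int p) {
    std::vector<float> C(static_cast<std::size_t>(m) * p, 0.0f);
    // Each row of C is independent, so rows can be handed to different threads.
    // Without /openmp or -fopenmp this pragma has no effect.
    #pragma omp parallel for
    for (int i = 0; i < m; ++i) {
        for (int k = 0; k < n; ++k) {
            const float a = A[static_cast<std::size_t>(i) * n + k];
            for (int j = 0; j < p; ++j) {
                C[static_cast<std::size_t>(i) * p + j] +=
                    a * B[static_cast<std::size_t>(k) * p + j];
            }
        }
    }
    return C;
}
```

For Eigen itself, the page linked in this answer describes how to control the number of threads, for example through the `OMP_NUM_THREADS` environment variable or `Eigen::setNbThreads`.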