Time required to offload a function to Intel Xeon Phi

Time: 2017-11-28 22:39:05

Tags: c hpc icc xeon-phi intel-mic

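Does an offload call need a predefined amount of time to transfer a function's data (parameters) from the host to the Intel MIC (Xeon Phi coprocessor 3120 series)?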

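Specifically, I make an offload call ("#pragma offload target(mic)") for a function that I want to execute on the MIC. The function has 15 parameters (pointers and variables), and I have confirmed that the parameters are passed to the MIC correctly. However, I have simplified the code to check the time spent passing the parameters, so it contains only a simple "printf()" call. I measure the time with "gettimeofday()" from the "sys/time.h" header, as shown in the code below: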

Some hardware information for the host: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz / CentOS release 6.8 / PCI Express Revision 2.0

main.c

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>
#include <string.h>

__attribute__ (( target (mic))) unsigned long long ForSolution = 0;
__attribute__ (( target (mic))) unsigned long long sufficientSol = 1;
__attribute__ (( target (mic))) float timer = 0.0;

__attribute__ (( target (mic))) void function(float *grid, float *displ, unsigned long long *li, unsigned long long *repet, float *solution, unsigned long long dim, unsigned long long numOfa, unsigned long long numLoops, unsigned long long numBlock, unsigned long long thread, unsigned long long blockGrid, unsigned long long station, unsigned long long bytesSol, unsigned long long totalSol, volatile unsigned long long *prog);

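   struct timeval start, end;   /* timestamps taken around the offload */
   float    *grid, *displ, *solution;
   unsigned long long   *li, *repet;   /* repet is passed as a pointer below */
   volatile unsigned long long  *prog;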
   unsigned long long dim = 10, grid_a = 3, numLoops = 2, numBlock = 0;
   unsigned long long thread = 220, blockGrid = 0, station = 12;
   unsigned long long station_at = 8, bytesSol, totalSol;

   bytesSol = dim*sizeof(float);
   totalSol = ((1024 * 1024 * 1024) / bytesSol) * bytesSol;



   /******** Some memcpy() functions here for the pointers*********/                   



   gettimeofday(&start, NULL);

   #pragma offload target(mic) \
        in(grid:length(dim * grid_a * sizeof(float))) \
        in(displ:length(station * station_at * sizeof(float))) \
        in(li:length(dim * sizeof(unsigned long long))) \
        in(repet:length(dim * sizeof(unsigned long long))) \
        out(solution:length(totalSol/sizeof(float))) \
        in(dim,grid_a,numLoops,numBlock,thread,blockGrid,station,bytesSol,totalSol) \
        in(prog:length(sizeof(volatile unsigned long long))) \
        inout(ForSolution,sufficientSol,timer)
   {
        function(grid, displ, li, repet, solution, dim, grid_a, numLoops, numBlock, thread, blockGrid, station, bytesSol, totalSol, prog);
   }

    gettimeofday(&end, NULL);  

    printf("Time to tranfer data on Intel Xeon Phi: %f sec\n", (((end.tv_sec - start.tv_sec) * 1000000.0 + (end.tv_usec - start.tv_usec)) / 1000000.0) - timer);
    printf("Time for calculations: %f sec\n", timer);

function.c

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>
#include <string.h>
#include <omp.h>

void function(float *grid, float *displ, unsigned long long *li, unsigned long long *repet, float *solution, unsigned long long dim, unsigned long long numOfa, unsigned long long numLoops, unsigned long long numBlock, unsigned long long thread, unsigned long long blockGrid, unsigned long long station, unsigned long long bytesSol, unsigned long long totalSol, volatile unsigned long long *prog)
{
    struct timeval      timer_start, timer_end;

    gettimeofday(&timer_start, NULL);

    printf("Hello World!!!\n");


    gettimeofday(&timer_end, NULL);

    timer = ((timer_end.tv_sec - timer_start.tv_sec) * 1000000.0 + (timer_end.tv_usec - timer_start.tv_usec)) / 1000000.0 ;  
}

Terminal output:

Time to tranfer data on Intel Xeon Phi: 3.512706 sec
Time for calculations: 0.000002 sec
Hello World!!!

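The code takes 3.5 seconds to complete the "offload target" call. Are these results normal? Is there any way to reduce the significant delay of the offload call?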

1 answer:

Answer 0 (score: 4)

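Let's look at the steps here:

a) For the very first #pragma offload, the MIC is initialized. This probably includes resetting it, booting a stripped-down Linux (and waiting for it to start all the CPUs, initialize its memory management, start a pseudo-NIC driver, and so on), and uploading your code to the device. This alone probably takes several seconds.

b) All of the input data is uploaded to the MIC.

c) The function is executed.

d) All of the output data is downloaded from the MIC.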

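For raw data transfers over "PCI Express Revision 2.0 (x16)" the maximum bandwidth is 8 GB/s; however, you are not going to get the maximum bandwidth. From what I remember, communication with the Phi involves shared ring buffers and "doorbell" IRQs, with a "pseudo-NIC" driver on both sides (on the host, and on the coprocessor's OS). With all the handshaking and overhead, I would be surprised if you got half of the maximum bandwidth.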
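The 8 GB/s figure can be reproduced from the published PCI Express 2.0 signalling rate. This is only a sketch of the arithmetic, assuming 5 GT/s per lane, 8b/10b line encoding, and a full x16 link; it describes the theoretical peak in one direction, not what an offload will actually achieve:

```latex
\[
5\ \text{GT/s} \times \frac{8}{10} \times \frac{1\ \text{byte}}{8\ \text{bits}}
  = 0.5\ \text{GB/s per lane},
\qquad
16\ \text{lanes} \times 0.5\ \text{GB/s} = 8\ \text{GB/s}.
\]
```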

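I would assume that the total amount of code uploaded, data uploaded, and data downloaded is well over 1 GiB (for example, out(solution:length(totalSol/sizeof(float))) is 1 GiB all by itself). If we assume "about 4 GiB/s", that is at least another ~250 ms.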

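My suggestion is to do everything twice, and to measure the difference between the first run (which includes initializing everything) and the second run (when everything is already initialized) to determine how long it takes to initialize the coprocessor. The second measurement (minus the time taken to execute the function) tells you how long the data transfer took.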