I am working on a project to parallelize the simulated annealing algorithm used for placement (place-and-route) in the VPR (Versatile Place and Route) tool.
Essentially, I need to convert part of one of the many C files the tool uses into CUDA C. I need a whole section of code to run in parallel on multiple cores, and each core has to work on its own separate copy of the data, so I assume I need to copy that data from host to device memory.
Is it possible to do all of this without modifying the code?
As Janisz suggested, I am attaching the part of the code I am interested in.
while (exit_crit(t, cost, annealing_sched) == 0)
{
// Starting here, I require this part to run on different cores.
// Not the entire while loop.
av_cost = 0.; // These variables should be a local copy for each core.
av_bb_cost = 0.;
av_delay_cost = 0.;
av_timing_cost = 0.;
sum_of_squares = 0.;
success_sum = 0;
inner_crit_iter_count = 1;
for (inner_iter=0; inner_iter < move_lim; inner_iter++) {
// This function try_swap also has to run on different cores, and it
// must be run on a local copy of the data, i.e. each core needs to
// operate entirely on its own data. The functions it calls have the
// same requirements.
if (try_swap(t, &cost, &bb_cost, &timing_cost,
rlim, pins_on_block, placer_opts.place_cost_type,
old_region_occ_x, old_region_occ_y, placer_opts.num_regions,
fixed_pins, placer_opts.place_algorithm,
placer_opts.timing_tradeoff, inverse_prev_bb_cost,
inverse_prev_timing_cost, &delay_cost) == 1) {
success_sum++;
av_cost += cost;
av_bb_cost += bb_cost;
av_timing_cost += timing_cost;
av_delay_cost += delay_cost;
sum_of_squares += cost * cost;
}
#ifdef VERBOSE
printf("t = %g cost = %g bb_cost = %g timing_cost = %g move = %d dmax = %g\n",
t, cost, bb_cost, timing_cost, inner_iter, d_max);
if (fabs(bb_cost - comp_bb_cost(CHECK, placer_opts.place_cost_type,
placer_opts.num_regions)) > bb_cost * ERROR_TOL)
exit(1);
#endif
}
moves_since_cost_recompute += move_lim;
if (moves_since_cost_recompute > MAX_MOVES_BEFORE_RECOMPUTE) {
new_bb_cost = recompute_bb_cost (placer_opts.place_cost_type,
placer_opts.num_regions);
if (fabs(new_bb_cost - bb_cost) > bb_cost * ERROR_TOL) {
printf("Error in try_place: new_bb_cost = %g, old bb_cost = %g.\n",
new_bb_cost, bb_cost);
exit (1);
}
bb_cost = new_bb_cost;
if (placer_opts.place_algorithm == BOUNDING_BOX_PLACE) {
cost = new_bb_cost;
}
moves_since_cost_recompute = 0;
}
tot_iter += move_lim;
success_rat = ((float) success_sum) / move_lim;
if (success_sum == 0) {
av_cost = cost;
av_bb_cost = bb_cost;
av_timing_cost = timing_cost;
av_delay_cost = delay_cost;
}
else {
av_cost /= success_sum;
av_bb_cost /= success_sum;
av_timing_cost /= success_sum;
av_delay_cost /= success_sum;
}
std_dev = get_std_dev (success_sum, sum_of_squares, av_cost);
#ifndef SPEC
printf("%11.5g %10.6g %11.6g %11.6g %11.6g %11.6g %11.4g %9.4g %8.3g %7.4g %7.4g %10d ",
       t, av_cost, av_bb_cost, av_timing_cost, av_delay_cost,
       place_delay_value, d_max, success_rat, std_dev, rlim,
       crit_exponent, tot_iter);
#endif
// The while loop continues, but everything up to this point is what
// needs to run on different cores.
To summarize: the code given here, including the functions it calls, must run on multiple cores simultaneously, i.e. the code runs many times over, each instance on its own core.
Answer 0 (score: 3)
If you don't want to change the code line by line, you could try OpenACC.
OpenACC lets you parallelize legacy scientific and technical Fortran and C code through compiler directives, without modifying or adapting the underlying code itself. You simply identify the regions of code you want to accelerate and insert compiler directives; the compiler then maps the original sequential computation onto a parallel accelerator.
I have no personal experience with it, but from some conference presentations I have attended, the ease of parallelization trades off against peak performance.
Answer 1 (score: 0)
Each core has to work on its own separate copy of the data, so I assume I need to copy that data from host to device memory.
Yes, you will. If it is a "small" matrix, it may fit into the read-only (constant) memory of your target CUDA (or OpenCL) device, which can bring a significant performance benefit. If not, your target CUDA device will likely still have faster memory access than your current target.
Is it possible to do all of this without modifying the code line by line?
For the most part, yes. The bulk of the porting challenge is taking the major axis (or axes) of your iterative approach and instead making the body of a single loop use some clever indexing to load its inputs and/or store its results. How hard that is depends on the complexity of the code being ported, but for a reasonably simple algorithm it should not be much of a challenge.