Question

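Section 5.4.2 of the CUDA C Programming Guide states that branch divergence is handled either by "branch instructions" or, under certain conditions, by "predicated instructions". I don't understand the difference between the two, or why one leads to better performance than the other.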

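This comment suggests that branch instructions lead to a greater number of executed instructions, to stalls due to "branch address resolution and fetch", and to overhead due to "the branch itself" and "bookkeeping for divergence", whereas predicated instructions incur only the "instruction execution latency to do the condition test and set the predicate". Why?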

Answer 1

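Instruction predication means that a thread executes an instruction conditionally, depending on a predicate. Threads for which the predicate is true execute the instruction; the rest do nothing.

For example: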

var = 0;

// Not taken by all threads
if (condition) {
    var = 1;
} else {
    var = 2;
}

output = var;

could result in (not actual compiler output):

       mov.s32 var, 0;       // Executed by all threads.
       setp pred, condition; // Executed by all threads, sets predicate.

@pred  mov.s32 var, 1;       // Executed only by threads where pred is true.
@!pred mov.s32 var, 2;       // Executed only by threads where pred is false.
       mov.s32 output, var;  // Executed by all threads.

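All in all, that's 3 instructions for the if (the setp and the two predicated movs), with no branching. Very efficient.

The equivalent code with branches would look like: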

       mov.s32 var, 0;       // Executed by all threads.
       setp pred, condition; // Executed by all threads, sets predicate.

@!pred bra IF_FALSE;         // Conditional branches are predicated instructions.
IF_TRUE:                    // Label for clarity, not actually used.
       mov.s32 var, 1;
       bra IF_END;
IF_FALSE:
       mov.s32 var, 2;
IF_END:
       mov.s32 output, var;

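Notice how much longer it is (5 instructions for the if). The conditional branch requires disabling part of the warp, executing the first path, then rolling back to the point where the warp diverged and executing the second path until both converge. It takes longer and requires extra bookkeeping and more code loading (particularly when there are many instructions to execute), hence more memory requests. All of that makes branching slower than simple predication.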
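To make the instruction counts above concrete, here is a minimal host-side sketch in C++ that models a single 32-lane warp running both versions of the `if`. It is a toy model, not GPU code: the names `WarpResult`, `run_predicated`, and `run_branching` are hypothetical, and the model only counts issued instructions. It ignores the stall and bookkeeping costs mentioned above, which would penalize the branching version further.

```cpp
#include <array>
#include <cstdint>

// Toy model of one warp (assumption: 32 lanes, one issue slot per instruction).
constexpr int kWarpSize = 32;

struct WarpResult {
    std::array<int, kWarpSize> output{};  // per-lane final value of `var`
    int issued = 0;                       // instructions issued for the `if`
};

// Predicated form: each instruction is issued once for the whole warp and
// the per-lane predicate bit decides whether it has any effect on that lane.
WarpResult run_predicated(std::uint32_t pred_mask) {
    WarpResult r;
    r.issued += 1;  // setp pred, condition
    r.issued += 1;  // @pred  mov var, 1
    r.issued += 1;  // @!pred mov var, 2
    for (int lane = 0; lane < kWarpSize; ++lane) {
        const bool pred = ((pred_mask >> lane) & 1u) != 0;
        r.output[lane] = pred ? 1 : 2;
    }
    return r;
}

// Branching form: when lanes disagree, the two paths are executed one after
// the other, each with only part of the warp active.
WarpResult run_branching(std::uint32_t pred_mask) {
    WarpResult r;
    const std::uint32_t true_lanes = pred_mask;    // fall through to IF_TRUE
    const std::uint32_t false_lanes = ~pred_mask;  // jump to IF_FALSE
    r.issued += 1;  // setp pred, condition
    r.issued += 1;  // @!pred bra IF_FALSE
    if (true_lanes != 0) {
        r.issued += 2;  // mov var, 1 ; bra IF_END
    }
    if (false_lanes != 0) {
        r.issued += 1;  // mov var, 2
    }
    for (int lane = 0; lane < kWarpSize; ++lane) {
        const bool pred = ((pred_mask >> lane) & 1u) != 0;
        r.output[lane] = pred ? 1 : 2;
    }
    return r;
}
```

With a divergent mask such as `0x0000FFFF`, this model issues 3 instructions for the predicated form and 5 for the branching form, matching the counts above. When every lane agrees (mask `0xFFFFFFFF` or `0`), the branching form skips one path entirely, which is why branches can still pay off for long bodies that a whole warp usually skips.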

And actually, for this very simple conditional assignment, the compiler can do even better, with only 2 instructions for the if:

mov.s32 var, 0;       // Executed by all threads.
setp pred, condition; // Executed by all threads, sets predicate.
selp var, 1, 2, pred; // Sets var depending on predicate (true: 1, false: 2).
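At the source level, the select form corresponds to a plain conditional expression. The sketch below shows the two equivalent ways of writing the assignment in C++; the function names `select_value` and `branchy_value` are hypothetical. Whether the compiler actually emits a select, predicated moves, or a branch for either one is its own decision, so treat this as an illustration of the pattern rather than a guarantee about generated code.

```cpp
// Conditional expression: a natural candidate for a single select
// instruction, since both operands are cheap constants.
int select_value(bool condition) {
    return condition ? 1 : 2;
}

// The if/else form from the example above. It computes the same result,
// and a compiler is free to lower it to the same instructions.
int branchy_value(bool condition) {
    int var = 0;
    if (condition) {
        var = 1;
    } else {
        var = 2;
    }
    return var;
}
```

To see what a given compiler version really generates for a kernel containing such code, one option is to inspect the PTX produced by `nvcc -ptx`.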
