Question

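I am working on a Udacity quiz for their parallel programming course. I am pretty stuck on how I should start the assignment, because I am not sure I understand it correctly.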

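For the assignment (in the code below), we are given two arrays: an array of values and an array of positions. We are supposed to sort the array of values with a parallel radix sort, while also setting the positions correctly.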

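I completely understand radix sort and how it works. What I don't understand is how they want us to implement it. Here is the template given to start the assignment: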

//Udacity HW 4
//Radix Sorting

#include "reference_calc.cpp"
#include "utils.h"

/* Red Eye Removal
   ===============

   For this assignment we are implementing red eye removal.  This is
   accomplished by first creating a score for every pixel that tells us how
   likely it is to be a red eye pixel.  We have already done this for you - you
   are receiving the scores and need to sort them in ascending order so that we
   know which pixels to alter to remove the red eye.

   Note: ascending order == smallest to largest

   Each score is associated with a position, when you sort the scores, you must
   also move the positions accordingly.

   Implementing Parallel Radix Sort with CUDA
   ==========================================

   The basic idea is to construct a histogram on each pass of how many of each
   "digit" there are.   Then we scan this histogram so that we know where to put
   the output of each digit.  For example, the first 1 must come after all the
   0s so we have to know how many 0s there are to be able to start moving 1s
   into the correct position.

   1) Histogram of the number of occurrences of each digit
   2) Exclusive Prefix Sum of Histogram
   3) Determine relative offset of each digit
        For example [0 0 1 1 0 0 1]
                ->  [0 1 0 1 2 3 2]
   4) Combine the results of steps 2 & 3 to determine the final
      output location for each element and move it there

   LSB Radix sort is an out-of-place sort and you will need to ping-pong values
   between the input and output buffers we have provided.  Make sure the final
   sorted results end up in the output buffer!  Hint: You may need to do a copy
   at the end.

 */


void your_sort(unsigned int* const d_inputVals,
               unsigned int* const d_inputPos,
               unsigned int* const d_outputVals,
               unsigned int* const d_outputPos,
               const size_t numElems)
{

}

I specifically don't understand how those 4 steps end up sorting the array.

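So for the first step, I am supposed to create a histogram of the "digits" (why in quotes?). So given an input value n, I need to put the count of 0s and 1s into a histogram. So, should step 1 create an array of histograms, one for each input value?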

And for the rest of the steps, it falls apart pretty quickly. Could someone show me how these steps are supposed to implement a radix sort?

Answer 1

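The basic idea behind a radix sort is that we sort the elements digit by digit, from the least significant digit to the most significant. For each digit, we move the elements so that those digits are in increasing order.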

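Let's take a very simple example. Let's sort four quantities, each of which has 4 binary digits. Let's choose 1, 4, 7, and 14. We'll mix them up and also show their binary representation: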

Element #    1       2       3       4
Value:       7       14      4       1
Binary:      0111    1110    0100    0001

First we will consider bit 0:

Element #    1       2       3       4
Value:       7       14      4       1
Binary:      0111    1110    0100    0001
bit 0:       1       0       0       1

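Now the radix sort algorithm says we must move the elements in such a way that (considering only bit 0) all the zeroes are on the left and all the ones are on the right. We do this while preserving the relative order of the elements with a zero bit and the relative order of the elements with a one bit (that is, each pass is stable). We could do that like this: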

Element #    2       3       1       4
Value:       14      4       7       1
Binary:      1110    0100    0111    0001
bit 0:       0       0       1       1
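A single pass like the one above can be sketched on the CPU with a stable partition. This is only a sequential illustration, not the parallel method the assignment asks for, and `partition_by_bit` is a hypothetical helper name:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One radix-sort pass (hypothetical helper): stably move every element whose
// chosen bit is 0 ahead of every element whose chosen bit is 1.
std::vector<unsigned int> partition_by_bit(std::vector<unsigned int> v, unsigned int bit)
{
    assert(bit < 32);
    const unsigned int mask = 1u << bit;
    // std::stable_partition keeps the relative order inside each group,
    // which is the property the walkthrough relies on.
    std::stable_partition(v.begin(), v.end(),
                          [mask](unsigned int x) { return (x & mask) == 0; });
    return v;
}
```

Calling `partition_by_bit({7, 14, 4, 1}, 0)` returns the values in the order 14, 4, 7, 1, matching the table above.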

The first step of our radix sort is complete. The next step is to consider the next (binary) digit:

Element #    2       3       1       4
Value:       14      4       7       1
Binary:      1110    0100    0111    0001
bit 1:       1       0       1       0

Once again, we must move the elements so that the digit in question (bit 1) is arranged in ascending order:

Element #    3       4       2       1
Value:       4       1       14      7
Binary:      0100    0001    1110    0111
bit 1:       0       0       1       1

Now we must move on to the next higher digit:

Element #    3       4       2       1
Value:       4       1       14      7
Binary:      0100    0001    1110    0111
bit 2:       1       0       1       1

And move them again:

Element #    4       3       2       1
Value:       1       4       14      7
Binary:      0001    0100    1110    0111
bit 2:       0       1       1       1

Now we move on to the last (highest-order) digit:

Element #    4       3       2       1
Value:       1       4       14      7
Binary:      0001    0100    1110    0111
bit 3:       0       0       1       0

And make our final move:

Element #    4       3       1       2
Value:       1       4       7       14
Binary:      0001    0100    0111    1110
bit 3:       0       0       0       1

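The values are now sorted. This hopefully seems clear, but in the description so far we have glossed over details such as "how do we know which elements to move?" and "how do we know where to put them?" So let's repeat our example, but this time using the specific methods and sequence suggested in the assignment prompt, in order to answer these questions. Starting over with bit 0: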

Element #    1       2       3       4
Value:       7       14      4       1
Binary:      0111    1110    0100    0001
bit 0:       1       0       0       1

First let's build a histogram of the number of zero bits in the bit 0 position, and the number of one bits in the bit 0 position:

bit 0:       1       0       0       1

              zero bits       one bits
              ---------       --------
1)histogram:         2              2

Now let's do an exclusive prefix sum on these histogram values:

              zero bits       one bits
              ---------       --------
1)histogram:         2              2
2)prefix sum:        0              2
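As a minimal sequential sketch of steps 1 and 2 for a single bit (the helper names `bit_histogram` and `exclusive_scan2` are hypothetical, and a real solution would compute both in parallel on the GPU):

```cpp
#include <array>
#include <cassert>
#include <vector>

// Step 1 (hypothetical helper): count how many values have a 0 and how many
// have a 1 at position `bit`.
std::array<unsigned int, 2> bit_histogram(const std::vector<unsigned int>& vals,
                                          unsigned int bit)
{
    assert(bit < 32);
    std::array<unsigned int, 2> hist = {0, 0};
    for (unsigned int v : vals)
        ++hist[(v >> bit) & 1u];
    return hist;
}

// Step 2 (hypothetical helper): exclusive prefix sum of the two-bin histogram.
// Bin 0 starts at index 0; bin 1 starts right after all zero-bit elements.
std::array<unsigned int, 2> exclusive_scan2(const std::array<unsigned int, 2>& hist)
{
    return {0u, hist[0]};
}
```

For the values 7, 14, 4, 1 and bit 0 this gives the histogram {2, 2} and the prefix sum {0, 2}, as in the table above.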

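An exclusive prefix sum is just the sum of all preceding values. There are no preceding values in the first position, and in the second position the preceding value is 2 (the number of elements with a 0 bit in the bit 0 position). Now, as an independent operation, let's determine the relative offset of each 0 bit amongst all the zero bits, and of each 1 bit amongst all the one bits: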

bit 0:       1       0       0       1
3)offset:    0       0       1       1

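This can actually be done programmatically using exclusive prefix sums again, considering the 0-group and the 1-group separately, and treating each position as if it has a value of 1: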

0 bit 0:             1       1
3)ex. psum:          0       1

1 bit 0:     1                       1
3)ex. psum:  0                       1
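Step 3 can be sketched sequentially with one running counter per bit value, which is equivalent to the two exclusive prefix sums shown above. `relative_offsets` is a hypothetical name, and a GPU version would use a parallel scan instead of this loop:

```cpp
#include <cassert>
#include <vector>

// Step 3 (hypothetical helper): for each element, count how many earlier
// elements have the same value at position `bit`.
std::vector<unsigned int> relative_offsets(const std::vector<unsigned int>& vals,
                                           unsigned int bit)
{
    assert(bit < 32);
    unsigned int seen[2] = {0, 0};   // running count per bit value
    std::vector<unsigned int> offs;
    offs.reserve(vals.size());
    for (unsigned int v : vals) {
        const unsigned int b = (v >> bit) & 1u;
        offs.push_back(seen[b]);     // exclusive: the count before this element
        ++seen[b];
    }
    return offs;
}
```

For the values 7, 14, 4, 1 and bit 0 the result is 0, 0, 1, 1, the same offsets as in the walkthrough.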

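Now, step 4 of the given algorithm says:

4) Combine the results of steps 2 & 3 to determine the final output location for each element and move it there

What this means is that, for each element, we select the histogram-bin prefix sum value corresponding to its bit value (0 or 1) and add to it the offset associated with its position, to determine the location to move that element to: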

Element #    1       2       3       4
Value:       7       14      4       1
Binary:      0111    1110    0100    0001
bit 0:       1       0       0       1
hist psum:   2       0       0       2
offset:      0       0       1       1
new index:   2       0       1       3

Moving each element to its "new index" position, we have:

Element #    2       3       1       4
Value:       14      4       7       1
Binary:      1110    0100    0111    0001

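This is exactly the result we expect after the first digit move, based on our previous walkthrough. This completes the first pass, i.e. the first (least significant) digit; we still have the remaining digits to process, creating a new histogram and a new prefix sum at each pass.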
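Putting the four steps together for all 32 bits, a sequential reference could look like the sketch below. It is an illustration only: `radix_sort_pairs` is a hypothetical name, the loops stand in for the parallel histogram, scan, and scatter that the assignment expects, and two `std::vector` buffers play the role of the ping-pong input and output buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Sequential reference for the four steps, one bit per pass (hypothetical
// helper). `vals` and `pos` are reordered together, ascending by value.
void radix_sort_pairs(std::vector<unsigned int>& vals, std::vector<unsigned int>& pos)
{
    assert(vals.size() == pos.size());
    const std::size_t n = vals.size();
    std::vector<unsigned int> vals_out(n), pos_out(n);   // ping-pong buffers

    for (unsigned int bit = 0; bit < 32; ++bit) {
        // 1) histogram of the current bit
        std::size_t hist[2] = {0, 0};
        for (std::size_t i = 0; i < n; ++i)
            ++hist[(vals[i] >> bit) & 1u];

        // 2) exclusive prefix sum of the histogram
        const std::size_t start[2] = {0, hist[0]};

        // 3) relative offset within each group, 4) combine and scatter
        std::size_t seen[2] = {0, 0};
        for (std::size_t i = 0; i < n; ++i) {
            const unsigned int b = (vals[i] >> bit) & 1u;
            const std::size_t dst = start[b] + seen[b]++;
            vals_out[dst] = vals[i];
            pos_out[dst]  = pos[i];
        }

        // the output of this pass becomes the input of the next one
        std::swap(vals, vals_out);
        std::swap(pos, pos_out);
    }
}
```

Sorting the values 7, 14, 4, 1 with positions 0, 1, 2, 3 yields values 1, 4, 7, 14 and positions 3, 2, 0, 1.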

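Notes:

Even in computers, a radix sort does not have to be done strictly on binary digits. It is possible to construct a similar algorithm with digits of a different size, perhaps consisting of 2, 3, or 4 bits.
One of the optimizations we can perform on a radix sort is to sort only on the number of digits that are actually meaningful. For example, if we are storing quantities in 32-bit values, but we know that the largest quantity present is 1023 (2^10-1), we need not sort on all 32 bits. After proceeding through the first 10 bits we can stop and expect a proper sort.
What does any of this have to do with GPUs? As far as the above description goes, not much. The practical application is to use parallel algorithms for the histogram, the prefix sums, and the data movement. This decomposition of radix sort lets one locate and reuse parallel algorithms already developed for these more basic operations in order to construct a fast parallel sort.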
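The note about sorting only the meaningful bits could be sketched with a small helper that finds how many bit positions the largest value needs. `significant_bits` is a hypothetical name, and on the GPU the maximum would come from a parallel reduction rather than `std::max_element`:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Number of low-order bits needed to represent the largest value in `vals`
// (hypothetical helper). A radix sort only has to process this many bits.
unsigned int significant_bits(const std::vector<unsigned int>& vals)
{
    assert(!vals.empty());
    unsigned int max_val = *std::max_element(vals.begin(), vals.end());
    unsigned int bits = 0;
    while (max_val != 0) {   // shift until no set bit remains
        ++bits;
        max_val >>= 1;
    }
    return bits;
}
```

With a largest value of 1023 this returns 10, so a bitwise sort could stop after bits 0 through 9.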

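What follows is a worked example. It may help with your understanding of radix sort. I don't think it will help with your assignment, because this example performs a 32-bit radix sort at the warp level, for a single warp, i.e. for 32 quantities. But a possible advantage from the point of view of understanding is that things like the histogram and the prefix sums can be done at the warp level in just a few instructions, taking advantage of various CUDA intrinsics. For your assignment you won't be able to use these techniques, and you will need full-featured parallel prefix sums, histograms, etc. that can operate on an arbitrary data set size.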

#include <stdio.h>
#include <stdlib.h>
#define WSIZE 32
#define LOOPS 100000
#define UPPER_BIT 31
#define LOWER_BIT 0

__device__ unsigned int ddata[WSIZE];

// naive warp-level bitwise radix sort

__global__ void mykernel(){
  __shared__ volatile unsigned int sdata[WSIZE*2];
  // load from global into shared variable
  sdata[threadIdx.x] = ddata[threadIdx.x];
  __syncwarp(); // make the loads visible to all lanes (matters on cc 7.0+, needs CUDA 9+)
  unsigned int bitmask = 1<<LOWER_BIT;
  unsigned int offset  = 0;
  unsigned int thrmask = 0xFFFFFFFFU << threadIdx.x;
  unsigned int mypos;
  //  for each LSB to MSB
  for (int i = LOWER_BIT; i <= UPPER_BIT; i++){
    unsigned int mydata = sdata[((WSIZE-1)-threadIdx.x)+offset];
    unsigned int mybit  = mydata&bitmask;
    // get population of ones and zeroes
    // (__ballot_sync replaces the legacy __ballot, deprecated since CUDA 9)
    unsigned int ones = __ballot_sync(0xFFFFFFFFU, mybit);
    unsigned int zeroes = ~ones;
    offset ^= WSIZE; // switch ping-pong buffers
    // do zeroes, then ones
    if (!mybit) // threads with a zero bit
      // get my position in ping-pong buffer
      mypos = __popc(zeroes&thrmask);
    else        // threads with a one bit
      // get my position in ping-pong buffer
      mypos = __popc(zeroes)+__popc(ones&thrmask);
    // move to buffer  (or use shfl for cc 3.0)
    sdata[mypos-1+offset] = mydata;
    __syncwarp(); // all lanes finish writing before the next pass reads (cc 7.0+)
    // repeat for next bit
    bitmask <<= 1;
    }
  // save results to global
  ddata[threadIdx.x] = sdata[threadIdx.x+offset];
  }

int main(){

  unsigned int hdata[WSIZE];
  for (int lcount = 0; lcount < LOOPS; lcount++){
    unsigned int range = 1U<<UPPER_BIT;
    for (int i = 0; i < WSIZE; i++) hdata[i] = rand()%range;
    cudaMemcpyToSymbol(ddata, hdata, WSIZE*sizeof(unsigned int));
    mykernel<<<1, WSIZE>>>();
    cudaMemcpyFromSymbol(hdata, ddata, WSIZE*sizeof(unsigned int));
    for (int i = 0; i < WSIZE-1; i++)
      if (hdata[i] > hdata[i+1]) {
        // hdata holds unsigned values, so print them with %u
        printf("sort error at loop %d, hdata[%d] = %u, hdata[%d] = %u\n", lcount, i, hdata[i], i+1, hdata[i+1]);
        return 1;
      }
    // printf("sorted data:\n");
    //for (int i = 0; i < WSIZE; i++) printf("%u\n", hdata[i]);
    }
  printf("Success!\n");
  return 0;
}
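To follow the position arithmetic of the kernel without a GPU, here is a host-side sketch of one pass. It is a simplified variant, not a translation of the kernel: lane i holds element i (the kernel maps thread t to element 31-t), a plain 32-bit word stands in for the ballot result, and `popcount32` and `warp_pass` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Portable population count (stands in for the CUDA __popc intrinsic).
static unsigned int popcount32(std::uint32_t x)
{
    unsigned int n = 0;
    for (; x != 0; x &= x - 1) ++n;   // clear the lowest set bit each step
    return n;
}

// One pass over exactly 32 values (hypothetical helper): build a "ballot"
// word of the selected bit, then derive each destination from popcounts.
std::vector<std::uint32_t> warp_pass(const std::vector<std::uint32_t>& in, unsigned int bit)
{
    assert(in.size() == 32 && bit < 32);
    std::uint32_t ones = 0;                              // emulated ballot result
    for (unsigned int lane = 0; lane < 32; ++lane)
        if ((in[lane] >> bit) & 1u) ones |= (1u << lane);
    const std::uint32_t zeroes = ~ones;

    std::vector<std::uint32_t> out(32);
    for (unsigned int lane = 0; lane < 32; ++lane) {
        const std::uint32_t below = (1u << lane) - 1u;   // lanes before this one
        const bool one = ((ones >> lane) & 1u) != 0;
        const unsigned int dst = one
            ? popcount32(zeroes) + popcount32(ones & below)   // after all zeroes
            : popcount32(zeroes & below);                     // among the zeroes
        out[dst] = in[lane];
    }
    return out;
}
```

Here `popcount32(zeroes)` plays the role of the histogram prefix sum for the one bits, and the masked popcounts play the role of the relative offsets.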

Answer 2

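The method provided by @Robert Crovella is absolutely correct and very helpful. It differs slightly from the process they explain in the Udacity videos. In this answer I'll record one iteration of their method (watchable here), starting from Robert Crovella's example: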

Element #    1       2       3       4
Value:       7       14      4       1
Binary:      0111    1110    0100    0001
LSB:         1       0       0       1

Predicate:   0     __1__   __1__     0
Pred. Scan:  0     __0__   __1__     2

Number of ones in predicate: 2

!Predicate:__1__     0       0     __1__
!Pred. Scan: 0       1       1       1

Offset for !Pred. Scan = Number of ones in predicate = 2

!Pred. Scan + Offset:
           __2__     3       3     __3__

Final indexes to move values after 1 iteration (on LSB):
             2       0       1       3

Values after 1 iteration (on LSB):
             14      4       7       1

I put emphasis (__ __) on the values that indicate or contain the index to move the value to.

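Terms (from the Udacity video):

LSB = least significant bit
Predicate (for the LSB): (x & 1) == 0
- for the next significant bit: (x & 2) == 0
- for the one after that: (x & 4) == 0
- and so on, shifting left further (<<)
Pred. Scan = predicate scan = exclusive prefix sum of the predicate
!Pred. = the predicate with its bits flipped (0->1 and 1->0)
Number of ones in predicate
- note that this is not necessarily the last entry in the scan; you can instead get this value (the sum/reduction of the predicate) as an intermediate value of the Blelloch scan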
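The terms above can be sketched sequentially as follows. The names `predicate` and `exclusive_scan` are hypothetical, and the simple loop stands in for the Blelloch scan used in the course:

```cpp
#include <cassert>
#include <vector>

// Predicate for bit k (hypothetical helper): 1 where the bit is 0, else 0.
std::vector<unsigned int> predicate(const std::vector<unsigned int>& vals, unsigned int k)
{
    assert(k < 32);
    std::vector<unsigned int> pred;
    pred.reserve(vals.size());
    for (unsigned int x : vals)
        pred.push_back((x & (1u << k)) == 0 ? 1u : 0u);
    return pred;
}

// Exclusive prefix sum (hypothetical helper). The total of all inputs, which
// is the "number of ones in predicate", is returned through `sum`.
std::vector<unsigned int> exclusive_scan(const std::vector<unsigned int>& in, unsigned int& sum)
{
    std::vector<unsigned int> out;
    out.reserve(in.size());
    sum = 0;
    for (unsigned int x : in) {
        out.push_back(sum);   // everything strictly before this element
        sum += x;
    }
    return out;
}
```

For the values 7, 14, 4, 1 and k = 0, the predicate is 0, 1, 1, 0, its scan is 0, 0, 1, 2, and the sum is 2. The sum equals the last scan entry plus the last predicate value, which is why it is not always the last entry of the scan.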

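A summary of the above:

Get the predicate of the list (for the bit in question, starting from the LSB)
Scan the predicate, and record the sum of the predicate in the process
- Blelloch Scan
- Note that your predicate will be of arbitrary size, so read the section on Blelloch Scan about arrays of arbitrary rather than 2^n size
Flip the bits of the predicate, and scan that
Move the values in the array with the following rule:
- For the ith element in the array:
  - If the ith predicate is TRUE, move the ith value to the index in the ith element of the predicate scan
  - Else, move the ith value to the index in the ith element of the !predicate scan plus the sum of the predicate
Move to the next significant bit (NSB)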
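One iteration of the summary above could look like this on the CPU. It is a sketch under the same assumptions as before: `split_by_bit` is a hypothetical name and the running sums stand in for the two parallel scans.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One iteration of the predicate/scan method for bit k (hypothetical helper).
std::vector<unsigned int> split_by_bit(const std::vector<unsigned int>& vals, unsigned int k)
{
    assert(k < 32);
    const std::size_t n = vals.size();
    std::vector<unsigned int> pred(n), pred_scan(n), npred_scan(n);

    unsigned int pred_sum = 0, npred_sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        pred[i] = ((vals[i] >> k) & 1u) == 0 ? 1u : 0u;
        pred_scan[i]  = pred_sum;    // exclusive scan of the predicate
        npred_scan[i] = npred_sum;   // exclusive scan of the flipped predicate
        pred_sum  += pred[i];
        npred_sum += 1u - pred[i];
    }

    std::vector<unsigned int> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        // TRUE predicate: index from the predicate scan.
        // Otherwise: index from the !predicate scan plus the predicate sum.
        const std::size_t dst = pred[i] ? pred_scan[i] : npred_scan[i] + pred_sum;
        out[dst] = vals[i];
    }
    return out;
}
```

Applying it to 7, 14, 4, 1 with k = 0 gives 14, 4, 7, 1, and repeating it for k = 1, 2, 3 leaves the values sorted.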

For reference, you can consult my solution for this HW assignment in CUDA.

Parallel radix sort, how would this implementation actually work? Are there some heuristics?
