Question

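I am working on image compression.

The image I is broken up into K code blocks {Bi}.

Each block has a fixed size of MxN pixels.

Each block is compressed independently.

All compressed blocks {Ci}, with compressed sizes {Pi}, are stored in a linear buffer B of size K * M, where M is a fixed size larger than every size Pi.

Now I would like to pack buffer B into a buffer C and get rid of the empty space at the end of each compressed code block Ci.

So, I need a kernel that will:

For each block Ci, find the sum of all Pk for k < i (call this offset_i)
Copy the data of each Ci from B into C at offset_i, with size Pi

Any ideas on how to do this would be greatly appreciated!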

Answer 1

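Here is a code snippet that (I guess) performs stream compaction. It contains a lot of arithmetic, but it can be parallelized to the degree you need.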

#include <time.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct Block {
    int size;
    int buf[8];
} Block;

typedef struct BlockPos {
    int t_size; //Temporary size for compaction
    int f_size; //Actual size
    int pos;    //Position
} BlockPos;

int main()
{
    enum { num_blocks = 16 }; // compile-time constant, so the arrays below are not VLAs
    Block blocks[num_blocks];
    BlockPos pos[num_blocks];

    srand(time(NULL));
    for (int i = 0; i < num_blocks; i++) {
        //Every block has non-zero length, that's easier
        blocks[i].size = rand() % 7 + 1;

        printf("Block %d len %d:\t", i, blocks[i].size);
        for(int j=0; j<blocks[i].size; j++){
            //Just to make print easier
            blocks[i].buf[j] = rand() % 33;
            printf("%d, ", blocks[i].buf[j]);
        }
        printf("\n");
    }

    for(int i=0; i<num_blocks; i++){
        pos[i].f_size = blocks[i].size;
        pos[i].t_size = pos[i].f_size;
        pos[i].pos = 0;
    }

    int step = 2;
    /* At every step the number of groups being merged is halved.
     * This loop can't be done in parallel. */
    for (int count = 1; count < num_blocks; count *= 2) {

        /* Each group starting at index i is appended to the group that
         * starts count blocks to its left.
         * This loop can be done in parallel. */
        for (int i = count; i < num_blocks; i += step) {
            pos[i].pos = pos[i - count].pos + pos[i - count].t_size;
            pos[i - count].t_size += pos[i].t_size;

            // Re-place the remaining blocks of the group that was just moved
            for (int j = i+1; count > 1 && j < i+count && j < num_blocks; j++) {
                pos[j].pos = pos[j-1].pos + pos[j-1].f_size;
            }
        }
        }
        step *= 2;
    }

    printf("\nPos,\tLen:\n");
    for(int i=0; i<num_blocks; i++){
        printf("%d,\t%d\n", pos[i].pos, pos[i].f_size);
    }

    printf("\n");
    return 0;
}

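The inner loop (line 54) can be implemented as an OpenCL kernel, as long as the number of elements being processed is large enough. After that, you will have an array of structures in which each element tells you where to place the corresponding compacted block. The placement itself can then be done in parallel.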
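The final placement step can be sketched as follows. This is only an illustration, assuming the blocks sit in fixed-size slots of a byte buffer; the names `copy_to_packed`, `padded`, `slot` and `packed` are hypothetical and do not appear in the snippet above, while `pos` and `len` play the role of `pos[i].pos` and `pos[i].f_size`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy every block from its fixed-size slot to its packed position.
 * Assumed layout: block i occupies padded[i*slot .. i*slot + len[i]).
 * pos[i] is the destination offset computed by the compaction step. */
void copy_to_packed(const unsigned char *padded, size_t slot,
                    const size_t *pos, const size_t *len,
                    size_t num_blocks, unsigned char *packed)
{
    /* The destination ranges do not overlap when pos[] is a prefix sum
     * of len[], so each iteration could be its own work-item or work-group. */
    for (size_t i = 0; i < num_blocks; i++) {
        assert(len[i] <= slot); /* a block never exceeds its slot */
        memcpy(packed + pos[i], padded + i * slot, len[i]);
    }
}
```

On a GPU the loop body would become the kernel body, with `i` taken from the work-group id.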

Answer 2

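I understand your problem as follows: you have a set of compressed buffers, each with a different length.

In the end you want one simple megabuffer without gaps. Why not simply store all the buffers in one block like this:
- first, write the number of buffers N as a long value
- second, store an array of N long values holding the size of each buffer
- finally, write your N buffers

I do not see why you need a kernel for this.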
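A minimal host-side sketch of that layout, assuming the buffers are already in host memory. The function name `write_megabuffer` and its parameters are made up for illustration, and the header uses the platform's native `long`, so the result is not portable across machines.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Serialize as: [n][size_0 .. size_{n-1}][payload_0 .. payload_{n-1}].
 * `out` must hold at least (1 + n) * sizeof(long) + sum(sizes) bytes.
 * Returns the number of bytes written. */
size_t write_megabuffer(const unsigned char *const *bufs, const long *sizes,
                        long n, unsigned char *out)
{
    size_t off = 0;
    assert(n >= 0);

    /* Header: buffer count, then the table of sizes. */
    memcpy(out + off, &n, sizeof n);
    off += sizeof n;
    if (n > 0)
        memcpy(out + off, sizes, (size_t)n * sizeof sizes[0]);
    off += (size_t)n * sizeof sizes[0];

    /* Payloads, back to back with no padding. */
    for (long i = 0; i < n; i++) {
        assert(sizes[i] >= 0);
        memcpy(out + off, bufs[i], (size_t)sizes[i]);
        off += (size_t)sizes[i];
    }
    return off;
}
```

A reader recovers the start of buffer i by summing the first i sizes from the table, which is the same offset computation the question asks for.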

Answer 3

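You need access to the sizes Pi. I would use a temporary buffer whose length is the total number of blocks. When you compress a block, store the length of the compressed block in this temporary buffer. Your last kernel can then use this temporary buffer to compute the address at which it must write into the final buffer. For performance reasons, you can copy this temporary buffer into local memory (in the last kernel).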
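One way to read this, as a rough sketch: each work-item (or work-group) in the last kernel sums the sizes of all preceding blocks to find its own write address. The helper `block_offset` and its parameter names are hypothetical, and the code is plain C standing in for kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of block `block` in the packed buffer:
 * the sum of sizes[k] for all k < block. */
size_t block_offset(const size_t *sizes, size_t num_blocks, size_t block)
{
    size_t offset = 0;
    assert(block < num_blocks);
    for (size_t k = 0; k < block; k++)
        offset += sizes[k];
    return offset;
}
```

This costs O(K) additions per block, which is why keeping the sizes in local memory helps; for a large K, a dedicated scan pass avoids the repeated work.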

Answer 4

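So, it turns out I need to write a stream compaction algorithm.

This will require two kernels:

Kernel 1: an all-prefix-sums algorithm (also known as a scan) to compute the buffer offsets (http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html)

This library, https://github.com/boxerab/clpp, has a scan algorithm written in OpenCL, which is my target GPGPU language.

Kernel 2: each work group does coalesced reads from the input buffer and writes to the output buffer, using the offsets computed in kernel 1.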
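To make the scan in kernel 1 concrete, here is a serial C emulation of a work-efficient exclusive scan (up-sweep followed by down-sweep). It is a sketch under assumptions, not the clpp implementation: the function name `exclusive_scan_pow2` is made up, the input length must be a power of two, and each inner loop stands in for one parallel step.

```c
#include <assert.h>
#include <stddef.h>

/* In-place exclusive prefix sum; n must be a power of two.
 * On return, data[i] holds the sum of the original data[0..i-1]. */
void exclusive_scan_pow2(size_t *data, size_t n)
{
    assert(n > 0 && (n & (n - 1)) == 0);

    /* Up-sweep: build partial sums up a binary tree.
     * Iterations of the inner loop are independent of each other. */
    for (size_t stride = 1; stride < n; stride *= 2) {
        for (size_t i = 2 * stride - 1; i < n; i += 2 * stride)
            data[i] += data[i - stride];
    }

    /* Down-sweep: clear the root, then push prefixes back down. */
    data[n - 1] = 0;
    for (size_t stride = n / 2; stride >= 1; stride /= 2) {
        for (size_t i = 2 * stride - 1; i < n; i += 2 * stride) {
            size_t left = data[i - stride];
            data[i - stride] = data[i]; /* left child gets the parent's prefix */
            data[i] += left;            /* right child adds the left subtree sum */
        }
    }
}
```

Running it on the sizes {3, 1, 4, 2} yields the offsets {0, 3, 4, 8}, which kernel 2 would then use as write positions in the output buffer.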

Fast algorithm to compact a buffer
