Question

I wrote a simple benchmark to test and measure the single-precision fused multiply-add (FMA) performance of CPUs and OpenCL devices.

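I recently added SMP support using pthreads. The CPU side is simple: it generates several random matrices as input so that the compiler cannot optimize the work away.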

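The function cpu_result_matrix() creates the threads and blocks until each of them returns, using pthread_join(). The time spent in this function is what determines the measured performance of the device.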

static float *cpu_result_matrix(struct bench_buf *in)
{
        const unsigned tc = nthreads();
        struct cpu_res_arg targ[tc];

        float *res = aligned_alloc(16, BUFFER_SIZE * sizeof(float));

        for (unsigned i = 0; i < tc; i++) {
                targ[i].tid = i;
                targ[i].tc = tc;
                targ[i].in = in;
                targ[i].ret = res;
        }

        pthread_t cpu_res_t[tc];

        for (unsigned i = 0; i < tc; i++)
                pthread_create(&cpu_res_t[i], NULL,
                               cpu_result_matrix_mt, (void *)&targ[i]);

        for (unsigned i = 0; i < tc; i++)
                pthread_join(cpu_res_t[i], NULL);

        return res;
}

The actual kernel is in cpu_result_matrix_mt():

static void *cpu_result_matrix_mt(void *v_arg)
{
        struct cpu_res_arg *arg = (struct cpu_res_arg *)v_arg;

        const unsigned buff_size = BUFFER_SIZE;
        const unsigned work_size = buff_size / arg->tc;
        const unsigned work_start = arg->tid * work_size;
        const unsigned work_end = work_start + work_size;

        const unsigned round_cnt = ROUNDS_PER_ITERATION;

        float lres;

        for (unsigned i = work_start; i < work_end; i++) {

                lres = 0;
                float a = arg->in->a[i], b = arg->in->b[i], c = arg->in->c[i];

                for (unsigned j = 0; j < round_cnt; j++) {
                        lres += a * ((b * c) + b);
                        lres += b * ((c * a) + c);
                        lres += c * ((a * b) + a);
                }

                arg->ret[i] = lres;
        }

        return NULL;
}

I noticed that no matter how many times I unrolled the inner loop, the time reported for the kernel stayed roughly the same.

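To investigate, I made the kernel larger by manually unrolling the inner loop until I could easily measure the wall-clock time the program took to run.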
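One way to cross-check the reported time is to read both a monotonic wall clock and the process CPU clock around the create/join region. The sketch below is only an illustration, assuming a POSIX system with clock_gettime(); wall_seconds() and cpu_seconds() are hypothetical helper names and are not part of the benchmark.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Hypothetical helper: elapsed real time in seconds, read from the
 * monotonic clock (not affected by wall-clock adjustments). */
static double wall_seconds(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical helper: CPU time in seconds consumed by the whole
 * process, summed over all of its threads. */
static double cpu_seconds(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
        return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}
```

Sampling both clocks immediately before the pthread_create() loop and immediately after the pthread_join() loop gives two intervals to compare. With N threads kept busy, the CPU-time interval should come out at roughly N times the wall-time interval, so the two readings are not interchangeable.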

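In the process, I observed that the threads (seemingly) return before the kernel has finished the work it is supposed to do. This causes pthread_join() to stop blocking the main thread, and the execution time appears much lower than it really is. I don't understand how this is possible, or how the program can keep running under these conditions and still output correct results.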

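htop shows that the threads are still very much alive and working. I checked the return value of pthread_join(), and it succeeded after every run. Out of curiosity, I added a print statement at the end of the kernel, just before the return statement, and sure enough, each thread printed that it had finished much sooner than it should have.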
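The checks described above can be sketched as follows. This is a minimal stand-alone illustration rather than the benchmark's actual code: struct demo_arg, demo_worker() and run_checked() are hypothetical names, and the worker only counts the iterations it completed so the main thread can verify, after pthread_join(), that each thread covered its full range.

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical per-thread argument: the range to process and a counter
 * the worker fills in with the number of iterations it completed. */
struct demo_arg {
        unsigned start, end;
        unsigned done;
};

static void *demo_worker(void *v_arg)
{
        struct demo_arg *arg = v_arg;

        for (unsigned i = arg->start; i < arg->end; i++)
                arg->done++;

        return NULL;
}

/* Returns 0 only if every thread was created, joined, and reported
 * having processed its whole range; returns -1 otherwise. */
static int run_checked(unsigned tc, unsigned total)
{
        if (tc == 0)
                return -1;

        pthread_t thr[tc];
        struct demo_arg arg[tc];
        const unsigned chunk = total / tc;
        unsigned created = 0;
        int status = 0;

        for (unsigned i = 0; i < tc; i++) {
                arg[i].start = i * chunk;
                /* The last thread also takes the remainder. */
                arg[i].end = (i == tc - 1) ? total : (i + 1) * chunk;
                arg[i].done = 0;

                /* pthread functions return an error number, not -1/errno. */
                int err = pthread_create(&thr[i], NULL, demo_worker, &arg[i]);
                if (err != 0) {
                        fprintf(stderr, "pthread_create: %s\n", strerror(err));
                        status = -1;
                        break;
                }
                created++;
        }

        /* Join only the threads that were actually created. */
        for (unsigned i = 0; i < created; i++) {
                int err = pthread_join(thr[i], NULL);
                if (err != 0) {
                        fprintf(stderr, "pthread_join: %s\n", strerror(err));
                        status = -1;
                        continue;
                }
                if (arg[i].done != arg[i].end - arg[i].start) {
                        fprintf(stderr, "thread %u finished early\n", i);
                        status = -1;
                }
        }

        return status;
}
```

Reading arg[i].done only after the corresponding pthread_join() has returned avoids a data race, since a successful join guarantees that the target thread has terminated.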

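I watched ps while the program was running, and it showed one thread, then three more, then five more, and then it dropped to four.

I'm confused; I have never seen threads behave like this before.

The complete source for my modified test branch is here: https://github.com/jakogut/clperf/tree/test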

Answer 1

OpenMP seems to be a better solution here. For problems that can exploit data parallelism, it requires far less setup and complexity.

static float *cpu_result_matrix(struct bench_buf *in)
{
        float *res = aligned_alloc(16, BUFFER_SIZE * sizeof(float));

        #pragma omp parallel for
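        for (unsigned i = 0; i < BUFFER_SIZE; i++) {

                float a = in->a[i], b = in->b[i], c = in->c[i];

                /* aligned_alloc() does not zero the buffer, so accumulate
                 * in a local variable and store the result once. */
                float lres = 0;

                for (unsigned j = 0; j < ROUNDS_PER_ITERATION; j++) {
                        lres += a * ((b * c) + b);
                        lres += b * ((c * a) + c);
                        lres += c * ((a * b) + a);
                }

                res[i] = lres;
        }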

        return res;
}

However, this does not explain why pthreads behaves the way described in the question.

Pthreads returning before the loop completes, work seemingly continues in the background