What am I doing with SIMD and pthreads that is slowing my program down?

Time: 2016-05-31 09:43:56

Tags: c multithreading pthreads simd avx

!!! HOMEWORK - ASSIGNMENT !!!

Please do not post code, because I want to finish it myself. But if possible, please point me in the right direction with general information, point out a mistake in my thinking, or suggest other resources that might be helpful and relevant.

I have a method that creates a square double matrix hat of npages * npages for use in my pagerank algorithm.

I have made it with pthreads, with SIMD, and with both pthreads and SIMD. I profiled them with the Xcode Instruments time profiler and found that the pthreads-only version was the fastest, the SIMD-only version came next, and the slowest was the version with both SIMD and pthreads.

Because it is homework, it will be run on several different machines, but we were given the header #include, so we assume we can at least use AVX. We are given the number of threads the program will use as a program argument, and it is stored in the global variable g_nthreads.

In my tests I have always run it on my machine, which is an IvyBridge with 4 hardware cores and 8 logical cores, and I have tested it with 4 threads as the argument and with 8 threads as the argument.

Running times:

SIMD only:

* 331ms - for the construct_matrix_hat function

Pthreads only (8 threads):

* 70ms - for each thread concurrently

SIMD & pthreads (8 threads):

* 110ms - for each thread concurrently

What am I doing that slows it down when I use both forms of optimisation?

I will post each implementation:

All versions share these macros:

#define BIG_CHUNK    (g_n2 / g_nthreads)
#define SMALL_CHUNK  (g_npages / g_nthreads)
#define MOD          (BIG_CHUNK - (BIG_CHUNK % 4))
#define IDX(a, b)    (((a) * g_npages) + (b))

Pthreads:

// struct used for passing arguments
typedef struct {
    double* restrict m;
    double* restrict m_hat;
    int t_id;
    char padding[44];
} t_arg_matrix_hat;

// Construct matrix_hat with pthreads
static void* pthread_construct_matrix_hat(void* arg) {
    t_arg_matrix_hat* t_arg = (t_arg_matrix_hat*) arg;
    // set coordinate limits thread is able to act upon
    size_t start = t_arg->t_id * BIG_CHUNK;
    size_t end = t_arg->t_id + 1 != g_nthreads ? (t_arg->t_id + 1) * BIG_CHUNK : g_n2;

    // Initialise coordinates with given uniform value
    for (size_t i = start; i < end; i++) {
        t_arg->m_hat[i] = ((g_dampener * t_arg->m[i]) + HAT);
    }

    return NULL;
}

// Construct matrix_hat
double* construct_matrix_hat(double* matrix) {
    double* matrix_hat = malloc(sizeof(double) * g_n2);

    // create structs to send and retrieve matrix and value from threads
    t_arg_matrix_hat t_args[g_nthreads];
    for (size_t i = 0; i < g_nthreads; i++) {
        t_args[i] = (t_arg_matrix_hat) {
            .m = matrix,
            .m_hat = matrix_hat,
            .t_id = i
        };
    }
    // create threads and send structs with matrix and value to divide the matrix and
    // initialise the coordinates with the given value
    pthread_t threads[g_nthreads];
    for (size_t i = 0; i < g_nthreads; i++) {
        pthread_create(threads + i, NULL, pthread_construct_matrix_hat, t_args + i);
    }
    // join threads after all coordinates have been initialised
    for (size_t i = 0; i < g_nthreads; i++) {
        pthread_join(threads[i], NULL);
    }

    return matrix_hat;
}

SIMD:

// Construct matrix_hat
double* construct_matrix_hat(double* matrix) {
    double* matrix_hat = malloc(sizeof(double) * g_n2);

    double dampeners[4] = {g_dampener, g_dampener, g_dampener, g_dampener};
    __m256d b = _mm256_loadu_pd(dampeners);
    // Use SIMD to multiply each value by the dampener
    for (size_t i = 0; i < g_mod; i += 4) {
        __m256d a = _mm256_loadu_pd(matrix + i);
        __m256d res = _mm256_mul_pd(a, b);
        _mm256_storeu_pd(&matrix_hat[i], res);
    }
    // Scalar cleanup: multiply the values that weren't covered by the SIMD loop
    for (size_t i = g_mod; i < g_n2; i++) {
        matrix_hat[i] = g_dampener * matrix[i];
    }
    double hats[4] = {HAT, HAT, HAT, HAT};
    b = _mm256_loadu_pd(hats);
    // Use SIMD to add HAT to each value
    for (size_t i = 0; i < g_mod; i += 4) {
        __m256d a = _mm256_loadu_pd(matrix_hat + i);
        __m256d res = _mm256_add_pd(a, b);
        _mm256_storeu_pd(&matrix_hat[i], res);
    }
    // Scalar cleanup: add HAT to the values that weren't covered by the SIMD loop
    for (size_t i = g_mod; i < g_n2; i++) {
        matrix_hat[i] += HAT;
    }

    return matrix_hat;
}

Pthreads & SIMD:

// struct used for passing arguments
typedef struct {
    double* restrict m;
    double* restrict m_hat;
    int t_id;
    char padding[44];
} t_arg_matrix_hat;

// Construct matrix_hat with pthreads
static void* pthread_construct_matrix_hat(void* arg) {
    t_arg_matrix_hat* t_arg = (t_arg_matrix_hat*) arg;
    // set coordinate limits thread is able to act upon
    size_t start = t_arg->t_id * BIG_CHUNK;
    size_t end = t_arg->t_id + 1 != g_nthreads ? (t_arg->t_id + 1) * BIG_CHUNK : g_n2;
    size_t leftovers = start + MOD;

    // dampeners is a global double[4] filled with g_dampener (not shown in the post)
    __m256d b1 = _mm256_loadu_pd(dampeners);
    // SIMD: multiply this thread's chunk by the dampener
    for (size_t i = start; i < leftovers; i += 4) {
        __m256d a1 = _mm256_loadu_pd(t_arg->m + i);
        __m256d r1 = _mm256_mul_pd(a1, b1);
        _mm256_storeu_pd(&t_arg->m_hat[i], r1);
    }
    // Scalar cleanup: multiply the leftover values in this thread's chunk
    for (size_t i = leftovers; i < end; i++) {
        t_arg->m_hat[i] = dampeners[0] * t_arg->m[i];
    }

    // hats is a global double[4] filled with HAT (not shown in the post)
    __m256d b2 = _mm256_loadu_pd(hats);
    // SIMD: add HAT to this thread's chunk
    for (size_t i = start; i < leftovers; i += 4) {
        __m256d a2 = _mm256_loadu_pd(t_arg->m_hat + i);
        __m256d r2 = _mm256_add_pd(a2, b2);
        _mm256_storeu_pd(&t_arg->m_hat[i], r2);
    }
    // Scalar cleanup: add HAT to the leftover values in this thread's chunk
    for (size_t i = leftovers; i < end; i++) {
        t_arg->m_hat[i] += hats[0];
    }

    return NULL;
}

// Construct matrix_hat
double* construct_matrix_hat(double* matrix) {
    double* matrix_hat = malloc(sizeof(double) * g_n2);

    // create structs to send and retrieve matrix and value from threads
    t_arg_matrix_hat t_args[g_nthreads];
    for (size_t i = 0; i < g_nthreads; i++) {
        t_args[i] = (t_arg_matrix_hat) {
            .m = matrix,
            .m_hat = matrix_hat,
            .t_id = i
        };
    }
    // create threads and send structs with matrix and value to divide the matrix and
    // initialise the coordinates with the given value
    pthread_t threads[g_nthreads];
    for (size_t i = 0; i < g_nthreads; i++) {
        pthread_create(threads + i, NULL, pthread_construct_matrix_hat, t_args + i);
    }
    // join threads after all coordinates have been initialised
    for (size_t i = 0; i < g_nthreads; i++) {
        pthread_join(threads[i], NULL);
    }

    return matrix_hat;
}

1 Answer:

Answer 0 (score: 2)

I think it's because your SIMD code is horribly inefficient: it loops over the memory twice, instead of doing the add together with the multiply before storing. You didn't test SIMD against a scalar baseline, but if you had, you would probably find that your SIMD code wasn't a speedup with a single thread either.

Stop reading here if you want to solve the rest of the homework yourself.

If compiled with gcc -O3 -march=ivybridge, the simple scalar loop in the pthread version probably auto-vectorizes into something similar to what you did with intrinsics. You even used restrict, so the compiler can prove that the pointers don't overlap each other, or the global g_dampener.

// this probably autovectorizes well.
// Initialise coordinates with given uniform value
for (size_t i = start; i < end; i++) {
    t_arg->m_hat[i] = ((g_dampener * t_arg->m[i]) + HAT);
}

// but this would be even safer to help the compiler's aliasing analysis:
double dampener = g_dampener;  // in case the compiler thinks one of the pointers might point at the global
double* restrict hat = t_arg->m_hat;
const double* restrict mat = t_arg->m;
// ... same loop but using these locals instead of the t_arg-> members

That probably doesn't matter for this FP loop, since a double definitely can't alias with a double *.

The coding style is also pretty bad. You should give your double * variables meaningful names whenever possible.

Also, you use __m256d, but malloc doesn't guarantee that matrix_hat will be aligned to a 32B boundary. C11's aligned_alloc is probably the nicest way, vs. posix_memalign (clunky interface), _mm_malloc (must be freed with _mm_free, not free(3)), or other options.
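A minimal sketch of the aligned_alloc route (C11); the rounding step is there because C11 requires the requested size to be a multiple of the alignment, and the helper name is made up for illustration:

```c
#include <stdlib.h>

// Allocate n doubles aligned to 32 bytes, suitable for _mm256_load_pd/_mm256_store_pd.
// C11 requires size to be a multiple of the alignment, so round it up first.
static double *alloc_matrix(size_t n) {
    size_t bytes = n * sizeof(double);
    bytes = (bytes + 31) & ~(size_t)31;  // round up to a multiple of 32
    return aligned_alloc(32, bytes);      // release with plain free(), unlike _mm_malloc
}
```

With the pointer guaranteed 32B-aligned, the unaligned loadu/storeu intrinsics can be swapped for the aligned variants.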

This version compiles (after defining a couple of globals) to the asm we expect. BTW, normal people pass sizes as function arguments. That's another way of avoiding optimization failures due to C aliasing rules.

Anyway, your best bet is to let OpenMP auto-vectorize it, because then you don't have to write a cleanup loop yourself. There's nothing tricky about the data organisation, so it vectorizes trivially. (And it's not a reduction, like in your other question, so there's no loop-carried dependency or order-of-operations problem.)
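A sketch of what the OpenMP route could look like, with a hypothetical HAT constant and the size passed as a parameter; the pragma simply degrades to a plain serial loop if the code isn't built with -fopenmp:

```c
#include <stddef.h>

#define HAT 0.15  // hypothetical stand-in for the assignment's constant

// Let OpenMP parallelize and the compiler vectorize; no hand-written
// SIMD or cleanup loop needed.
static void construct_hat_omp(double *restrict m_hat, const double *restrict m,
                              size_t n, double dampener) {
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; i++)
        m_hat[i] = dampener * m[i] + HAT;
}
```

This keeps one fused pass over memory and leaves the chunking, remainder handling, and thread management to the compiler and runtime.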