Question

I have a small C program that approximates pi with a Monte Carlo simulation: it repeatedly tests whether a random point [x, y] falls inside or outside a circle.

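To approximate pi I need a large number of samples n, and the run time grows in direct proportion to it, O(n). To handle a very large n, I used the POSIX threads API to spread the computation across cores.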
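For reference, here is the estimator this setup relies on. It is a standard result that the post does not spell out. With x and y drawn uniformly from [0, 1], the probability of landing inside the quarter circle of radius 1 equals its area, pi/4, so the hit ratio gives:

```latex
\frac{n_{\text{hits}}}{n} \approx \frac{\pi}{4}
\quad\Longrightarrow\quad
\pi \approx \frac{4\, n_{\text{hits}}}{n}
```

The statistical error of this estimate shrinks only like 1/sqrt(n), which is why n has to be so large.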

My code looks like this:

pthread_t worker[nthreads]; /* creates workers for each thread */
struct param aparam[nthreads]; /* struct param{ long* hits; long rounds; }; */
long nrounds = nsamples / nthreads; /* divide samples to subsets of equal rounds per thread */

for (int i = 0; i < nthreads; ++i) { /* loop to create threads */
    aparam[i].hits = 0;
    aparam[i].rounds = nrounds;
    pthread_create(&worker[i], NULL, calc_pi, &aparam[i]); /* calls calc_pi(void* vparam){}  */ 
}

long nhits = 0;
for (int j = 0; j < nthreads; ++j) { /* collects results */
    pthread_join(worker[j], NULL);
    nhits += (long)aparam[j].hits; /* counts hits inside the circle */
}

This is what each thread does:

void* calc_pi(void* vparam)
{ /* counts hits inside a circle */
    struct param *iparam;
    iparam = (struct param *) vparam;
    long hits = 0;
    float x, y, z;
    for (long i = 0; i < iparam->rounds; ++i) {
        x = (float)rand()/RAND_MAX;
        y = (float)rand()/RAND_MAX;
        z = x * x + y * y;
        if (z <= 1.f) /* circle radius of 1 */
            ++hits;
    }
    iparam->hits = (long*)hits;
    return NULL;
}

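Now I have a strange observation. With the same number of samples n and an increasing number of threads i, the program takes more time instead of less.

Here are some average run times (reproducible):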

-------------------------------------------------
| Threads[1] | Samples[1] | Rounds[1] | Time[s] |
-------------------------------------------------
|        32  |  268435456 |   8388608 |    118  |
|        16  |  268435456 |  16777216 |    106  |
|         8  |  268435456 |  33554432 |    125  |
|         4  |  268435456 |  67108864 |    152  |
|         2  |  268435456 | 134217728 |     36  |
|         1  |  268435456 | 268435456 |     15  |
-------------------------------------------------

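Why do, for example, two threads take more than twice as long as a single thread to do the same work? My assumption was that two threads dividing the work should cut the time roughly in half.

Compiled with GCC 4.9.1 and the following flags: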

gcc -O2 -std=gnu11 -pthread pipa.c -lpthread -o pipa

My hardware is a dual Intel Xeon E5520 (2 processors, 4 cores each) @ 2.26 GHz, hyperthreading disabled, running Scientific Linux with a 2.6.18 kernel.

Any ideas?

Answer 1

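The most expensive operation your threads perform is the call to rand(). rand() is a naive, simplistic function that generally does not scale across threads, because it keeps shared hidden state in order to guarantee that the same seed produces the same sequence of random numbers. I think the lock inside rand() is serializing all the threads. (*)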

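A simple way to confirm whether that is the problem is to start the program under a debugger and then, several times over: pause it, capture the stack traces of the threads, and continue. Whatever shows up most often in the stack traces is very likely the bottleneck.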

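(*) What makes it even slower than the single-threaded run is that lock contention carries an additional performance penalty of its own. On top of that, many threads add the overhead of scheduling and context switches.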
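If the lock in rand() is indeed the culprit, one possible workaround is to give every thread its own generator state. The sketch below assumes POSIX rand_r() is available. It is marked obsolescent in POSIX.1-2008 and is not a high-quality generator, so treat it as an illustration only. The names struct task, count_hits, and estimate_pi, as well as the seed values, are made up for this example and are not from the question.

```c
#define _POSIX_C_SOURCE 200809L /* exposes rand_r() under strict -std=c11 */
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical per-thread parameters; hits is a plain long here,
 * not the long* used in the question. */
struct task {
    long rounds;       /* samples this thread draws */
    long hits;         /* result: points inside the unit circle */
    unsigned int seed; /* private rand_r() state, no shared lock */
};

static void *count_hits(void *arg)
{
    struct task *t = arg;
    unsigned int seed = t->seed; /* local copy, touched by this thread only */
    long hits = 0;
    for (long i = 0; i < t->rounds; ++i) {
        float x = (float)rand_r(&seed) / RAND_MAX;
        float y = (float)rand_r(&seed) / RAND_MAX;
        if (x * x + y * y <= 1.f)
            ++hits;
    }
    t->hits = hits;
    return NULL;
}

/* Returns the pi estimate, or -1.0 on invalid input or thread creation failure. */
double estimate_pi(long nsamples, int nthreads)
{
    if (nthreads <= 0 || nsamples < nthreads)
        return -1.0;

    pthread_t worker[nthreads];
    struct task task[nthreads];
    long nrounds = nsamples / nthreads; /* equal share per thread */
    int started = 0;

    for (int i = 0; i < nthreads; ++i) {
        task[i].rounds = nrounds;
        task[i].hits = 0;
        task[i].seed = 12345u + (unsigned int)i; /* arbitrary, distinct per thread */
        if (pthread_create(&worker[i], NULL, count_hits, &task[i]) != 0)
            break;
        ++started;
    }

    long nhits = 0;
    for (int j = 0; j < started; ++j) { /* join only the threads that started */
        pthread_join(worker[j], NULL);
        nhits += task[j].hits;
    }
    if (started != nthreads)
        return -1.0;

    return 4.0 * (double)nhits / (double)(nrounds * nthreads);
}
```

Built with the same -pthread flag as in the question, a call such as estimate_pi(268435456, 8) from main lets each thread advance only its own seed, so the threads should no longer queue up behind a shared lock. Whether the run time then scales with the thread count on the hardware described above has not been measured here.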

Why does splitting the work across more threads take more time?
