Learning multithreading in C++: adding threads doesn't make execution faster, even though it seems like it should

Time: 2018-10-28 09:35:24

Tags: c++ multithreading

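Long story short, I came across the Monty Hall problem and wanted to throw something together so I could test it computationally. That worked out fine, but along the way I got curious about multithreaded applications in C++. I'm a CS student, but I've only covered the topic briefly, and in a different language. I wanted to see whether I could use some extra CPU cores to make the Monty Hall simulation run a bit faster.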

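It looks like I got it working, but alas, there isn't actually any performance gain. The program performs a large number of iterations of a simple function that essentially boils down to a few rand_r() calls and a few comparisons. I expected it to be a trivial example of something that can be split between threads, basically by having each thread handle an equal share of the total iterations.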

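I'm just trying to understand this. I wonder whether I'm making a mistake, or whether the execution is somehow being multithreaded behind the scenes even when I specify only 1 thread in the code.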

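Anyway, have a look and share your thoughts. Also keep in mind that I did this purely as a learning experience and didn't originally intend for anyone else to read it :D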

#include <cstdlib>
#include <climits>
#include <ctime>
#include <iostream>
#include <thread>
#include <chrono>

enum strategy {STAY = 0, SWITCH = 1};
unsigned ITERATIONS = 1;
unsigned THREADS = 5;

struct counts
{
    unsigned stay_correct_c;
    unsigned switch_correct_c;
};

void simulate (struct counts&, unsigned&);
bool game (enum strategy, unsigned&);

int main (int argc, char **argv)
{
    if (argc < 2)
        std::cout << "Usage: " << argv[0] << " -i [t|s|m|l|x] -t [1|2|4|5|10]\n", exit(1);

    if (argv[1][1] == 'i') {
        switch (argv[2][0]) {
        case 's':
            ITERATIONS = 1000;
            break;
        case 'm':
            ITERATIONS = 100000;
            break;
        case 'l':
            ITERATIONS = 10000000;
            break;
        case 'x':
            ITERATIONS = 1000000000;
            break;
        default:
            std::cerr << "Invalid argument.\n", exit(1);
        }
    }

    if (argv[3][1] == 't') {
        switch (argv[4][0])
        {
        case '1':
            if (argv[4][1] != '0')
                THREADS = 1;
            else if (argv[4][1] == '0')
                THREADS = 10;
            break;
        case '2':
            THREADS = 2;
            break;
        case '4':
            THREADS = 4;
            break;
        case '5':
            THREADS = 5;
            break;
        }
    }

    srand(time(NULL));

    auto start = std::chrono::high_resolution_clock::now();
    struct counts total_counts;
    total_counts.stay_correct_c = 0;
    total_counts.switch_correct_c = 0;
    struct counts per_thread_count[THREADS];
    std::thread* threads[THREADS];
    unsigned seeds[THREADS];

    for (unsigned i = 0; i < THREADS; ++i) {
        seeds[i] = rand() % UINT_MAX;
        threads[i] = new std::thread (simulate, std::ref(per_thread_count[i]), std::ref(seeds[i]));
    }

    for (unsigned i = 0; i < THREADS; ++i) {
        std::cout << "Waiting for thread " << i << " to finish...\n";
        threads[i]->join();
    }

    for (unsigned i = 0; i < THREADS; ++i) {
        total_counts.stay_correct_c += per_thread_count[i].stay_correct_c;
        total_counts.switch_correct_c += per_thread_count[i].switch_correct_c;
    }

    auto stop = std::chrono::high_resolution_clock::now();
    std::cout <<
        "The simulation performed " << ITERATIONS <<
        " iterations on " << THREADS << " threads of both the stay and switch strategies " <<
        "taking " << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count() <<
        " ms." << std::endl <<
        "Score:" << std::endl <<
        "  Stay Strategy: " << total_counts.stay_correct_c << std::endl <<
        "  Switch Strategy: " << total_counts.switch_correct_c << std::endl << std::endl <<
        "Ratios:" << std::endl <<
        "  Stay Strategy: " << (double)total_counts.stay_correct_c / (double)ITERATIONS << std::endl <<
        "  Switch Strategy: " << (double)total_counts.switch_correct_c / (double)ITERATIONS << std::endl << std::endl;
}

void simulate (struct counts& c, unsigned& seed)
{
    c.stay_correct_c = 0;
    c.switch_correct_c = 0;
    for (unsigned i = 0; i < (ITERATIONS / THREADS); ++i) {
        if (game (STAY, seed))
            ++c.stay_correct_c;
        if (game (SWITCH, seed))
            ++c.switch_correct_c;
    }
}

bool game (enum strategy player_strat, unsigned& seed)
{
    unsigned correct_door = rand_r(&seed) % 3;
    unsigned player_choice = rand_r(&seed) % 3;
    unsigned elim_door;
    // The host opens a door that is neither the prize door nor the
    // player's door, so retry while the pick collides with either one.
    do {
        elim_door = rand_r(&seed) % 3;
    }
    while ((elim_door == correct_door) || (elim_door == player_choice));
    seed = rand_r(&seed);
    if (player_strat == SWITCH) {
        // Move to the one remaining door: step past the original choice
        // and skip the door the host opened.
        do
            player_choice = (player_choice + 1) % 3;
        while (player_choice == elim_door);
        return correct_door == player_choice;
    }
    else
        return correct_door == player_choice;
}
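
A side note on the split used in simulate: ITERATIONS / THREADS is an integer division, so any remainder is silently dropped while the final ratios still divide by ITERATIONS. The sketch below shows one way to hand out the work so the parts always add up to the total. The helper name split_iterations is hypothetical and not part of the program above.

```cpp
#include <vector>

// Hypothetical helper: splits `total` iterations across `threads` workers so
// that the parts sum to exactly `total`. The first (total % threads) workers
// each receive one extra iteration.
std::vector<unsigned> split_iterations(unsigned total, unsigned threads)
{
    std::vector<unsigned> parts;
    if (threads == 0)
        return parts;                       // nothing to split across
    parts.assign(threads, total / threads); // equal base share for everyone
    for (unsigned i = 0; i < total % threads; ++i)
        ++parts[i];                         // distribute the remainder
    return parts;
}
```

Each thread would then loop over its own entry, for example parts[i], instead of ITERATIONS / THREADS.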

Edit: adding some supplementary information at the suggestion of some solid comments below.

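I'm running on a 6-core/12-thread AMD Ryzen R5 1600. htop shows the number of heavily utilized logical cores that I would expect from the command-line arguments. The number of PIDs equals the specified thread count plus one, and the number of logical cores at ~100% utilization matches the specified thread count in every case.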

As for numbers, here is some data I collected using the l flag for a large number of iterations:

CORES    AVG      MIN      MAX
1     102541   102503   102613
4      90183    86770    96248
10     72119    63581    91438

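With a division of work as simple as this program's, I expected the total time to decrease linearly as threads are added, but I'm clearly missing something. My thinking was that if 1 thread can perform x simulations in time y, then that thread should be able to perform x/4 simulations in time y/4. What am I misunderstanding here?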
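
To put the averages in the table above into perspective, it can help to convert them into speedup, T(1) / T(n), and parallel efficiency, speedup / n. The sketch below does that arithmetic for the three reported AVG values; the function names are illustrative, and the time unit does not matter because only ratios are used.

```cpp
#include <cstdio>

// Speedup relative to the single-thread run: T(1) / T(n).
double speedup(double t1, double tn)
{
    return t1 / tn;
}

// Parallel efficiency: speedup divided by the thread count (1.0 is ideal).
double efficiency(double t1, double tn, unsigned threads)
{
    return speedup(t1, tn) / threads;
}

// Prints speedup and efficiency for the AVG column of the table above.
void print_scaling()
{
    const double t1 = 102541.0; // average for 1 thread
    const struct { unsigned threads; double avg; } rows[] = {
        {1, 102541.0}, {4, 90183.0}, {10, 72119.0}};
    for (const auto& r : rows)
        std::printf("%2u threads: speedup %.2fx, efficiency %.0f%%\n",
                    r.threads, speedup(t1, r.avg),
                    100.0 * efficiency(t1, r.avg, r.threads));
}
```

With these figures the speedup is roughly 1.14x on 4 threads and 1.42x on 10 threads, far below the 4x that linear scaling would predict for the 4-thread run.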

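Edit 2: I should add that with the code as it stands above, the timing differences between thread counts are less pronounced; I have since made some minor optimizations that made the delta larger.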

1 answer:

Answer 0: (score: 1)

Thanks for posting the code; it doesn't compile on my machine (Apple LLVM version 9.0.0 (clang-900.0.39.2)). Love standards.

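I hacked it into a C version, and your problem appears to be false sharing. That is, each thread hits its own "seeds" entry very frequently, but because memory caches aggregate adjacent locations into "lines", the CPUs spend all their time copying those lines back and forth between cores. If you change the definition of your "seeds" to something like this: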

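struct myseed {
    unsigned seed;
    /* Padding: with 4-byte unsigned this makes the struct 64 bytes,
       a typical cache-line size, so neighbouring seeds do not share a line. */
    unsigned dont_share_me[15];
};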

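You should see the scalability you want. You will probably want to do the same for struct counts. Bundling each thread's context into one struct and allocating it separately with malloc can also keep the threads apart, but malloc only guarantees alignment suitable for fundamental types, not cache-line alignment, so explicit padding or alignment is the more dependable route.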
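
In C++11 and later, the same padding idea can be expressed with alignas instead of a manual filler array. The following is a minimal sketch, not the asker's program: the 64-byte line size is an assumption, the names ThreadState, next_random, worker and run_padded are hypothetical, and a small linear congruential step stands in for rand_r() to keep the example portable. It also plays a single game per iteration and counts it as a stay win or a switch win, which is a simplification of the original loop.

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Assumed cache-line size; 64 bytes is typical on x86-64 but not guaranteed.
constexpr std::size_t kCacheLine = 64;
constexpr unsigned kThreads = 4;

// All per-thread data lives in one over-aligned struct, so each array
// element starts on its own cache line and threads never share a line.
struct alignas(kCacheLine) ThreadState {
    unsigned seed = 0;
    unsigned stay_correct = 0;
    unsigned switch_correct = 0;
};
static_assert(sizeof(ThreadState) % kCacheLine == 0, "state fills whole lines");

// Stand-in for rand_r(): one linear congruential step on the caller's seed.
inline unsigned next_random(unsigned& seed)
{
    seed = seed * 1664525u + 1013904223u;
    return seed >> 16;
}

// Staying wins when the first pick is the prize door; switching wins otherwise.
void worker(ThreadState& s, unsigned iterations)
{
    for (unsigned i = 0; i < iterations; ++i) {
        unsigned correct = next_random(s.seed) % 3;
        unsigned choice = next_random(s.seed) % 3;
        if (correct == choice)
            ++s.stay_correct;
        else
            ++s.switch_correct;
    }
}

// Runs kThreads workers and returns the total number of games played.
unsigned long run_padded(unsigned iterations_per_thread)
{
    ThreadState states[kThreads]; // automatic storage honors alignas
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < kThreads; ++i) {
        states[i].seed = 12345u + i; // arbitrary distinct seeds
        pool.emplace_back(worker, std::ref(states[i]), iterations_per_thread);
    }
    for (auto& t : pool)
        t.join();
    unsigned long games = 0;
    for (const auto& s : states)
        games += static_cast<unsigned long>(s.stay_correct) + s.switch_correct;
    return games;
}
```

Calling run_padded(1000000) from main runs four workers of one million games each. Another option that avoids depending on layout at all is to copy the seed and counters into local variables inside the worker and write them back once after the loop. Where the toolchain provides it, C++17's std::hardware_destructive_interference_size from <new> can replace the hard-coded 64.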