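Long story short, I came across the Monty Hall problem and wanted to throw something together so I could test it computationally. That worked fine, but along the way I got curious about multithreaded applications in C++. I'm a CS student, but I've only covered the topic briefly, and in a different language. I wanted to see whether I could use some extra CPU cores to make the Monty Hall simulation run a bit faster.
It seems I got it working, but alas there is no real performance gain. The program performs a large number of iterations of a simple function that essentially boils down to a few rand_r() calls and a few comparisons. I expected it to be a trivial example of work that can be split across threads, basically by having each thread handle an equal share of the total iterations.
I'm just trying to understand this, and I'm wondering whether I'm making a mistake, or whether multithreaded execution is already happening behind the scenes even when I specify only 1 thread in the code.
Anyway, take a look and share your thoughts. Also keep in mind that I only did this as a learning experience and didn't originally intend for anyone else to read it :D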
#include <cstdlib>
#include <climits>
#include <ctime>
#include <functional>   // std::ref
#include <iostream>
#include <thread>
#include <chrono>
enum strategy {STAY = 0, SWITCH = 1};
unsigned ITERATIONS = 1;
unsigned THREADS = 5;
struct counts
{
    unsigned stay_correct_c;
    unsigned switch_correct_c;
};
void simulate (struct counts&, unsigned&);
bool game (enum strategy, unsigned&);
int main (int argc, char **argv)
{
    // Both "-i <size>" and "-t <threads>" are read unconditionally below,
    // so all four arguments must be present.
    if (argc < 5) {
        std::cerr << "Usage: " << argv[0] << " -i [t|s|m|l|x] -t [1|2|4|5|10]\n";
        exit(1);
    }
    if (argv[1][1] == 'i') {
        switch (argv[2][0]) {
        case 's':
            ITERATIONS = 1000;
            break;
        case 'm':
            ITERATIONS = 100000;
            break;
        case 'l':
            ITERATIONS = 10000000;
            break;
        case 'x':
            ITERATIONS = 1000000000;
            break;
        default:
            std::cerr << "Invalid argument.\n";
            exit(1);
        }
    }
    if (argv[3][1] == 't') {
        switch (argv[4][0]) {
        case '1':
            // "1" selects one thread, "10" selects ten.
            THREADS = (argv[4][1] == '0') ? 10 : 1;
            break;
        case '2':
            THREADS = 2;
            break;
        case '4':
            THREADS = 4;
            break;
        case '5':
            THREADS = 5;
            break;
        }
    }
    srand(time(NULL));
    auto start = std::chrono::high_resolution_clock::now();
    struct counts total_counts;
    total_counts.stay_correct_c = 0;
    total_counts.switch_correct_c = 0;
    // Runtime-sized arrays are a compiler extension (accepted by GCC), not standard C++.
    struct counts per_thread_count[THREADS];
    std::thread* threads[THREADS];
    unsigned seeds[THREADS];
    for (unsigned i = 0; i < THREADS; ++i) {
        seeds[i] = rand();
        threads[i] = new std::thread (simulate, std::ref(per_thread_count[i]), std::ref(seeds[i]));
    }
    for (unsigned i = 0; i < THREADS; ++i) {
        std::cout << "Waiting for thread " << i << " to finish...\n";
        threads[i]->join();
        delete threads[i];   // release the thread object once it has finished
    }
    for (unsigned i = 0; i < THREADS; ++i) {
        total_counts.stay_correct_c += per_thread_count[i].stay_correct_c;
        total_counts.switch_correct_c += per_thread_count[i].switch_correct_c;
    }
    auto stop = std::chrono::high_resolution_clock::now();
    std::cout <<
        "The simulation performed " << ITERATIONS <<
        " iterations on " << THREADS << " threads of both the stay and switch strategies " <<
        "taking " << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count() <<
        " ms." << std::endl <<
        "Score:" << std::endl <<
        " Stay Strategy: " << total_counts.stay_correct_c << std::endl <<
        " Switch Strategy: " << total_counts.switch_correct_c << std::endl << std::endl <<
        "Ratios:" << std::endl <<
        " Stay Strategy: " << (double)total_counts.stay_correct_c / (double)ITERATIONS << std::endl <<
        " Switch Strategy: " << (double)total_counts.switch_correct_c / (double)ITERATIONS << std::endl << std::endl;
}
void simulate (struct counts& c, unsigned& seed)
{
    c.stay_correct_c = 0;
    c.switch_correct_c = 0;
    for (unsigned i = 0; i < (ITERATIONS / THREADS); ++i) {
        if (game (STAY, seed))
            ++c.stay_correct_c;
        if (game (SWITCH, seed))
            ++c.switch_correct_c;
    }
}
bool game (enum strategy player_strat, unsigned& seed)
{
    unsigned correct_door = rand_r(&seed) % 3;
    unsigned player_choice = rand_r(&seed) % 3;
    unsigned elim_door;
    // The host opens a door that is neither the prize door nor the player's pick.
    do {
        elim_door = rand_r(&seed) % 3;
    }
    while ((elim_door == correct_door) || (elim_door == player_choice));
    seed = rand_r(&seed);
    if (player_strat == SWITCH) {
        // Move to the one remaining door: not the original pick, not the opened door.
        do
            player_choice = (player_choice + 1) % 3;
        while (player_choice == elim_door);
    }
    return correct_door == player_choice;
}
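EDIT: Adding some supplementary information at the suggestion of some solid comments below.
I'm running on a 6-core/12-thread AMD Ryzen 5 1600. htop shows the number of highly utilized logical cores I would expect from the command-line arguments. The number of PIDs equals the specified thread count plus one, and the number of logical cores at ~100% utilization matches the specified thread count in each case.
As for numbers, here is some data I collected using the l flag for a large number of iterations: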
CORES   AVG      MIN      MAX
1       102541   102503   102613
4       90183    86770    96248
10      72119    63581    91438
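With a division scheme as simple as this program's, I expected the total time to drop linearly as threads were added, but clearly I'm missing something. My thinking was that if 1 thread can perform x simulations in time y, then that thread should be able to perform x/4 simulations in time y/4. What am I misunderstanding here?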
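To put a number on how far the measurements are from that linear expectation, here is a small sketch that computes speedup and parallel efficiency from the averages in the table. The helper names (speedup, efficiency, ideal_time) are made up for illustration and are not part of the program above.

```cpp
// Speedup of a multi-threaded run relative to the single-thread run.
// Both times must be in the same unit.
double speedup(double t_single, double t_parallel)
{
    return t_single / t_parallel;
}

// Parallel efficiency: 1.0 would mean perfectly linear scaling.
double efficiency(double t_single, double t_parallel, unsigned threads)
{
    return speedup(t_single, t_parallel) / threads;
}

// Run time that perfectly linear scaling would predict for `threads` threads.
double ideal_time(double t_single, unsigned threads)
{
    return t_single / threads;
}
```

Plugging in the averages, speedup(102541, 90183) is roughly 1.14 on 4 threads and speedup(102541, 72119) is roughly 1.42 on 10 threads, which works out to an efficiency of about 28% and 14% respectively. Linear scaling would have predicted averages near 25635 and 10254.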
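EDIT 2: I should add that with the code as it stands above, the timing differences between thread counts are less pronounced, but I made a few small optimizations that made the deltas larger.
Answer 0 (score: 1)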
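Thanks for posting the code; it does not compile on my machine (Apple LLVM version 9.0.0 (clang-900.0.39.2)). Gotta love standards.
I hacked it into a C version, and your problem appears to be false sharing. That is, each thread hits its own "seeds" entry very frequently, but because the memory caches aggregate adjacent locations into "lines", the CPUs spend all their time copying those lines back and forth between cores. If you change the definition of your seed to something like this: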
struct myseed {
    unsigned seed;
    unsigned dont_share_me[15];   /* pads the struct to 64 bytes, assuming 4-byte unsigned */
};
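you should see the scalability you want. You may want to do the same for struct counts. Typically malloc makes this adjustment for you, so if you bundle your per-thread context into one package and malloc it, it returns properly cache-aligned locations.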
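For what it's worth, here is a minimal C++ sketch of the same idea that requests the alignment explicitly instead of relying on padding or on the allocator (malloc itself only guarantees alignment suitable for fundamental types). Everything named here is an assumption made for illustration: the 64-byte CACHE_LINE constant, the thread_ctx bundle, the MAX_THREADS cap, and next_random, a small linear congruential generator standing in for rand_r() to keep the sketch portable. The game is also simplified to the observation that staying wins exactly when the first pick was right.

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Assumed cache-line size: 64 bytes is typical on x86-64 but not guaranteed.
constexpr std::size_t CACHE_LINE = 64;
constexpr unsigned MAX_THREADS = 16;   // hypothetical cap for this sketch

// Hypothetical per-thread bundle. alignas also rounds sizeof up to a multiple
// of CACHE_LINE, so neighbouring array elements never share a line.
struct alignas(CACHE_LINE) thread_ctx {
    unsigned seed = 0;
    unsigned stay_correct_c = 0;
    unsigned switch_correct_c = 0;
};

static_assert(sizeof(thread_ctx) % CACHE_LINE == 0, "one context per cache line");

// Stand-in for rand_r(): a simple linear congruential generator.
unsigned next_random(unsigned& seed)
{
    seed = seed * 1103515245u + 12345u;
    return (seed >> 16) & 0x7fffu;
}

// Each thread reads and writes only its own context.
void simulate_padded(thread_ctx& ctx, unsigned iterations)
{
    for (unsigned i = 0; i < iterations; ++i) {
        unsigned correct_door = next_random(ctx.seed) % 3;
        unsigned player_choice = next_random(ctx.seed) % 3;
        // Staying wins when the first pick was right; switching wins otherwise.
        if (correct_door == player_choice)
            ++ctx.stay_correct_c;
        else
            ++ctx.switch_correct_c;
    }
}

// Runs up to MAX_THREADS workers and returns the summed counters.
thread_ctx run_padded(unsigned threads, unsigned iterations_per_thread)
{
    if (threads > MAX_THREADS)
        threads = MAX_THREADS;
    thread_ctx ctx[MAX_THREADS];
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < threads; ++i) {
        ctx[i].seed = 1234u + i;   // arbitrary distinct seeds
        pool.emplace_back(simulate_padded, std::ref(ctx[i]), iterations_per_thread);
    }
    for (auto& t : pool)
        t.join();
    thread_ctx total;
    for (unsigned i = 0; i < threads; ++i) {
        total.stay_correct_c += ctx[i].stay_correct_c;
        total.switch_correct_c += ctx[i].switch_correct_c;
    }
    return total;
}
```

Calling run_padded(4, 2500000) would split ten million games over four threads. Whether that actually scales on a given machine is something to measure rather than assume, and on some toolchains the program needs to be built with -pthread.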