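I have implemented the merge sort algorithm in C++.
Inside the algorithm, it checks whether the size of the array is greater than min_size_to_thread and, if it is, makes the recursive calls on separate threads.
But when I increase min_size_to_thread, which reduces the number of threads being used, the function gets faster. This is true even when comparing 1 thread with 2 threads.
My assumption was that the function would get faster as the number of threads increased, up to a point, and then start to slow down again. The behavior I see makes no sense to me, so I am starting to believe that my implementation is wrong in some way.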
#include <chrono>
#include <iostream>
#include <thread>

template <typename T>
void merge_sort(T S[], int S_size, int min_size_to_thread)
{
    if (S_size < 2) return;

    // Left Sequence: copy the first half of S
    int L_size = S_size / 2;
    T* L = new T[L_size];
    for (int i = 0; i < L_size; i++)
    {
        L[i] = S[i];
    }

    // Right Sequence: copy the remaining elements of S
    int R_size = (S_size + 1) / 2;
    T* R = new T[R_size];
    for (int i = 0; i < R_size; i++)
    {
        R[i] = S[i + L_size];
    }

    if (S_size > min_size_to_thread)
    {
        // Large input: sort each half on its own thread
        std::thread thread_left(merge_sort<T>, L, L_size, min_size_to_thread);
        std::thread thread_right(merge_sort<T>, R, R_size, min_size_to_thread);
        thread_right.join();
        thread_left.join();
    }
    else
    {
        // Small input: sort both halves on the current thread
        merge_sort<T>(L, L_size, min_size_to_thread);
        merge_sort<T>(R, R_size, min_size_to_thread);
    }

    // Merge the two sorted halves back into S
    int S_iterator = 0;
    int L_iterator = 0;
    int R_iterator = 0;
    while ((L_iterator < L_size) && (R_iterator < R_size))
    {
        if (L[L_iterator] < R[R_iterator])
        {
            S[S_iterator] = L[L_iterator];
            ++L_iterator;
        }
        else
        {
            S[S_iterator] = R[R_iterator];
            ++R_iterator;
        }
        ++S_iterator;
    }

    // Copy whatever is left in either half
    while (L_iterator < L_size)
    {
        S[S_iterator] = L[L_iterator];
        ++L_iterator;
        ++S_iterator;
    }
    while (R_iterator < R_size)
    {
        S[S_iterator] = R[R_iterator];
        ++R_iterator;
        ++S_iterator;
    }

    delete[] L;
    delete[] R;
}

int main()
{
    const int S_size = 500000;
    unsigned char S[S_size];

    for (int i = 0; i < S_size; ++i)
    {
        S[i] = i % 255;
    }
    int min_size_to_thread;
    min_size_to_thread = 250;
    auto t1 = std::chrono::high_resolution_clock::now();
    merge_sort(S, S_size, min_size_to_thread);
    auto t2 = std::chrono::high_resolution_clock::now();
    std::cout << "size > " << min_size_to_thread << ": " << (t2 - t1) / std::chrono::milliseconds(1) << std::endl;

    for (int i = 0; i < S_size; ++i)
    {
        S[i] = i % 255;
    }
    min_size_to_thread = 500;
    t1 = std::chrono::high_resolution_clock::now();
    merge_sort(S, S_size, min_size_to_thread);
    t2 = std::chrono::high_resolution_clock::now();
    std::cout << "size > " << min_size_to_thread << ": " << (t2 - t1) / std::chrono::milliseconds(1) << std::endl;

    for (int i = 0; i < S_size; ++i)
    {
        S[i] = i % 255;
    }
    min_size_to_thread = 1000;
    t1 = std::chrono::high_resolution_clock::now();
    merge_sort(S, S_size, min_size_to_thread);
    t2 = std::chrono::high_resolution_clock::now();
    std::cout << "size > " << min_size_to_thread << ": " << (t2 - t1) / std::chrono::milliseconds(1) << std::endl;

    for (int i = 0; i < S_size; ++i)
    {
        S[i] = i % 255;
    }
    min_size_to_thread = 10000;
    t1 = std::chrono::high_resolution_clock::now();
    merge_sort(S, S_size, min_size_to_thread);
    t2 = std::chrono::high_resolution_clock::now();
    std::cout << "size > " << min_size_to_thread << ": " << (t2 - t1) / std::chrono::milliseconds(1) << std::endl;

    for (int i = 0; i < S_size; ++i)
    {
        S[i] = i % 255;
    }
    min_size_to_thread = 250000;
    t1 = std::chrono::high_resolution_clock::now();
    merge_sort(S, S_size, min_size_to_thread);
    t2 = std::chrono::high_resolution_clock::now();
    std::cout << "size > " << min_size_to_thread << ": " << (t2 - t1) / std::chrono::milliseconds(1) << std::endl;

    for (int i = 0; i < S_size; ++i)
    {
        S[i] = i % 255;
    }
    min_size_to_thread = 500000;
    t1 = std::chrono::high_resolution_clock::now();
    merge_sort(S, S_size, min_size_to_thread);
    t2 = std::chrono::high_resolution_clock::now();
    std::cout << "size > " << min_size_to_thread << ": " << (t2 - t1) / std::chrono::milliseconds(1) << std::endl;

    return 0;
}

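Answer 0 (score: 3)
I compiled and ran your exact program, with no modifications other than adding the includes, and the results are roughly what you expected (times in milliseconds):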
size > 250: 169
size > 500: 85
size > 1000: 50
size > 10000: 29
size > 250000: 42
size > 500000: 89
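Judging from your screenshot, I gather that you are running the code in Visual Studio. The default run button attaches the debugger to your executable, which reduces runtime performance. Instead, press Ctrl+F5 to run without the debugger, or choose Debug -> Start Without Debugging from the menu.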
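Answer 1 (score: 1)
I think this is a cache problem. Specifically, false sharing slows the algorithm down because data is written to pages that are shared between several threads. (The different processor cores have to keep the shared memory pages in sync.) If min_size_to_thread is a multiple of the processor's page size and the array is aligned on a page boundary, performance will improve. In that case, pages will not be shared between threads.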
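I always limit thread creation to a constant number; running 100 threads on a quad-core machine just to sort an array makes no sense. Because of the heavy context switching, running many threads on a single core is expensive. In my experience, the maximum number of threads should always be the number of cores times 2. A single core can handle about 2 threads without a performance cost. On a quad-core CPU, the program should run at most 8 threads at a time.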
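This means an algorithm can either create 8 child threads while the parent thread only joins them, or create 7 child threads, run part of the algorithm in the parent thread, and finally join the other 7 threads.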
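The second variant (fewer child threads, with the parent doing part of the work) could look roughly like the sketch below. It is not the asker's code and only illustrates the idea: the names spawn_depth_for, merge_sort_capped, and parallel_merge_sort are made up for this example, it uses std::vector with one shared scratch buffer instead of new[]/delete[], and it takes the "cores times 2" rule of thumb from this answer as an assumption rather than a measured optimum.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: largest depth d such that 2^d <= max_threads.
// Each spawning level doubles the number of threads doing work.
inline int spawn_depth_for(unsigned max_threads)
{
    int depth = 0;
    while (depth < 16 && (2u << depth) <= max_threads) ++depth;
    return depth;
}

// Sorts data[lo, hi) using buf[lo, hi) as scratch space.
// `depth` is the number of recursion levels still allowed to spawn a thread.
template <typename T>
void merge_sort_capped(std::vector<T>& data, std::vector<T>& buf,
                       std::size_t lo, std::size_t hi, int depth)
{
    if (hi - lo < 2) return;
    const std::size_t mid = lo + (hi - lo) / 2;

    if (depth > 0)
    {
        // One child thread sorts the left half; the parent sorts the
        // right half itself instead of idling in join().
        std::thread left([&data, &buf, lo, mid, depth] {
            merge_sort_capped(data, buf, lo, mid, depth - 1);
        });
        merge_sort_capped(data, buf, mid, hi, depth - 1);
        left.join();
    }
    else
    {
        // Thread budget used up: plain sequential recursion.
        merge_sort_capped(data, buf, lo, mid, 0);
        merge_sort_capped(data, buf, mid, hi, 0);
    }

    // Merge the two sorted halves through the scratch buffer.
    // The two threads only ever touch disjoint index ranges.
    T* d = data.data();
    T* b = buf.data();
    std::merge(d + lo, d + mid, d + mid, d + hi, b + lo);
    std::copy(b + lo, b + hi, d + lo);
}

// Entry point: the thread budget is "cores times 2", an assumption
// taken from the rule of thumb above.
template <typename T>
void parallel_merge_sort(std::vector<T>& data)
{
    unsigned cores = std::thread::hardware_concurrency(); // may return 0
    if (cores == 0) cores = 1;
    std::vector<T> buf(data.size());
    merge_sort_capped(data, buf, 0, data.size(), spawn_depth_for(2 * cores));
}
```

Calling parallel_merge_sort(v) on a std::vector<int> sorts it in place. On a quad-core machine the budget is 8, so the spawn depth is 3 and at most 8 threads work at the same time, no matter how large the array is. Whether this is actually faster than the original still has to be measured.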
Always use a profiler; the cause may be something completely different.