Question

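I am working on a recursive algorithm that we want to parallelize to improve performance.

I am using Visual C++ 12.0 and the <thread> library. However, I am not seeing any performance improvement. The time taken is within a few milliseconds of the single-threaded time, sometimes slightly less and sometimes slightly more.

Please tell me whether I am doing something wrong and what corrections I should make to the code.

Here is my code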

void nonRecursiveFoo(<className> &data, int first, int last)
{

    //process the data between first and last index and set its value to true based on some condition
    //no threads are created here
}


void recursiveFoo(<className> &data, int first, int last)
{

    int partitionIndex = -1;
    data[first]=true;
    data[last]=true;
    for (int i = first + 1; i < last; i++)
    {
        // some logic setting the index
        if (/* some condition is true */)
            partitionIndex = i;
    }

    // the partitions do not depend on one another, so they can be processed in parallel
    if( partitionIndex != -1)
    {
        data[partitionIndex]=true;

        //assume some threadlimit
        if (Commons::GetCurrentThreadCount() < Commons::GetThreadLimit())
        {

            std::thread t1(recursiveFoo, std::ref(data), first, partitionIndex);
            Commons::IncrementCurrentThreadCount();
            recursiveFoo(data, partitionIndex , last);
            t1.join();
        }
        else
        {
            nonRecursiveFoo(data, first, partitionIndex );
            nonRecursiveFoo(data, partitionIndex , last);
        }

    }
}

// main

int main()
{
    recursiveFoo(data, 0, data.size() - 1);
}

// Commons

std::mutex threadCountMutex;
static void Commons::IncrementCurrentThreadCount()
{
    threadCountMutex.lock();
        CurrentThreadCount++;
    threadCountMutex.unlock();
}

static int GetCurrentThreadCount()
{
    return CurrentThreadCount;
}
static void SetThreadLimit(int count)
{
    ThreadLimit = count;
}
static int GetThreadLimit()
{
    return ThreadLimit;
}
static int GetMinPointsPerThread()
{
    return MinimumPointsPerThread;
}

Answer 1

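Without further information (see comments) this is mostly guesswork, but there are a few things to watch out for:

First of all, make sure your partitioning logic is very short and fast compared to the processing. Otherwise you are just creating more work than you gain in processing power.
Make sure there is enough work to begin with, or the speedup may not be enough to pay for the extra overhead of thread creation.
Check that your work is evenly distributed among the threads, and do not spawn more threads than your computer has cores (print the total number of threads at the end; do not rely on your thread limit).
Do not let your partitions get too small (in particular, no smaller than 64 bytes), or you will end up with false sharing.
It would be more efficient to implement CurrentThreadCount as a std::atomic<int>, in which case you do not need a mutex.
Put the increment of the counter before the creation of the thread. Otherwise, the newly created thread might read the counter before it is incremented and spawn another thread, even though the maximum number of threads has already been reached. (This is still not a perfect solution, but I would only invest more time in it once you have verified that oversubscription is your actual problem.)
If you really must use a mutex (for reasons beyond the example code), you have to use it for every access to CurrentThreadCount (both reads and writes). Otherwise this is, strictly speaking, a race condition and thus undefined behavior (UB).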
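
The two points about the counter (make it atomic, and increment it before the thread is created) can be combined into a single "reserve a slot" step. The following is only a minimal sketch under assumptions: currentThreadCount, threadLimit, tryReserveThread and countSuccessfulReservations are hypothetical names standing in for the question's Commons members, and the limit of 4 is arbitrary.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical stand-ins for the question's Commons counters.
std::atomic<int> currentThreadCount{1};  // the calling thread counts as one
int threadLimit = 4;                     // assumed limit; set only while no worker threads run

// Tries to reserve a slot for one more thread.
// Checking the limit and incrementing happen as one atomic step,
// so two threads can never both take the last free slot.
bool tryReserveThread()
{
    assert(threadLimit >= 1);
    int current = currentThreadCount.load();
    while (current < threadLimit)
    {
        // On failure, compare_exchange_weak reloads `current` with the latest value.
        if (currentThreadCount.compare_exchange_weak(current, current + 1))
            return true;
    }
    return false;
}

// Demonstration helper: lets several threads compete for the free slots
// and reports how many of them succeeded.
int countSuccessfulReservations(int contenders)
{
    std::atomic<int> successes{0};
    std::vector<std::thread> threads;
    for (int i = 0; i < contenders; ++i)
        threads.emplace_back([&successes] { if (tryReserveThread()) ++successes; });
    for (std::thread& t : threads)
        t.join();
    return successes.load();
}
```

In the recursive function, the idea would be to call tryReserveThread() first and construct the std::thread only if it returns true, falling back to the sequential path otherwise.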

Answer 2

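By using t1.join you are basically waiting for the other thread to finish, i.e. nothing is executed in parallel.

Looking at your algorithm, I do not see how it can be parallelized (and thus improved) by using threads: you have to wait for each single recursive call to finish.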

Answer 3

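First of all, you are not doing anything in parallel, because each thread creation blocks until the created thread has finished. As a result, your multithreaded code will always be slower than the non-multithreaded version.

To parallelize, you could spawn threads for the part where the non-recursive function is called, put the thread IDs into a vector, and join them at the top level of the recursion by walking through the vector. (There are more elegant ways to do this, but for a first attempt it should be fine, I think.)

That way, all non-recursive calls run in parallel. However, you should use a condition other than the maximum thread count, namely the size of the problem, e.g. last-first<threshold.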
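
The scheme described above might look roughly like the sketch below. It rests on several assumptions that are not in the question: the data is a std::vector<char> (chosen because std::vector<bool> packs bits, which makes concurrent writes to neighbouring elements unsafe), the partition index is simply the midpoint, the two halves do not overlap, and processRange, splitAndSpawn and runParallel are hypothetical names. The vector holds the std::thread objects themselves rather than thread IDs, since a thread can only be joined through its object.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical leaf work standing in for nonRecursiveFoo:
// marks every element in the inclusive range [first, last].
void processRange(std::vector<char>& data, std::size_t first, std::size_t last)
{
    for (std::size_t i = first; i <= last; ++i)
        data[i] = 1;
}

// Splits [first, last] on the calling thread. Each range smaller than the
// threshold is handed to its own worker thread; workers are only collected
// here, not joined, so they keep running while the splitting continues.
void splitAndSpawn(std::vector<char>& data, std::size_t first, std::size_t last,
                   std::size_t threshold, std::vector<std::thread>& workers)
{
    if (last - first < threshold)
    {
        workers.emplace_back(processRange, std::ref(data), first, last);
        return;
    }
    // Assumed partition rule: the midpoint. The real algorithm picks its own index.
    std::size_t mid = first + (last - first) / 2;
    splitAndSpawn(data, first, mid, threshold, workers);
    splitAndSpawn(data, mid + 1, last, threshold, workers);
}

// Top level of the recursion: start the splitting, then join every worker.
// Returns the number of worker threads that were created.
std::size_t runParallel(std::vector<char>& data, std::size_t threshold)
{
    assert(threshold > 0);
    std::vector<std::thread> workers;
    if (!data.empty())
        splitAndSpawn(data, 0, data.size() - 1, threshold, workers);
    for (std::thread& w : workers)
        w.join();
    return workers.size();
}
```

With this shape the number of threads is controlled by the threshold alone, so it would need to be chosen relative to the data size to avoid creating far more threads than there are cores.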

Multithreaded recursive program in C++