I am writing a console application that uses multiple threads. Each thread processes a set of images with OpenCV functions.
If I run the function that uses OpenCV in a single thread, I get a reference computation time. If I run the same function from several threads, the function (taken individually in each thread) is much slower (almost twice as slow), when it should take roughly the same time.
Does OpenCV parallelize, serialize, or block execution?
I tested the application with OpenCV libraries built both WITH_TBB and without TBB, and the results are almost identical. I don't know whether it matters, but I have also seen that some functions, such as cv::threshold or cv::findContours, create 12 extra worker threads while they execute. If the OpenCV calls are commented out, the time is the same for all threads and matches the single-threaded run, so in that case multithreading works well. The question is: is there an OpenCV build option or function call that makes multithreaded execution take the same time as single-threaded execution?
EDIT: These are the results of increasing the number of threads (cores) on a 4-core CPU, running the same function on 1, 2, 3 and 4 cores. Each core processes 768 images at 1600x1200 resolution in a for loop. Inside the loop, the call that causes the growing delay is made. I expected the time to stay roughly the same as the single-thread time (35000 ms), or within about 10% of it, regardless of the number of cores; but as can be seen, the time grows as the number of threads increases, and I cannot find the reason...
Times: (sorry, the system does not allow me to upload images to the post)
time in File No. 3 --> 35463
Mean time using 1 cores is: 47ms
time in File No. 3 --> 42747
time in File No. 3 --> 42709
Mean time using 2 cores is: 28ms
time in File No. 3 --> 54587
time in File No. 3 --> 54595
time in File No. 3 --> 54437
Mean time using 3 cores is: 24ms
time in File No. 3 --> 68751
time in File No. 3 --> 68865
time in File No. 3 --> 68878
time in File No. 3 --> 68622
Mean time using 4 cores is: 22ms
If no OpenCV code is used inside the function, the times for 1, 2, 3 or 4 threads are similar, as expected; but when an OpenCV function is used, for example just a simple call:
img.convertTo(img, CV_32F);
the time increases as the number of threads grows. I have also run the tests with the hyper-threading option disabled in the BIOS. In that case all the times drop (1 thread takes 25,000 ms), but the problem of increasing time remains (33 seconds with 2 threads, 43 with 3, and 57 with 4)... I don't know if that tells you anything. An MCVE:
#include "stdafx.h"
#include <future>
#include <chrono>
#include <cmath>     // for sqrt in With_OUT_Opencv
#include "Filter.h"
#include <iostream>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc/imgproc.hpp>

long long Ticks();
int WithOpencv(cv::Mat img);
int With_OUT_Opencv(cv::Mat img);
int TestThreads(char *buffer, std::string file);

#define Blur3x3(matrix,f,c) ((matrix[(f-1)*1600+(c-1)] + matrix[(f-1)*1600+c] + matrix[(f-1)*1600+(c+1)] + matrix[f*1600+(c-1)] + matrix[f*1600+c] + matrix[f*1600+(c+1)] + matrix[(f+1)*1600+(c-1)] + matrix[(f+1)*1600+c] + matrix[(f+1)*1600+(c+1)])/9)

int _tmain(int argc, _TCHAR* argv[])
{
    std::string file = "Test.bmp";
    auto function = [&](char *buffer){ return TestThreads(buffer, file); };
    char *buffers[12];
    std::future<int> frames[12];
    int i, j;
    int nframes = 0;
    int ncores;

    cv::setNumThreads(8);
    for (i = 0; i < 8; i++) buffers[i] = new char[1000*1024*1024];
    for (j = 1; j < 9; j++)
    {
        ncores = j;
        long long t = Ticks();
        for (i = 0; i < ncores; i++) frames[i] = std::async(std::launch::async, function, buffers[i]);
        for (i = 0; i < ncores; i++) nframes += frames[i].get();
        t = Ticks() - t;
        std::cout << "Mean time using " << ncores << " cores is: " << t/nframes << "ms" << std::endl << std::endl;
        nframes = 0;
        Sleep(2000);
    }
    for (int i = 0; i < 8; i++) delete [] buffers[i];
    return 0;
}
int TestThreads(char *buffer, std::string file)
{
    long long ta;
    int res;

    cv::Mat img(1200, 1600, CV_8UC1);
    img = cv::imread(file);
    ta = Ticks();
    for (int i = 0; i < 15; i++) {
        // Uncomment this and comment the next line to test without OpenCV calls.
        // With_OUT_Opencv implements simple filters with direct operations over the Mat data.
        //res = With_OUT_Opencv(img);
        res = WithOpencv(img);
    }
    ta = Ticks() - ta;
    std::cout << "Time in file No. 3 --> " << ta << std::endl;
    return 15;
}
int WithOpencv(cv::Mat img){
    cv::Mat img_bin;
    cv::Mat img_filtered;
    cv::Mat img_filtered2;
    cv::Mat img_res;
    int Crad_morf = 2;
    double Tthreshold = 20;

    cv::Mat element = cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(2*Crad_morf + 1, 2*Crad_morf + 1));
    img.convertTo(img, CV_32F);
    cv::blur(img, img_filtered, cv::Size(3, 3));
    cv::blur(img.mul(img), img_filtered2, cv::Size(3, 3));
    cv::sqrt(img_filtered2 - img_filtered.mul(img_filtered), img_res);
    cv::normalize(img_res, img_res, 0.0, 1.0, cv::NORM_MINMAX);
    img_res.convertTo(img_res, CV_8UC1, 255.0);
    cv::threshold(img_res, img_bin, Tthreshold, 255, cv::THRESH_BINARY);
    if (Crad_morf != 0){
        cv::dilate(img_bin, img_bin, element);
    }
    return 0;
}
int With_OUT_Opencv(cv::Mat img){
    unsigned char *baux1 = new unsigned char[1600*1200];
    unsigned short *baux2 = new unsigned short[1600*1200];
    unsigned char max = 0;
    int f, c, i;
    unsigned char threshold = 177;

    for (f = 1; f < 1199; f++)  // Crude blur filters
    {
        for (c = 1; c < 1599; c++)
        {
            baux1[f*1600+c] = Blur3x3(img.data, f, c);
            baux1[f*1600+c] = baux1[f*1600+c] * baux1[f*1600+c];
            baux2[f*1600+c] = img.data[f*1600+c] * img.data[f*1600+c];
        }
    }
    for (f = 1; f < 1199; f++)
    {
        for (c = 1; c < 1599; c++)
        {
            baux1[f*1600+c] = sqrt(Blur3x3(baux2, f, c) - baux1[f*1600+c]);
            if (baux1[f*1600+c] > max) max = baux1[f*1600+c];
        }
    }
    threshold = threshold * ((float)max/255.0);  // Crude normalize/binarize
    for (i = 0; i < 1600*1200; i++)
    {
        if (baux1[i] > threshold) baux1[i] = 1;
        else baux1[i] = 0;
    }
    delete [] baux1;
    delete [] baux2;
    return 0;
}
long long Ticks()
{
    static long long last = 0;
    static unsigned ticksPerMS = 0;
    LARGE_INTEGER largo;

    if (last == 0)
    {
        QueryPerformanceFrequency(&largo);
        ticksPerMS = (unsigned)(largo.QuadPart/1000);
        QueryPerformanceCounter(&largo);
        last = largo.QuadPart;
        return 0;
    }
    QueryPerformanceCounter(&largo);
    return (largo.QuadPart - last)/ticksPerMS;
}
Answer 0 (score: 1)
I am confused as to what your question is.
Your initial question suggests that running x iterations serially is much faster than running them in parallel, note: with the same target function, and you want to know why running that same target function in a multithreaded scenario is so much slower.
However, I now see that your example compares OpenCV's performance against some custom code. Is that what your question is about?
Relating to the question as I initially understood it, the answer is: no, running the target function serially is not significantly faster than running it in parallel. See the results and code below.
eight threads took 4104.38 ms
single thread took 7272.68 ms
four threads took 3687 ms
two threads took 4500.15 ms
(on an Apple MBA 2012 i5 & OpenCV 3)
#include <iostream>
#include <vector>
#include <chrono>
#include <thread>
#include <opencv2/opencv.hpp>

using namespace std;
using namespace std::chrono;
using namespace cv;

class benchmark {
    time_point<steady_clock> start = steady_clock::now();
    string title;
public:
    benchmark(const string& title) : title(title) {}
    ~benchmark() {
        auto diff = steady_clock::now() - start;
        cout << title << " took " << duration<double, milli>(diff).count() << " ms" << endl;
    }
};

template <typename F>
void repeat(unsigned n, F f) {
    while (n--) f();
}

int targetFunction(Mat img){
    cv::Mat img_bin;
    cv::Mat img_filtered;
    cv::Mat img_filtered2;
    cv::Mat img_res;
    int Crad_morf = 2;
    double Tthreshold = 20;

    cv::Mat element = cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(2*Crad_morf + 1, 2*Crad_morf + 1));
    img.convertTo(img, CV_32F);
    cv::blur(img, img_filtered, cv::Size(3, 3));
    cv::blur(img.mul(img), img_filtered2, cv::Size(3, 3));
    cv::sqrt(img_filtered2 - img_filtered.mul(img_filtered), img_res);
    cv::normalize(img_res, img_res, 0.0, 1.0, cv::NORM_MINMAX);
    img_res.convertTo(img_res, CV_8UC1, 255.0);
    cv::threshold(img_res, img_bin, Tthreshold, 255, cv::THRESH_BINARY);
    if (Crad_morf != 0){
        cv::dilate(img_bin, img_bin, element);
    }
    //imshow("WithOpencv", img_bin);
    return 0;
}

void runTargetFunction(int nIterations, int nThreads, const Mat& img) {
    int nIterationsPerThread = nIterations / nThreads;
    vector<thread> threads;
    auto targetFunctionFn = [&img]() {
        targetFunction(img);
    };

    setNumThreads(nThreads);
    repeat(nThreads, [&] {
        threads.push_back(thread([=]() {
            repeat(nIterationsPerThread, targetFunctionFn);
        }));
    });
    for (auto& thread : threads)
        thread.join();
}

int main(int argc, const char * argv[]) {
    string file = "../../opencv-test/Test.bmp";
    auto img = imread(file);
    const int nIterations = 64;

    // let's run using eight threads
    {
        benchmark b("eight threads");
        runTargetFunction(nIterations, 8, img);
    }

    // let's run using a single thread
    {
        benchmark b("single thread");
        runTargetFunction(nIterations, 1, img);
    }

    // let's run using four threads
    {
        benchmark b("four threads");
        runTargetFunction(nIterations, 4, img);
    }

    // let's run using two threads
    {
        benchmark b("two threads");
        runTargetFunction(nIterations, 2, img);
    }
    return 0;
}
Answer 1 (score: 1)
You are measuring three things:
- the mean time per processed image,
- the time each individual thread needs to finish its batch,
- the overall execution time.
You observe that the first of these drops from 47 ms to 22 ms when you increase the number of threads. That is good! At the same time, you realize that the time an individual thread needs rises from 35463 to roughly 68751 (whatever the unit). Finally, you realize that the overall execution time increases.
Regarding the second measurement: when the number of threads increases, the individual threads take longer to perform their operations. Two possible explanations: the threads are competing for a scarce shared resource, such as memory bandwidth or the shared cache; or OpenCV itself spawns worker threads for each call, so that your threads oversubscribe the available cores.
Now to the question of why the overall working time increases. The reason is simple: you are not only increasing the number of threads, you are also increasing the amount of work at the same rate. If your threads did not compete with each other at all and there were no overhead, N threads would need the same time to do N times the work. They do compete, so you notice the slowdown.