I am writing a console application that uses multiple threads. Each thread processes a set of images with OpenCV functions.
If I run the function that uses OpenCV in a single thread, I get a reference computation time. If I run the same function from several threads, the function (taken individually in each thread) is much slower (almost twice as slow), when it should take roughly the same time.
Does OpenCV parallelize, serialize, or block execution?
I tested the application with OpenCV libraries built both WITH_TBB and without TBB, and the results are almost identical. I don't know whether it matters, but I have also seen that some functions, such as cv::threshold or cv::findContours, create 12 extra worker threads while they execute. If the OpenCV calls are commented out, the time is the same for all threads and matches the single-threaded run, so in that case multithreading works well. The question is: is there an OpenCV build option or function call that makes multithreaded execution take the same time as single-threaded execution?
EDIT: These are the results of increasing the number of threads (cores) on a 4-core CPU, running the same function on 1, 2, 3 and 4 cores. Each core processes 768 images at 1600x1200 resolution in a for loop. Inside the loop, the call that causes the growing delay is made. I expected the time to stay roughly the same as the single-thread time (35000 ms), or within about 10% of it, regardless of the number of cores; but as can be seen, the time grows as the number of threads increases, and I cannot find the reason...
Times: (sorry, the system does not allow me to upload images to the post)
time in File No. 3 --> 35463
Mean time using 1 cores is: 47ms
time in File No. 3 --> 42747
time in File No. 3 --> 42709
Mean time using 2 cores is: 28ms
time in File No. 3 --> 54587
time in File No. 3 --> 54595
time in File No. 3 --> 54437
Mean time using 3 cores is: 24ms
time in File No. 3 --> 68751
time in File No. 3 --> 68865
time in File No. 3 --> 68878
time in File No. 3 --> 68622
Mean time using 4 cores is: 22ms
If no OpenCV code is used inside the function, the times for 1, 2, 3 or 4 threads are similar, as expected; but when an OpenCV function is used, for example just a simple call:
img.convertTo(img, CV_32F);
the time increases as the number of threads grows. I have also run the tests with the hyper-threading option disabled in the BIOS. In that case all the times drop (1 thread takes 25,000 ms), but the problem of increasing time remains (33 seconds with 2 threads, 43 with 3, and 57 with 4)... I don't know if that tells you anything. An MCVE:
#include "stdafx.h"
#include <future>
#include <chrono>
#include <cmath>     // for sqrt in With_OUT_Opencv
#include "Filter.h"
#include <iostream>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc/imgproc.hpp>

long long Ticks();
int WithOpencv(cv::Mat img);
int With_OUT_Opencv(cv::Mat img);
int TestThreads(char *buffer, std::string file);

#define Blur3x3(matrix,f,c) ((matrix[(f-1)*1600+(c-1)] + matrix[(f-1)*1600+c] + matrix[(f-1)*1600+(c+1)] + matrix[f*1600+(c-1)] + matrix[f*1600+c] + matrix[f*1600+(c+1)] + matrix[(f+1)*1600+(c-1)] + matrix[(f+1)*1600+c] + matrix[(f+1)*1600+(c+1)])/9)

int _tmain(int argc, _TCHAR* argv[])
{
    std::string file = "Test.bmp";
    auto function = [&](char *buffer){ return TestThreads(buffer, file); };
    char *buffers[12];
    std::future<int> frames[12];
    int i, j;
    int nframes = 0;
    int ncores;

    cv::setNumThreads(8);
    for (i = 0; i < 8; i++) buffers[i] = new char[1000*1024*1024];
    for (j = 1; j < 9; j++)
    {
        ncores = j;
        long long t = Ticks();
        for (i = 0; i < ncores; i++) frames[i] = std::async(std::launch::async, function, buffers[i]);
        for (i = 0; i < ncores; i++) nframes += frames[i].get();
        t = Ticks() - t;
        std::cout << "Mean time using " << ncores << " cores is: " << t/nframes << "ms" << std::endl << std::endl;
        nframes = 0;
        Sleep(2000);
    }
    for (int i = 0; i < 8; i++) delete [] buffers[i];
    return 0;
}
int TestThreads(char *buffer, std::string file)
{
    long long ta;
    int res;

    cv::Mat img(1200, 1600, CV_8UC1);
    img = cv::imread(file);
    ta = Ticks();
    for (int i = 0; i < 15; i++) {
        // Uncomment this and comment the next line to test without OpenCV calls.
        // With_OUT_Opencv implements simple filters with direct operations over the Mat data.
        //res = With_OUT_Opencv(img);
        res = WithOpencv(img);
    }
    ta = Ticks() - ta;
    std::cout << "Time in file No. 3 --> " << ta << std::endl;
    return 15;
}
int WithOpencv(cv::Mat img){
    cv::Mat img_bin;
    cv::Mat img_filtered;
    cv::Mat img_filtered2;
    cv::Mat img_res;
    int Crad_morf = 2;
    double Tthreshold = 20;

    cv::Mat element = cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(2*Crad_morf + 1, 2*Crad_morf + 1));
    img.convertTo(img, CV_32F);
    cv::blur(img, img_filtered, cv::Size(3, 3));
    cv::blur(img.mul(img), img_filtered2, cv::Size(3, 3));
    cv::sqrt(img_filtered2 - img_filtered.mul(img_filtered), img_res);
    cv::normalize(img_res, img_res, 0.0, 1.0, cv::NORM_MINMAX);
    img_res.convertTo(img_res, CV_8UC1, 255.0);
    cv::threshold(img_res, img_bin, Tthreshold, 255, cv::THRESH_BINARY);
    if (Crad_morf != 0){
        cv::dilate(img_bin, img_bin, element);
    }
    return 0;
}
int With_OUT_Opencv(cv::Mat img){
    unsigned char *baux1 = new unsigned char[1600*1200];
    unsigned short *baux2 = new unsigned short[1600*1200];
    unsigned char max = 0;
    int f, c, i;
    unsigned char threshold = 177;

    for (f = 1; f < 1199; f++)  // Crude blur filters
    {
        for (c = 1; c < 1599; c++)
        {
            baux1[f*1600+c] = Blur3x3(img.data, f, c);
            baux1[f*1600+c] = baux1[f*1600+c] * baux1[f*1600+c];
            baux2[f*1600+c] = img.data[f*1600+c] * img.data[f*1600+c];
        }
    }
    for (f = 1; f < 1199; f++)
    {
        for (c = 1; c < 1599; c++)
        {
            baux1[f*1600+c] = sqrt(Blur3x3(baux2, f, c) - baux1[f*1600+c]);
            if (baux1[f*1600+c] > max) max = baux1[f*1600+c];
        }
    }
    threshold = threshold * ((float)max/255.0);  // Crude normalize/binarize
    for (i = 0; i < 1600*1200; i++)
    {
        if (baux1[i] > threshold) baux1[i] = 1;
        else baux1[i] = 0;
    }
    delete [] baux1;
    delete [] baux2;
    return 0;
}
long long Ticks()
{
    static long long last = 0;
    static unsigned ticksPerMS = 0;
    LARGE_INTEGER largo;

    if (last == 0)
    {
        QueryPerformanceFrequency(&largo);
        ticksPerMS = (unsigned)(largo.QuadPart/1000);
        QueryPerformanceCounter(&largo);
        last = largo.QuadPart;
        return 0;
    }
    QueryPerformanceCounter(&largo);
    return (largo.QuadPart - last)/ticksPerMS;
}
Answer 0 (score: 1)
I am confused as to what your question is.
Your initial question suggests that running x iterations serially is much faster than running them in parallel, note: with the same target function, and you want to know why running that same target function in a multithreaded scenario is so much slower.
However, I now see that your example compares OpenCV's performance against some custom code. Is that what your question is about?
Relating to the question as I initially understood it, the answer is: no, running the target function serially is not significantly faster than running it in parallel. See the results and code below.
eight threads took 4104.38 ms
single thread took 7272.68 ms
four threads took 3687 ms
two threads took 4500.15 ms
(on an Apple MBA 2012 i5 & OpenCV 3)
#include <iostream>
#include <vector>
#include <chrono>
#include <thread>
#include <opencv2/opencv.hpp>

using namespace std;
using namespace std::chrono;
using namespace cv;

class benchmark {
    time_point<steady_clock> start = steady_clock::now();
    string title;
public:
    benchmark(const string& title) : title(title) {}
    ~benchmark() {
        auto diff = steady_clock::now() - start;
        cout << title << " took " << duration<double, milli>(diff).count() << " ms" << endl;
    }
};

template <typename F>
void repeat(unsigned n, F f) {
    while (n--) f();
}

int targetFunction(Mat img){
    cv::Mat img_bin;
    cv::Mat img_filtered;
    cv::Mat img_filtered2;
    cv::Mat img_res;
    int Crad_morf = 2;
    double Tthreshold = 20;

    cv::Mat element = cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(2*Crad_morf + 1, 2*Crad_morf + 1));
    img.convertTo(img, CV_32F);
    cv::blur(img, img_filtered, cv::Size(3, 3));
    cv::blur(img.mul(img), img_filtered2, cv::Size(3, 3));
    cv::sqrt(img_filtered2 - img_filtered.mul(img_filtered), img_res);
    cv::normalize(img_res, img_res, 0.0, 1.0, cv::NORM_MINMAX);
    img_res.convertTo(img_res, CV_8UC1, 255.0);
    cv::threshold(img_res, img_bin, Tthreshold, 255, cv::THRESH_BINARY);
    if (Crad_morf != 0){
        cv::dilate(img_bin, img_bin, element);
    }
    //imshow("WithOpencv", img_bin);
    return 0;
}

void runTargetFunction(int nIterations, int nThreads, const Mat& img) {
    int nIterationsPerThread = nIterations / nThreads;
    vector<thread> threads;
    auto targetFunctionFn = [&img]() {
        targetFunction(img);
    };

    setNumThreads(nThreads);
    repeat(nThreads, [&] {
        threads.push_back(thread([=]() {
            repeat(nIterationsPerThread, targetFunctionFn);
        }));
    });
    for (auto& thread : threads)
        thread.join();
}

int main(int argc, const char * argv[]) {
    string file = "../../opencv-test/Test.bmp";
    auto img = imread(file);
    const int nIterations = 64;

    // let's run using eight threads
    {
        benchmark b("eight threads");
        runTargetFunction(nIterations, 8, img);
    }

    // let's run using a single thread
    {
        benchmark b("single thread");
        runTargetFunction(nIterations, 1, img);
    }

    // let's run using four threads
    {
        benchmark b("four threads");
        runTargetFunction(nIterations, 4, img);
    }

    // let's run using two threads
    {
        benchmark b("two threads");
        runTargetFunction(nIterations, 2, img);
    }
    return 0;
}
Answer 1 (score: 1)
You are measuring three things:
- the mean time per processed image,
- the time each individual thread needs to finish its batch,
- the overall execution time.
You observe that the first of these drops from 47 ms to 22 ms when you increase the number of threads. That is good! At the same time, you realize that the time an individual thread needs rises from 35463 to roughly 68751 (whatever the unit). Finally, you realize that the overall execution time increases.
Regarding the second measurement: when the number of threads increases, the individual threads take longer to perform their operations. Two possible explanations: the threads are competing for a scarce shared resource, such as memory bandwidth or the shared cache; or OpenCV itself spawns worker threads for each call, so that your threads oversubscribe the available cores.
Now to the question of why the overall working time increases. The reason is simple: you are not only increasing the number of threads, you are also increasing the amount of work at the same rate. If your threads did not compete with each other at all and there were no overhead, N threads would need the same time to do N times the work. They do compete, so you notice the slowdown.