Question

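I have images in different rotational orientations, and I want to find the correct rotation angle by maximizing the cross-correlation. Because my image set is large, I want to speed up normxcorr2 by using a MEX implementation, normxcorr2_mex.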

I compute matched_angle with the following code:

function [matched_angle, max_corr_vecq, matched_angle_mex, max_corr_vecq_mex] = get_correct_rotation(moving, fixed)

    for theta = 360:-10:10       
        rotated = imrotate(moving, theta,'bicubic','crop');

        corr2d_map = normxcorr2(double(rotated), double(fixed));
        corr2d_map_mex = normxcorr2_mex(double(rotated), double(fixed),'full');

        [max_corr_vec(theta/10), ~] = max(corr2d_map(:));
        [max_corr_vec_mex(theta/10), ~] = max(corr2d_map_mex(:));
    end

    % Interpolate correlation max vector for half degree resolution
    max_corr_vecq = interp1(10:10:360, max_corr_vec, 0.5:0.5:360, 'spline');
    [~, matched_angle] = max(max_corr_vecq);
    matched_angle = 0.5 * matched_angle;

    % Interpolate correlation max vector for half degree resolution
    max_corr_vecq_mex = interp1(10:10:360, max_corr_vec_mex, 0.5:0.5:360, 'spline');
    [~, matched_angle_mex] = max(max_corr_vecq_mex);
    matched_angle_mex = 0.5 * matched_angle_mex;
end

However, for the same two images (Moving Template Image & Fixed Reference Image), normxcorr2 and normxcorr2_mex give completely different results.

plot(0.5:0.5:360, max_corr_vecq, 'linewidth',2); hold on;
plot(0.5:0.5:360, max_corr_vecq_mex, 'linewidth',2);
legend({'MATLAB Built-in', 'MEX'});
set(gca, 'FontSize', 14, 'FontWeight', 'bold');

See the Result Plot.

Does anyone know what is going on? I could not find anything about the accuracy of this MEX file. According to its author:

The following are equivalent:
  result = normxcorr2_mex(template, image, 'full');
and
  result = normxcorr2(template, image);
except that normxcorr2_mex has 0's in the 'invalid' region along the boundary.

That should not be a problem in my case, because I only check the maximum correlation value.
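The relationship between the 'full' output and the 'valid' region mentioned in the quoted note can be written as simple index arithmetic. The following is a minimal sketch, assuming the usual 'full' output size of size(A) + size(TEMPLATE) - 1; the `Extent` type and the helper names are hypothetical and belong to neither library.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type: a (rows, cols) pair. Not part of MATLAB or the MEX library.
struct Extent { std::size_t rows, cols; };

// Size of the 'full' correlation output: size(A) + size(TEMPLATE) - 1.
Extent fullSize(Extent templ, Extent img) {
    return { img.rows + templ.rows - 1, img.cols + templ.cols - 1 };
}

// Size of the 'valid' region, where the template lies entirely inside the image.
Extent validSize(Extent templ, Extent img) {
    assert(templ.rows >= 1 && templ.cols >= 1);
    assert(templ.rows <= img.rows && templ.cols <= img.cols);
    return { img.rows - templ.rows + 1, img.cols - templ.cols + 1 };
}

// Zero-based (row, col) offset of the first 'valid' element inside the 'full' output.
Extent validOffset(Extent templ) {
    assert(templ.rows >= 1 && templ.cols >= 1);
    return { templ.rows - 1, templ.cols - 1 };
}
```

For a 3x4 template and a 10x12 image this gives a 12x15 'full' output whose 8x9 'valid' region starts at zero-based offset (2, 3). Restricting the search for the maximum to that region is one way to rule out border effects when comparing the two functions.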

Answer 1

Since my previous answer, I have found the normxcorr2_mex library to be consistently slower than MATLAB for my use cases, and incorrect in all of them.

Since I genuinely needed a C++ implementation (one that could be verified against MATLAB), I wrote my own. The code is listed here:

/* normxcorr2_mex.cpp   
 *
 *  A MATLAB-mex wrapper around a C/C++ implementation of the Normalised Cross Correlation algorithm described 
 * by @dafnahaktana in https://stackoverflow.com/questions/44591037/speed-up-calculation-of-maximum-of-normxcorr2.
 *
 *  This module uses the 'integral image' data structure described in the posted MATLAB/Octave code (based upon the
 * original Industrial Light & Magic paper at http://scribblethink.org/Work/nvisionInterface/nip.pdf), but replaces 
 * the "naive" correlation step with a Fourier transform implementation for larger template sizes.
 *
 *  Daniel Eaton released a MATLAB-mex library (http://www.cs.ubc.ca/research/deaton/remarks_ncc.html) with the 
 * same function name as this one in 2013.  Indeed, I acknowledge [and flatteringly plagiarise] his interface and 
 * naming convention.  Unfortunately, I was unable to duplicate the speed (wrt MATLAB's normxcorr2) improvements he
 * claimed with the image sizes I required.  Curiously, I also observed different results using his library compared
 * with MATLAB's built-in function (despite being claimed to be identical).  This was also noted by others here:
 * https://stackoverflow.com/questions/48641648/different-results-of-normxcorr2-and-normxcorr2-mex.  This module
 * does match normxcorr2 on both the MATLAB R2016b and R2017a/b versions tested, using the (accompanying) test script.
 * Like Daniel's module, however, this function returns only the 'valid' region of correlation values, i.e. it 
 * doesn't pad the output array to match the input image size.  
 *
 *  This function is called via:
 *                                 NCC = normxcorr2_mex (TEMPLATE, A);
 *  Where:  
 *    TEMPLATE - The (double precision) matrix to correlate with A. 
 *    A        - (Double precision) input matrix for correlation with the TEMPLATE.  Note size(A) > size(TEMPLATE).
 *    NCC      - is the computed normalised cross correlation coefficients of the matrices TEMPLATE and A.
 *               The size of the correlation coefficient matrix is given as:
 * 
 *                              size(NCC) = [(Ar - TEMPLATEr + 1), (Ac - TEMPLATEc + 1)]  ; where:
 * 
 *                Ar, Ac and TEMPLATEr, TEMPLATEc are the number of (rows, cols) of A and TEMPLATE respectively.
 *
 *  This module requires the Eigen C++ library (http://eigen.tuxfamily.org/index.php?title=Main_Page) for compilation
 * and may be compiled within MATLAB via:
 *
 *                                  mex -I'[Path to]\eigen-3.3.5' normxcorr2_mex.cpp
 *
 *  Since NCC is such a computationally intensive task, this module may be linked against the openMP library to exploit a
 * pool of worker threads and distribute some of the embarrassingly parallel operations within across a number of CPU cores.
 * Only rudimentary use is made of the library, but the following compilation option provides speedups generally 
 * exceeding 50%:
 *
 *    mex -I'[Path to]\eigen-3.3.5' CXXFLAGS="$CXXFLAGS -fopenmp" LDFLAGS="$LDFLAGS -fopenmp" normxcorr2_mex.cpp
 *
 *
 *  You are free to do with this code as you wish.  For this reason, it is released under the UNLICENSE model:
 * 
 *                   This is free and unencumbered software released into the public domain.
 *                   
 *                   Anyone is free to copy, modify, publish, use, compile, sell, or
 *                   distribute this software, either in source code form or as a compiled
 *                   binary, for any purpose, commercial or non-commercial, and by any
 *                   means.
 *                   
 *                   In jurisdictions that recognize copyright laws, the author or authors
 *                   of this software dedicate any and all copyright interest in the
 *                   software to the public domain. We make this dedication for the benefit
 *                   of the public at large and to the detriment of our heirs and
 *                   successors. We intend this dedication to be an overt act of
 *                   relinquishment in perpetuity of all present and future rights to this
 *                   software under copyright law.
 *                   
 *                   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
 *                   EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
 *                   MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
 *                   IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR
 *                   OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
 *                   ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
 *                   OTHER DEALINGS IN THE SOFTWARE.
 *                   
 *                   For more information, please refer to <http://unlicense.org/>
 */

// If we're compiled/linked with openMP, turn off Eigen's parallelisation.
// These macros only take effect when defined before any Eigen header is included.

#ifdef _OPENMP
   #define EIGEN_DONT_PARALLELIZE
   #define EIGEN_NO_DEBUG
#endif

#include "mex.h"
#include <cstdint>
#include <cstring>
#include <algorithm>
#include <limits>
#include <vector>
#include <cmath>
#include <complex>
#include <iostream>
#include <Eigen/Core>
#include <unsupported/Eigen/FFT>

using namespace Eigen;



// For very small input templates, performing the raw 2D correlation in the spatial domain may be faster than
// the transform domain (due to the overhead that the latter involves).  The decision which approach to use is 
// made at runtime by comparing the size (=rows*cols) of the input TEMPLATE matrix with the following constant.
// Feel free to experiment with this value in your own application!

#define TEMPLATE_SIZE_THRESHOLD 401


// 2D Cross-correlation performed via the "naive approach" (laborious spatial domain convolution).

ArrayXXd spatialXcorr (const Ref<const ArrayXXd>& img, const Ref<const ArrayXXd>& templ)
  {
  int32_t r, c;
  ArrayXXd xcorr2(img.rows()-templ.rows()+1, img.cols()-templ.cols()+1);

  for (r=0; r<(img.rows()-templ.rows()+1); r++)
     for (c=0; c<(img.cols()-templ.cols()+1); c++)
         xcorr2(r,c) = (templ*img.block(r,c,templ.rows(),templ.cols())).sum();

  return(xcorr2);
  }

// 2D Cross-correlation performed via Fourier transform 

ArrayXXd transformXcorr (const Ref<const ArrayXXd>& img, const Ref<const ArrayXXd>& templ)
  {
  ArrayXXd xcorr2(img.rows()-templ.rows()+1, img.cols()-templ.cols()+1);

  // Copy the input arrays into a matrix the next power-of-2 up in size
  int32_t nextPow2r = (int32_t)(pow(2.0, round(0.5+log((double)(img.rows()))/log(2.0))));
  int32_t nextPow2c = (int32_t)(pow(2.0, round(0.5+log((double)(img.cols()))/log(2.0))));
  MatrixXd imgPwr2   = MatrixXd::Zero(nextPow2r, nextPow2c);
  MatrixXd templPwr2 = MatrixXd::Zero(nextPow2r, nextPow2c);

  // A -> copied to top-left corner. 
  // TEMPLATE is rotated 180 degrees to account for rotation/flip performed during convolution.
  imgPwr2.block(0, 0, img.rows(), img.cols()) = img.matrix();
  templPwr2.block(0, 0, templ.rows(), templ.cols()) = (templ.matrix().colwise().reverse()).rowwise().reverse();

  // Perform 2D FFTs via sequential 1D transforms (Rows first, then columns)
  MatrixXcd imgFT(nextPow2r, nextPow2c), templFT(nextPow2r, nextPow2c), prodFT(nextPow2r, nextPow2c);

  // Rows first...
  #ifdef _OPENMP                                          // If using parallel threads, then each thread
                                                           // must have its own copy of the eigenFFT plan.
     #pragma omp parallel for schedule(dynamic)            
     for (int32_t r=0; r<nextPow2r; r++) {                 // This is unnecessary for single-threaded execution as
                                                           // each evaluation of the FFT is identical in length 
        VectorXcd rowVec(nextPow2c);                       // and data type.
        FFT<double> eigenFFT;
                                                           // The creation of the plan is computationally expensive
  #else                                                    // and so we do it once, outside of the loop in the single
                                                           // threaded case (to reduce the run time by a factor > 2).
     VectorXcd rowVec(nextPow2c);
     FFT<double> eigenFFT;
     for (int32_t r=0; r<nextPow2r; r++) {

  #endif     
        eigenFFT.fwd(rowVec, imgPwr2.row(r));
        imgFT.row(r) = rowVec;
        eigenFFT.fwd(rowVec, templPwr2.row(r));
        templFT.row(r) = rowVec; 
        }

  // ...then columns.
  #ifdef _OPENMP

     #pragma omp parallel for schedule(dynamic)
     for (int32_t c=0; c<nextPow2c; c++) {

        VectorXcd colVec(nextPow2r);
        FFT<double> eigenFFT;

  #else 

     VectorXcd colVec(nextPow2r);
     for (int32_t c=0; c<nextPow2c; c++) {

  #endif     
        eigenFFT.fwd(colVec, imgFT.col(c));
        imgFT.col(c) = colVec;
        eigenFFT.fwd(colVec, templFT.col(c));
        templFT.col(c) = colVec;
        }  

  // Multiply complex Fourier-domain matrices
  prodFT = imgFT.cwiseProduct(templFT);

  // Transform (complex) Fourier product back -> (real) spatial domain (2D IFFT). 
  // Reuse templPwr2 as the output variable for efficiency.

  // Rows first (again)...
  #ifdef _OPENMP
     #pragma omp parallel for schedule(dynamic)
     for (int32_t r=0; r<nextPow2r; r++) {

        FFT<double> eigenFFT;  
        VectorXcd rowVec(nextPow2c);

  #else
     for (int32_t r=0; r<nextPow2r; r++) {
  #endif
        eigenFFT.inv(rowVec, prodFT.row(r));
        prodFT.row(r) = rowVec;
        }

  // ...and lastly, columns.
  #ifdef _OPENMP
     #pragma omp parallel for schedule(dynamic)
     for (int32_t c=0; c<nextPow2c; c++) {

        FFT<double> eigenFFT;  
        VectorXcd colVec(nextPow2r);

  #else
     for (int32_t c=0; c<nextPow2c; c++) {
  #endif    
        eigenFFT.inv(colVec, prodFT.col(c));
        templPwr2.col(c) = colVec.real();
        }

  // Extract the valid region of correlation coefficients
  xcorr2 = templPwr2.array().block(templ.rows()-1, templ.cols()-1, img.rows()-templ.rows()+1, img.cols()-templ.cols()+1);
  return(xcorr2);
  }




// Normalised cross-correlation top-level function

ArrayXXd normxcorr2 (const Ref<const ArrayXXd>& templ, const Ref<const ArrayXXd>& img)
  {
  ArrayXXd templZMean(templ.rows(), templ.cols()); 
  ArrayXXd scalingCoeffs(img.rows() - templ.rows() +1, img.cols() - templ.cols() +1);
  ArrayXXd normxcorr(img.rows()-templ.rows()+1, img.cols()-templ.cols()+1);
  ArrayXXd integralImg(img.rows()+2, img.cols()+2), integralImgSq(img.rows()+2, img.cols()+2);
  ArrayXXd windowMeanA = ArrayXXd::Zero(img.rows() - templ.rows() +1, img.cols() - templ.cols() +1);
  ArrayXXd windowMeanASq = ArrayXXd::Zero(img.rows() - templ.rows() +1, img.cols() - templ.cols() +1);

  // Calculate the standard deviation of the TEMPLATE
  double templSizeRcp = 1.0/(double)(templ.rows()*templ.cols());
  templZMean = templ-templ.mean();
  double templateStd = sqrt((templZMean.pow(2)).sum()*templSizeRcp);

  // Compute mean and standard deviation of input matrix A over the template window size. Firstly...
  // Construct array for computing the integral image(s) + zero pad the edges to avoid boundary issues
  integralImg.block(0, 0, 1, integralImg.cols()) = ArrayXXd::Zero(1, integralImg.cols());
  integralImg.block(0, 0, integralImg.rows(), 1) = ArrayXXd::Zero(integralImg.rows(), 1);
  integralImg.block(0, integralImg.cols()-1, integralImg.rows(), 1) = ArrayXXd::Zero(integralImg.rows(), 1);
  integralImg.block(integralImg.rows()-1, 0, 1, integralImg.cols()) = ArrayXXd::Zero(1, integralImg.cols());

  integralImgSq.block(0, 0, 1, integralImgSq.cols()) = ArrayXXd::Zero(1, integralImgSq.cols());
  integralImgSq.block(0, 0, integralImgSq.rows(), 1) = ArrayXXd::Zero(integralImgSq.rows(), 1);
  integralImgSq.block(0, integralImgSq.cols()-1, integralImgSq.rows(), 1) = ArrayXXd::Zero(integralImgSq.rows(), 1);
  integralImgSq.block(integralImgSq.rows()-1, 0, 1, integralImgSq.cols()) = ArrayXXd::Zero(1, integralImgSq.cols());

  // Calculate cumulative sum.  Along the length of each row first...
  for (int32_t r=0; r<img.rows(); r++) {
     double sum = 0.0;
     double sumSq = 0.0;
     for (int32_t c=0; c<img.cols(); c++) {
        sum += img(r,c);
        sumSq += (img(r,c)*img(r,c));
        integralImg(r+1, c+1) = sum;
        integralImgSq(r+1, c+1) = sumSq;
        }
     }
  // ...and then down each column.
  for (int32_t c=1; c<=img.cols(); c++) {
     double sum = 0.0;
     double sumSq = 0.0;
     for (int32_t r=1; r<=img.rows(); r++) {
        sum += integralImg(r,c);
        sumSq += integralImgSq(r,c);
        integralImg(r,c) = sum;
        integralImgSq(r,c) = sumSq;
        }
     }

  // Determine start/finish indexes for the boundaries of the summed area
  int32_t rStart = (int32_t)(0.5 + templ.rows()/2.0);
  int32_t rEnd = img.rows() - rStart + (templ.rows() % 2);
  int32_t cStart = (int32_t)(0.5 + templ.cols()/2.0);
  int32_t cEnd = img.cols() - cStart + (templ.cols() % 2);

  // Evaluate the sum of intensities
  windowMeanA += ( integralImg.block(templ.rows(), templ.cols(), rEnd-rStart+1, cEnd-cStart+1) \
               - integralImg.block(templ.rows(), 0, rEnd-rStart+1, cEnd-cStart+1) \
               - integralImg.block(0, templ.cols(), rEnd-rStart+1, cEnd-cStart+1) \
               + integralImg.block(0, 0, rEnd-rStart+1, cEnd-cStart+1) )*templSizeRcp;

  // Evaluate the sum of intensities (squared)
  windowMeanASq += ( integralImgSq.block(templ.rows(), templ.cols(), rEnd-rStart+1, cEnd-cStart+1) \
                - integralImgSq.block(templ.rows(), 0, rEnd-rStart+1, cEnd-cStart+1) \
                - integralImgSq.block(0, templ.cols(), rEnd-rStart+1, cEnd-cStart+1) \
                + integralImgSq.block(0, 0, rEnd-rStart+1, cEnd-cStart+1) )*templSizeRcp;

  // Calculate the standard deviation (squared) of A over the template size window
  // Standard deviation = sqrt(windowMeanASq - windowMeanA.square());
  scalingCoeffs = (windowMeanASq - windowMeanA.square());

  // Amalgamate the element-by-element test/square root with other coefficients scaling for efficiency
  for (int32_t r=0; r<scalingCoeffs.rows(); r++) 
     for (int32_t c=0; c<scalingCoeffs.cols(); c++) 
        if (scalingCoeffs(r,c) > 0)
           scalingCoeffs(r,c) = templSizeRcp/(templateStd*sqrt(scalingCoeffs(r,c)));
        else
           scalingCoeffs(r,c) = std::numeric_limits<double>::quiet_NaN();

  // Decide which 2D correlation approach to use (transform or spatial domain) 
  if ((templ.rows()*templ.cols()) > TEMPLATE_SIZE_THRESHOLD)
     normxcorr = scalingCoeffs*transformXcorr(img, templZMean);
  else 
     normxcorr = scalingCoeffs*spatialXcorr(img, templZMean);

  return(normxcorr);
  }



// ******************** Minimal MEX wrapper ********************

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
  {
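  // Check the number of arguments 
  if (nrhs != 2)
     mexErrMsgIdAndTxt("MATLAB:normxcorr2_mex", "Usage: NCC = normxcorr2_mex (TEMPLATE, A);"); 

  // mxGetPr() is only meaningful for real, double-precision data, so reject anything else
  if (!mxIsDouble(prhs[0]) || mxIsComplex(prhs[0]) || !mxIsDouble(prhs[1]) || mxIsComplex(prhs[1]))
     mexErrMsgIdAndTxt("MATLAB:normxcorr2_mex", "TEMPLATE and A must be real double-precision matrices."); 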

  // Verify input array sizes
  size_t rowsTempl = mxGetM(prhs[0]);
  size_t colsTempl = mxGetN(prhs[0]);
  size_t rowsA = mxGetM(prhs[1]);
  size_t colsA = mxGetN(prhs[1]);

  if ((rowsA <= rowsTempl) || (colsA <= colsTempl))
     mexErrMsgIdAndTxt("MATLAB:normxcorr2_mex", "Size of TEMPLATE must be less than input matrix A."); 

  #ifdef _OPENMP
     // Required for Eigen versions < 3.3 and for *some* non-compliant C++11 compilers. 
     // (Warn Eigen our application might be calling it from multiple threads). 
     initParallel();
  #endif

  // Perform correlation
  ArrayXXd xcorr(rowsA-rowsTempl+1, colsA-colsTempl+1);
  xcorr = normxcorr2 (Map<ArrayXXd>(mxGetPr(prhs[0]), rowsTempl, colsTempl), Map<ArrayXXd>(mxGetPr(prhs[1]), rowsA, colsA));

  // Return data to MATLAB
  plhs[0] = mxCreateDoubleMatrix(rowsA-rowsTempl+1, colsA-colsTempl+1, mxREAL);
  Map<ArrayXXd> (mxGetPr(plhs[0]), xcorr.rows(), xcorr.cols()) = xcorr;

  return;
  }

As described in the header comment, save the file as normxcorr2_mex.cpp and compile it with:

mex -I'[Path to]\eigen-3.3.5' normxcorr2_mex.cpp
for single-threaded operation, or
mex -I'[Path to]\eigen-3.3.5' CXXFLAGS="$CXXFLAGS -fopenmp" LDFLAGS="$LDFLAGS -fopenmp" normxcorr2_mex.cpp
for multi-threaded OpenMP support.
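The header comment refers to the 'integral image' data structure, which the listing uses to obtain the mean and mean-square of A under every template window from four table look-ups. The following is a minimal standalone sketch of that idea using only the standard library. It is not the code from the listing: the `SummedArea` name, the row-major storage, and the single leading row and column of zeros are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Summed-area table with one leading row and column of zeros, so that
// at(r, c) holds the sum of img over rows [0, r) and columns [0, c).
// Illustrative type only; storage is row-major.
struct SummedArea {
    std::size_t rows, cols;   // dimensions of the source image
    std::vector<double> s;    // (rows + 1) x (cols + 1) table

    SummedArea(const std::vector<double>& img, std::size_t r, std::size_t c)
        : rows(r), cols(c), s((r + 1) * (c + 1), 0.0) {
        assert(img.size() == r * c);
        for (std::size_t i = 0; i < r; ++i) {
            double rowSum = 0.0;  // running sum along the current row
            for (std::size_t j = 0; j < c; ++j) {
                rowSum += img[i * c + j];
                // Everything above (previous table row) plus the current row so far.
                at(i + 1, j + 1) = at(i, j + 1) + rowSum;
            }
        }
    }

    double& at(std::size_t r, std::size_t c) { return s[r * (cols + 1) + c]; }
    double at(std::size_t r, std::size_t c) const { return s[r * (cols + 1) + c]; }

    // Sum of the h x w window whose top-left corner is (r0, c0): four look-ups,
    // independent of the window size.
    double windowSum(std::size_t r0, std::size_t c0, std::size_t h, std::size_t w) const {
        assert(r0 + h <= rows && c0 + w <= cols);
        return at(r0 + h, c0 + w) - at(r0, c0 + w) - at(r0 + h, c0) + at(r0, c0);
    }
};
```

Building a second table from the squared pixel values gives the window sum of squares in the same way, and dividing both by the number of template elements yields the two quantities from which the window variance is formed.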

The following MATLAB script can be used to verify the timing and correct operation of the code:

% testHarness.m  
%
% Verify the results of the compiled normxcorr2_mex() function against
% MATLAB's built-in normxcorr2() function.  This takes aaaaages to run!

%% Simulation/comparison parameters

nRunsA = 50;              % Number of trials for accuracy comparison
nRunsT = 30;              % Number of repetitions for execution time determination
nStepsT = 50;             % Number of input matrix size steps to take in execution time measurement 

maxImSize = [1343 1745];  % (Deliberately non-round-number) maximum image size for tests 
maxTemplSize = [248 379]; % Maximum image template size

%% Accuracy comparison

sumSqErr = zeros(1, nRunsA);
fprintf(2, 'Accuracy comparison\n');

for nRun = 1:nRunsA

    fprintf('Run %d (of %d)\n', nRun, nRunsA);

    % Create input images/templates of random content and size
    randSizeScale = 0.02 + 0.98*rand(1, 2);
    img = rand(round(maxImSize.*randSizeScale));
    templ = rand(round(maxTemplSize.*randSizeScale));

    % MATLAB's built-in function
    resultMatPadded = normxcorr2(templ, img);
    % Remove unwanted padding
    [rTempl, cTempl] = size(templ);
    [rImg, cImg] = size(img);
    resultMat = resultMatPadded(rTempl:rImg, cTempl:cImg);

    % MEX function
    resultMex = normxcorr2_mex(templ, img);

    % Compare results
    sumSqErr(nRun) = sum(sum( (resultMat-resultMex).^2 ));

end

figure;
plot(sumSqErr);
title('Accuracy comparison between MATLAB and MEX normxcorr2');
xlabel('Run #');
ylabel('\Sigma |MATLAB-MEX|^2');
grid on;

%% Timing comparison

avMatT = zeros(1, nStepsT);
avMexT = zeros(1, nStepsT);
fprintf(2, 'Timing comparison\n');

for stp = 1:nStepsT

    fprintf('Run %d (of %d)\n', stp, nStepsT);

    % Create input images/templates of random content and progressively larger size
    img = rand(round(maxImSize*stp/nStepsT));
    templ = rand(round(maxTemplSize.*stp/nStepsT));

    % MATLAB's function
    tStart = tic;
    for exec = 1:nRunsT
        dummy =  normxcorr2(templ, img);
    end
    avMatT(stp) = toc(tStart)/nRunsT;

    % MEX function
    tStart = tic;
    for exec = 1:nRunsT
        dummy =  normxcorr2_mex(templ, img);
    end
    avMexT(stp) = toc(tStart)/nRunsT;

end

figure;
plot((1:nStepsT)/(0.01*nStepsT), avMatT, 'rx-', (1:nStepsT)/(0.01*nStepsT), avMexT, 'bo-');
title('Execution time comparison between MATLAB and MEX normxcorr2');
xlabel('Input array size [% of maximum]');
ylabel('Evaluation time [s]');
legend('MATLAB', 'MEX');
grid on;

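The C++/MEX implementation above and MATLAB's built-in normxcorr2 function agree to a level close to the limits of the underlying double-precision data type. As it turns out, the latest MATLAB normxcorr2 is hard to beat for speed, even with OpenMP enabled, as the linked timing results from my i7-980 CPU show.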
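For an independent cross-check that needs neither Eigen nor MATLAB, the quantity both implementations are meant to compute can also be written as a brute-force loop straight from the textbook definition. The sketch below is far too slow for real images and is not how either implementation works internally. The function name `nccValid` and the row-major layout are assumptions made for illustration (MATLAB itself stores matrices column-major), and returning NaN for flat windows simply follows the convention of the listing above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Brute-force normalised cross-correlation over the 'valid' region only.
// img is ir x ic, tpl is tr x tc, both row-major. Illustrative sketch.
std::vector<double> nccValid(const std::vector<double>& img, std::size_t ir, std::size_t ic,
                             const std::vector<double>& tpl, std::size_t tr, std::size_t tc) {
    assert(img.size() == ir * ic && tpl.size() == tr * tc);
    assert(tr >= 1 && tc >= 1 && tr <= ir && tc <= ic);
    const double n = static_cast<double>(tr * tc);

    // Template mean and sum of squared deviations from it.
    double tMean = 0.0;
    for (double v : tpl) tMean += v;
    tMean /= n;
    double tSS = 0.0;
    for (double v : tpl) tSS += (v - tMean) * (v - tMean);

    const std::size_t outRows = ir - tr + 1, outCols = ic - tc + 1;
    std::vector<double> out(outRows * outCols);
    for (std::size_t r = 0; r < outRows; ++r) {
        for (std::size_t c = 0; c < outCols; ++c) {
            // Mean of the image window under the template.
            double wMean = 0.0;
            for (std::size_t i = 0; i < tr; ++i)
                for (std::size_t j = 0; j < tc; ++j)
                    wMean += img[(r + i) * ic + (c + j)];
            wMean /= n;

            // Zero-mean cross term and window sum of squared deviations.
            double num = 0.0, wSS = 0.0;
            for (std::size_t i = 0; i < tr; ++i)
                for (std::size_t j = 0; j < tc; ++j) {
                    const double d = img[(r + i) * ic + (c + j)] - wMean;
                    num += d * (tpl[i * tc + j] - tMean);
                    wSS += d * d;
                }

            const double denom = std::sqrt(wSS * tSS);
            out[r * outCols + c] = denom > 0.0 ? num / denom
                                               : std::numeric_limits<double>::quiet_NaN();
        }
    }
    return out;
}
```

A quick sanity check is to cut the template directly out of the image: the coefficient at the cut position should then be 1 to within rounding error, and every other finite coefficient should lie in [-1, 1].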

Answer 2

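Unfortunately, I don't have an explanation, but I can confirm that the problem appears to lie with the library rather than with your implementation. I had trouble building the normxcorr2_mex library under Windows with the MinGW64 compiler, which made me worry about possible variation between builds. As the plot included here shows, builds under both Debian Linux and Windows exhibit the same (incorrect) behaviour when compared with MATLAB's built-in normxcorr2 function.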

To help others build the library under Windows: I had to coerce the C++ compiler with the following command line:

mex -O CXXFLAGS="$CXXFLAGS -std=c++03 -fpermissive" normxcorr2_mex.cpp cv_src/*.cpp

Incidentally, I also found the MEX implementation to be an order of magnitude slower than MATLAB!

Different results of normxcorr2 and normxcorr2_mex
