Euclidean distance matrix performance between two shapes

Asked: 2017-11-09 22:14:49

Tags: r rcpp point-clouds rcppparallel

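My problem is that I have to compute Euclidean distance matrices between shapes that range from 20,000 to 60,000 points, which produces 10-20 GB of data per matrix. I have to run each of these calculations thousands of times, so roughly 20 GB x 7,000 (each calculation uses a different point cloud). The shapes can be 2D or 3D.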

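EDITED (updated questions)

  1. Is there a more efficient way to calculate the forward and backward distances without using two separate nested loops?

    I know I could save the distance matrix and compute the minimum distances in each direction, but that causes a huge memory problem with large point clouds.

  2. Is there a way to speed up the calculation and/or clean up the code to trim off time?

  3. The irony is that I only need the matrix to calculate a very simple metric (the average Hausdorff distance), yet finding that metric requires the entire matrix.

    Data example, where each column represents a dimension of the shape and each row is one point of the shape: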

    first_configuration <- matrix(1:6,2,3)
    second_configuration <- matrix(6:11,2,3)
    colnames(first_configuration) <- c("x","y","z")
    colnames(second_configuration) <- c("x","y","z")
    

    This code calculates the Euclidean distances between the coordinates:

    m <- nrow(first_configuration)
    n <- nrow(second_configuration)
    
    D <- sqrt(pmax(matrix(rep(apply(first_configuration * first_configuration, 1, sum), n), m, n, byrow = F) + matrix(rep(apply(second_configuration * second_configuration, 1, sum), m), m, n, byrow = T) - 2 * first_configuration %*% t(second_configuration), 0))
    D
    

    Output:

             [,1]      [,2]
    [1,] 8.660254 10.392305
    [2,] 6.928203  8.660254
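
For reference, the one-liner for D appears to rely on the usual expansion of the squared distance between a row a_i of the first configuration and a row b_j of the second (the symbols a_i and b_j are introduced here only for illustration):

```latex
\[
D_{ij}^2 = \lVert a_i - b_j \rVert^2
         = \lVert a_i \rVert^2 + \lVert b_j \rVert^2 - 2\, a_i \cdot b_j
\]
```

As a check on the example data, a_1 = (1, 3, 5) and b_1 = (6, 8, 10) give 35 + 200 - 2 x 80 = 75, and sqrt(75) is about 8.660254, which matches D[1,1]. The pmax(..., 0) call guards against slightly negative values caused by rounding before the square root is taken.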
    

    EDIT: included the average Hausdorff code

    d1 <- mean(apply(D, 1, min))
    d2 <- mean(apply(D, 2, min))
    # mean(d1, d2) would pass d2 as the 'trim' argument, so combine the values first
    average_hausdorff <- mean(c(d1, d2))
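
Since the metric only needs the row minima and column minima of D, both directions can in principle be tracked in a single pass over all pairs, keeping two vectors of running minima instead of the full matrix. The following is a minimal sketch in plain C++ (no Rcpp), not the code from the question: the `PointCloud` struct, its row-major layout, and the function name `average_hausdorff` are assumptions made for illustration, and both clouds are assumed to be non-empty with the same number of dimensions.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical container: a point cloud stored row-major,
// i.e. coords holds n_points * dim values, one point after another.
struct PointCloud {
    std::vector<double> coords;
    std::size_t dim;
    std::size_t size() const { return coords.size() / dim; }
    const double* point(std::size_t i) const { return coords.data() + i * dim; }
};

// Average Hausdorff distance computed in one pass over all pairs.
// Only two vectors of running minima are kept (O(m + n) memory),
// never the full m x n distance matrix.
double average_hausdorff(const PointCloud& a, const PointCloud& b) {
    const double inf = std::numeric_limits<double>::infinity();
    std::vector<double> min_a(a.size(), inf);  // squared row minima
    std::vector<double> min_b(b.size(), inf);  // squared column minima

    for (std::size_t i = 0; i < a.size(); ++i) {
        const double* p = a.point(i);
        for (std::size_t j = 0; j < b.size(); ++j) {
            const double* q = b.point(j);
            double d2 = 0.0;
            for (std::size_t k = 0; k < a.dim; ++k) {
                const double diff = p[k] - q[k];
                d2 += diff * diff;
            }
            // Update both directions from the same pairwise distance.
            min_a[i] = std::min(min_a[i], d2);
            min_b[j] = std::min(min_b[j], d2);
        }
    }

    // Square roots are taken once per point, after the minima are known.
    double forward = 0.0, backward = 0.0;
    for (double v : min_a) forward += std::sqrt(v);
    for (double v : min_b) backward += std::sqrt(v);
    return (forward / a.size() + backward / b.size()) / 2.0;
}
```

For the 2 x 3 example above (rows (1, 3, 5), (2, 4, 6) and (6, 8, 10), (7, 9, 11)) this works out to about 7.794229. The trade-off is that updating `min_b[j]` from every row makes this loop harder to parallelise by rows than two independent directional passes.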
    

    EDIT (Rcpp solution): This is my attempt to implement it in Rcpp so that the matrix is never saved to memory. It works now, but it is very slow.

    Rcpp::sourceCpp(code='
    #include <Rcpp.h>
    #include <limits>
    using namespace Rcpp;
    
    // Euclidean distance between two points given as coordinate vectors
    // [[Rcpp::export]]
    double edist_rcpp(NumericVector x, NumericVector y){
        double d = sqrt( sum( pow(x - y, 2) ) );
        return d;
    }
    
    
    // Average Hausdorff distance without storing the distance matrix
    // [[Rcpp::export]]
    double avg_hausdorff_rcpp(NumericMatrix x, NumericMatrix y){
        int nrowx = x.nrow();
        int nrowy = y.nrow();
        double new_low_x = std::numeric_limits<double>::max();
        double new_low_y = std::numeric_limits<double>::max();
    
        double mean_forward = 0;
        double mean_backward = 0;
        double mean_hd; 
        double td; 
    
        // forward: minimum distance from each point of x to y
        for(int i = 0; i < nrowx; i++) {
            for(int j = 0; j < nrowy; j++) {
                NumericVector v1 = x.row(i);
                NumericVector v2 = y.row(j);
                td = edist_rcpp(v1, v2);
                if(td < new_low_x) {
                    new_low_x = td;
                }
            }
            mean_forward = mean_forward + new_low_x;
            new_low_x = std::numeric_limits<double>::max();
        }
    
        // backward: minimum distance from each point of y to x
        for(int i = 0; i < nrowy; i++) {
            for(int j = 0; j < nrowx; j++) {
                NumericVector v1 = y.row(i);
                NumericVector v2 = x.row(j);
                td = edist_rcpp(v1, v2);
                if(td < new_low_y) {
                    new_low_y = td;
                }
            }
            mean_backward = mean_backward + new_low_y;
            new_low_y = std::numeric_limits<double>::max();
        }
    
        // average of the two directional means
        mean_hd = (mean_forward / nrowx + mean_backward / nrowy) / 2;
    
        return mean_hd;
    }
    ')
    

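    EDIT (RcppParallel solution): Definitely faster than the serial Rcpp solution, and certainly faster than the R solution. If anyone has tips on how to improve my RcppParallel code to shave off some extra time, that would be much appreciated!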

    Rcpp::sourceCpp(code='
    #include <Rcpp.h>
    #include <RcppParallel.h>
    #include <limits>
    
    // [[Rcpp::depends(RcppParallel)]]
    struct minimum_euclidean_distances : public RcppParallel::Worker {
        //Input
        const RcppParallel::RMatrix<double> a;
        const RcppParallel::RMatrix<double> b;
    
        //Output: minimum distance from each row of a to any row of b
        RcppParallel::RVector<double> medm;
    
        minimum_euclidean_distances(const Rcpp::NumericMatrix a, const Rcpp::NumericMatrix b, Rcpp::NumericVector medm) : a(a), b(b), medm(medm) {}
    
        // Each thread handles the rows of a in [begin, end)
        void operator() (std::size_t begin, std::size_t end) {
            for(std::size_t i = begin; i < end; i++) {
                double new_low = std::numeric_limits<double>::max();
                for(std::size_t j = 0; j < b.nrow(); j++) {
                    double dsum = 0;
                    for(std::size_t z = 0; z < b.ncol(); z++) {
                        dsum = dsum + pow(a(i,z) - b(j,z), 2);
                    }
                    dsum = pow(dsum, 0.5);
                    if(dsum < new_low) {
                        new_low = dsum;
                    }
                }
                medm[i] = new_low;
            }
        }
    };
    
    
    // [[Rcpp::export]]
    double mean_directional_hausdorff_rcpp(Rcpp::NumericMatrix a, Rcpp::NumericMatrix b){
        Rcpp::NumericVector medm(a.nrow());
        minimum_euclidean_distances worker(a, b, medm);
        RcppParallel::parallelFor(0, a.nrow(), worker);    
        double results = Rcpp::sum(medm);
        results = results / a.nrow();
        return results;
    }
    
    
    // [[Rcpp::export]]
    double max_directional_hausdorff_rcpp(Rcpp::NumericMatrix a, Rcpp::NumericMatrix b){
        Rcpp::NumericVector medm(a.nrow());
        minimum_euclidean_distances worker(a, b, medm);
        RcppParallel::parallelFor(0, a.nrow(), worker);    
        double results = Rcpp::max(medm);
        return results;
    }
    ')
    

    Benchmarks using large point clouds of 37,775 and 36,659 points:

    # Rcpp serial solution
    system.time(avg_hausdorff_rcpp(ll,rr))
       user  system elapsed 
    409.143   0.000 409.105 
    
    # RcppParallel solution
    system.time(mean(c(mean_directional_hausdorff_rcpp(ll,rr), mean_directional_hausdorff_rcpp(rr,ll))))
       user  system elapsed 
    260.712   0.000  33.265 
    

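1 answer:

Answer 0 (score: 2)

I tried using JuliaCall to compute the average Hausdorff distance. JuliaCall embeds Julia in R.

I have only tried a serial solution in JuliaCall. It seems to be faster than both the RcppParallel and the serial Rcpp solutions in the question, but I do not have benchmark data for the parallel one. Since parallel computing is built into Julia, writing a parallel version in Julia should not be too difficult. I will update my answer once I have figured that out.

Here is the Julia file I wrote: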

# Calculate the min distance from the k-th point in as to the points in bs
function min_dist_from(k, as, bs)
    n = size(bs, 1)
    p = size(bs, 2)
    dist = Inf  # smallest squared distance found so far
    for i in 1:n
        r = 0.0
        for j in 1:p
            r += (as[k, j] - bs[i, j]) ^ 2
            ## if r is already greater than the upper bound, 
            ## then there is no need to continue doing the calculation
            if r > dist
                break
            end
        end
        if r < dist
            dist = r
        end
    end
    sqrt(dist)
end

function avg_min_dist_from(as, bs)
    distsum = 0.0
    n1 = size(as, 1)
    for k in 1:n1
        distsum += min_dist_from(k, as, bs)
    end
    distsum / n1
end

function hausdorff_avg_dist(as, bs)
    (avg_min_dist_from(as, bs) + avg_min_dist_from(bs, as)) / 2
end
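
The idea noted in the comments of the Julia file (compare squared distances, stop accumulating once the partial sum can no longer beat the current minimum, and take one square root per point) is not specific to Julia. Below is a minimal sketch of the same technique in plain C++, which could be adapted to the Rcpp code in the question. The function names `mean_min_distance` and `hausdorff_avg_dist_sketch`, and the row-major `std::vector<double>` layout, are assumptions made for illustration; both clouds are assumed to be non-empty.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Mean of the minimum distances from each point of `a` to the cloud `b`.
// Both clouds are row-major with `dim` coordinates per point (assumed layout).
double mean_min_distance(const std::vector<double>& a,
                         const std::vector<double>& b,
                         std::size_t dim) {
    const std::size_t na = a.size() / dim;
    const std::size_t nb = b.size() / dim;
    double total = 0.0;
    for (std::size_t i = 0; i < na; ++i) {
        // Smallest squared distance found so far for point i.
        double best = std::numeric_limits<double>::infinity();
        for (std::size_t j = 0; j < nb; ++j) {
            double r = 0.0;
            for (std::size_t k = 0; k < dim; ++k) {
                const double diff = a[i * dim + k] - b[j * dim + k];
                r += diff * diff;
                if (r >= best) break;  // cannot beat the current minimum
            }
            if (r < best) best = r;
        }
        total += std::sqrt(best);  // one square root per point of `a`
    }
    return total / static_cast<double>(na);
}

// Average of the two directional means.
double hausdorff_avg_dist_sketch(const std::vector<double>& a,
                                 const std::vector<double>& b,
                                 std::size_t dim) {
    return (mean_min_distance(a, b, dim) + mean_min_distance(b, a, dim)) / 2.0;
}
```

With only two or three dimensions the early exit saves little per pair; most of the gain likely comes from avoiding a square root (or `pow`) call for every pair of points.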

Here is the R code that uses the Julia function:

first_configuration <- matrix(1:6,2,3)
second_configuration <- matrix(6:11,2,3)
colnames(first_configuration) <- c("x","y","z")
colnames(second_configuration) <- c("x","y","z")

m <- nrow(first_configuration)
n <- nrow(second_configuration)

D <- sqrt(matrix(rep(apply(first_configuration * first_configuration, 1, sum), n), m, n, byrow = F) + matrix(rep(apply(second_configuration * second_configuration, 1, sum), m), m, n, byrow = T) - 2 * first_configuration %*% t(second_configuration))
D

d1 <- mean(apply(D, 1, min))
d2 <- mean(apply(D, 2, min))
## mean(d1, d2) would pass d2 as the 'trim' argument, so combine the values first
average_hausdorff <- mean(c(d1, d2))

library(JuliaCall)
## the first time of julia_setup could be quite time consuming
julia_setup()
## source the julia file which has our hausdorff_avg_dist function
julia_source("hausdorff.jl")

## check if the julia function is correct with the example
average_hausdorff_julia <- julia_call("hausdorff_avg_dist",
                                      first_configuration,
                                      second_configuration)
## generate some large random point clouds
n1 <- 37775
n2 <- 36659
as <- matrix(rnorm(n1 * 3), n1, 3)
bs <- matrix(rnorm(n2 * 3), n2, 3)

system.time(julia_call("hausdorff_avg_dist", as, bs))

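This takes less than 20 seconds on my laptop, and note that this is the performance of the serial version in JuliaCall! I used the same data to test the serial Rcpp solution from the question, and it took more than 10 minutes to run. I do not have RcppParallel on my laptop, so I could not try that one. As I said, Julia has built-in parallel computing capabilities.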