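Canberra distance - inconsistent results

Time: 2016-08-11 11:07:50

Tags: r distance

I am trying to understand how the Canberra distance is computed. I wrote my own simple canberra.distance function, but its results do not agree with those of the dist function. I added the option na.rm = TRUE to my function so that the sum can still be computed when a denominator is zero. From ?dist I understand that a similar approach is used there: Terms with zero numerator and denominator are omitted from the sum and treated as if the values were missing.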

canberra.distance <- function(a, b){
  sum(abs(a - b) / (abs(a) + abs(b)), na.rm = TRUE)
}

a <- c(0, 1, 0, 0, 1)
b <- c(1, 0, 1, 0, 1)
canberra.distance(a, b)
# 3 (the result that I expected)
dist(rbind(a, b), method = "canberra")
# 3.75

a <- c(0, 1, 0, 0)
b <- c(1, 0, 1, 0)
canberra.distance(a, b)
# 3 (the result that I expected)
dist(rbind(a, b), method = "canberra")
# 4

a <- c(0, 1, 0)
b <- c(1, 0, 1)
canberra.distance(a, b)
# 3
dist(rbind(a, b), method = "canberra")
# 3 (now the results are the same)

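There seems to be a problem with the 0-0 and 1-1 pairs. In the first case (0-0), both the numerator and the denominator are zero, so the pair should be omitted. In the second case (1-1), the numerator is 0 but the denominator is not, so the term is 0 and the sum should not change.

What am I missing here?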
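
To make the two cases concrete, here is a small sketch that prints the individual terms for the first pair of vectors above. The variable names terms, used, n_used and total are only illustrative.

```r
# Per-term contributions for the first pair of vectors in the question
a <- c(0, 1, 0, 0, 1)
b <- c(1, 0, 1, 0, 1)

terms <- abs(a - b) / (abs(a) + abs(b))
print(terms)
# The 0-0 pair gives 0/0, which R evaluates to NaN;
# the 1-1 pair gives 0/2, which is 0.

used   <- !is.na(terms)     # is.na() is TRUE for NaN as well as NA
n_used <- sum(used)         # how many terms actually enter the sum
total  <- sum(terms[used])  # same value as sum(terms, na.rm = TRUE)
```

Only four of the five terms enter the sum here, which may matter if dist does anything with the number of terms it keeps.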

Edit: To be consistent with R's definition, the function canberra.distance can be modified as follows:

canberra.distance <- function(a, b){
  sum(abs(a - b) / abs(a + b), na.rm = TRUE)
}

However, the results are the same as before.

1 Answer:

Answer 0: (score: 0)
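
A likely reason the change makes no difference: when no entry is negative, abs(a + b) and abs(a) + abs(b) are equal element by element. The sketch below checks this for the vectors above, then uses two made-up mixed-sign vectors x and y (not part of the original data) to show where the two denominators would differ.

```r
a <- c(0, 1, 0, 0, 1)
b <- c(1, 0, 1, 0, 1)
# For non-negative inputs the two denominators coincide
same_for_question <- identical(abs(a + b), abs(a) + abs(b))

# Hypothetical mixed-sign vectors, chosen only for illustration
x <- c(1, -2)
y <- c(3, 1)
d_abs_each <- sum(abs(x - y) / (abs(x) + abs(y)))  # 2/4 + 3/3
d_abs_sum  <- sum(abs(x - y) / abs(x + y))         # 2/4 + 3/1
```

So the choice of denominator cannot explain the 3 versus 3.75 discrepancy for these particular inputs.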

This may shed some light on the difference. As far as I know, this is the actual code used to compute the distance:

static double R_canberra(double *x, int nr, int nc, int i1, int i2)
{
    double dev, dist, sum, diff;
    int count, j;

    count = 0;
    dist = 0;
    for(j = 0 ; j < nc ; j++) {
        if(both_non_NA(x[i1], x[i2])) {
            sum = fabs(x[i1] + x[i2]);
            diff = fabs(x[i1] - x[i2]);
            if (sum > DBL_MIN || diff > DBL_MIN) {
                dev = diff/sum;
                if(!ISNAN(dev) ||
                   (!R_FINITE(diff) && diff == sum &&
                    /* use Inf = lim x -> oo */ (int) (dev = 1.))) {
                    dist += dev;
                    count++;
                }
            }
        }
        i1 += nr;
        i2 += nr;
    }
    if(count == 0) return NA_REAL;
    if(count != nc) dist /= ((double)count/nc);
    return dist;
}

I think the culprit is this condition:

if(!ISNAN(dev) ||
   (!R_FINITE(diff) && diff == sum &&
    /* use Inf = lim x -> oo */ (int) (dev = 1.)))

It handles a special case that may not be documented.
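
Another statement in the C listing may also be worth checking: `if(count != nc) dist /= ((double)count/nc);` divides the sum by the fraction of columns that were actually used, so the result is scaled up whenever a 0-0 term is skipped. Below is a rough R sketch of that behaviour. The name canberra.rescaled is hypothetical, and the sketch assumes inputs without NA and does not reproduce the special handling of infinite values.

```r
# Sketch of a Canberra distance that rescales when terms are dropped
canberra.rescaled <- function(a, b) {
  num  <- abs(a - b)
  den  <- abs(a + b)
  keep <- num > 0 | den > 0        # drop 0/0 terms, roughly like the DBL_MIN test
  count <- sum(keep)
  if (count == 0) return(NA_real_)
  total <- sum(num[keep] / den[keep])
  total / (count / length(a))      # scale by (columns total) / (columns used)
}

d5 <- canberra.rescaled(c(0, 1, 0, 0, 1), c(1, 0, 1, 0, 1))  # 3 / (4/5)
d4 <- canberra.rescaled(c(0, 1, 0, 0), c(1, 0, 1, 0))        # 3 / (3/4)
d3 <- canberra.rescaled(c(0, 1, 0), c(1, 0, 1))              # nothing dropped
```

Under these assumptions the three values come out as 3.75, 4 and 3, which agree with the dist outputs reported in the question. If I read ?dist correctly, it also mentions that the sum is scaled up proportionally to the number of columns used when some columns are excluded.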