K-means clustering with my own distance function

Time: 2013-05-06 20:34:31

Tags: r distance k-means

I have defined a distance function as follows:

jaccard.rules.dist <- function(x,y) ({
    # Implements the feature distance. Feature "Aerolinea" (airline) gets a different treatment;
    # the rest are booleans coded as 1/0. Airline column distance = 0 if same airline, 1 otherwise.
    # The distance for each remaining attribute is zero iff both are 1, and 1 otherwise.
    airline.column <- which(colnames(x)=="Aerolinea")
    xmod <- x
    ymod <-y
    xmod[airline.column] <-ifelse(x[airline.column]==y[airline.column],1,0)
    ymod[airline.column] <-1 # if they are the same, they are both ones, else they are different

    andval <- sum(xmod&ymod)
    orval <- sum(xmod|ymod)
    return (1-andval/orval)
})

This is a modified form of the Jaccard distance. Here is a small sample of my data frame:

t <- data.frame(Aerolinea=c("A","B","C","A"),atr2=c(1,1,0,0),atr3=c(0,0,0,1))
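
To make the rule concrete, here is a minimal sketch that applies the same and/or logic to plain vectors instead of data frame rows. The helper name `rule_dist` and its argument layout are assumptions made for illustration; they are not part of the original post.

```r
# Hypothetical helper (not from the post): same rule, on plain vectors.
# Position 1 encodes the airline comparison: 1 vs 1 if the airlines match,
# 0 vs 1 otherwise. The remaining positions are the 1/0 attributes as given.
rule_dist <- function(airline_x, airline_y, bits_x, bits_y) {
  same_airline <- as.integer(airline_x == airline_y)
  a <- c(same_airline, bits_x)
  b <- c(1, bits_y)
  1 - sum(a & b) / sum(a | b)
}

# Rows 1 and 2 of the sample: different airlines, identical attributes
d12 <- rule_dist("A", "B", c(1, 0), c(1, 0))   # 1 - 1/2 = 0.5

# Rows 1 and 4 of the sample: same airline, no shared attribute
d14 <- rule_dist("A", "A", c(1, 0), c(0, 1))   # 1 - 1/3
```

With this encoding, two rows with the same airline and the same attributes are at distance 0, and the value is symmetric in its arguments.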

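Now I want to run some k-means clustering on my data set using the distance I just defined. If I try the function kmeans, there is no way to specify my own distance function. I then tried hclust, which accepts a distance matrix, which I computed as follows: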

distmat <- matrix(nrow=nrow(t),ncol=nrow(t))
for (i in 1:nrow(t)) 
    for (j in i:nrow(t)) 
        distmat[j,i] <- jaccard.rules.dist(t[j,],t[i,])
distmat <- as.dist(distmat)
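
For reference, `hclust` works on an object of class `dist`, not on a plain matrix. The sketch below uses a made-up `toy.dist` function and toy data (neither comes from the post) to show one way to check what `as.dist` returns, including the `Size` attribute, which appears to be what the size check in the error message refers to.

```r
# Toy distance on numeric rows (an assumption for illustration only)
toy.dist <- function(a, b) sum(abs(a - b))

x <- matrix(c(1, 1, 0,
              1, 0, 0,
              0, 0, 1), nrow = 3, byrow = TRUE)

# Fill the full pairwise matrix
n <- nrow(x)
m <- matrix(0, nrow = n, ncol = n)
for (i in seq_len(n))
  for (j in seq_len(n))
    m[i, j] <- toy.dist(x[i, ], x[j, ])

d <- as.dist(m)        # keeps only the lower triangle
inherits(m, "dist")    # FALSE: a plain matrix is not a dist object
inherits(d, "dist")    # TRUE
attr(d, "Size")        # 3, the number of observations
```

If `inherits(distmat, "dist")` is `FALSE` at the point where `hclust` is called, the conversion step did not take effect.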

Then I call hclust:

hclust(distmat)

Error in if (is.na(n) || n > 65536L) stop("size cannot be NA nor exceed 65536") : 
missing value where TRUE/FALSE needed
What am I doing wrong? Is there another way to do clustering that simply accepts an arbitrary distance function as its input?

Thanks in advance.

1 Answer:

Answer 0: (score: 2)

I think distmat (from your code) has to be a distance structure, which is not the same thing as a matrix. Try this:

require(proxy)
d <- dist(t, jaccard.rules.dist)
clust <- hclust(d=d)
clust@centers

     [,1]         [,2]
[1,]  0.044128322 -0.039518142
[2,] -0.986798495  0.975132418
[3,] -0.006441892  0.001099211
[4,]  1.487829642  1.000431146
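
One caveat: the object returned by `hclust` is an S3 list (with components such as `merge`, `height`, and `order`) and has no `centers` slot, so flat cluster labels are usually obtained with `cutree`. For something closer in spirit to k-means that works directly on dissimilarities, k-medoids via `pam` from the `cluster` package is one option. The following is a minimal sketch under stated assumptions: the dissimilarity values are toy numbers chosen for illustration, and `k = 2` and average linkage are arbitrary choices, not recommendations from the answer.

```r
library(cluster)  # provides pam(); a recommended package in standard R installs

# Toy symmetric dissimilarities between four items (illustrative values only)
m <- matrix(c(0.0, 0.5, 1.0, 0.6,
              0.5, 0.0, 1.0, 0.9,
              1.0, 1.0, 0.0, 0.8,
              0.6, 0.9, 0.8, 0.0), nrow = 4, byrow = TRUE)
d <- as.dist(m)

# Hierarchical clustering, then cut the tree into k = 2 flat clusters
hc <- hclust(d, method = "average")
groups <- cutree(hc, k = 2)

# k-medoids on the same dissimilarities; each medoid is an actual observation
pm <- pam(d, k = 2, diss = TRUE)
pm$clustering   # cluster label for each item
pm$medoids      # which items serve as cluster representatives
```

Because medoids are existing observations rather than averaged points, this approach never needs to compute a mean, which is the step that ties `kmeans` to Euclidean geometry.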