如何在kproto函数中实现jaccard距离

时间:2019-03-31 12:35:40

标签: r cluster-analysis

我正在尝试在kproto函数(R中的clustMixType包)中实现jaccard的距离,但是没有成功。目的是对我的数据集进行聚类分析。

我要使用的jaccard的距离是jaccard相似系数的补数,所以

distance of jaccard = 1-[a/(a+b+c)] = [(b+c)/(a+b+c)], or
distance of jaccard = 1-[M11/(M01+M10+M11)] = [(M01+M10)/(M01+M10+M11)].

下面介绍了kproto函数的源代码。变量d1是数字变量的欧式距离,变量d2是分类变量(作为因子)与简单匹配系数的距离。

它计算观测值和原型之间的距离,而不是观测值之间的距离。原型是经过计算的,而不是对自身数据集的观察。

所以我的两个问题是 1)d2是我要修改的内容,但是如何修改? 2)d1应该是正在计算的平方根吗?

感谢您的所有帮助。会很感动的。

这是我正在处理的数据集的摘录,其中V1至V4是因子(二进制)变量(以使用jaccard距离),而V5至V8是数值变量(以使用欧氏距离):

V1;V2;V3;V4;V5;V6;V7;V8
1;1;0;0;6;2;3;3
0;1;0;1;3;5;2;1
1;1;0;0;1;3;2;1
1;1;0;0;4;3;3;1
1;1;1;0;1;4;1;1
1;0;1;0;4;3;1;1
1;1;0;0;2;4;2;1
1;1;0;0;2;4;2;1
1;1;0;0;6;2;1;1
1;1;0;0;6;2;2;1
1;1;0;0;5;2;3;1
1;1;0;0;4;3;3;1
1;1;0;0;4;4;2;1
1;1;0;0;4;3;3;1
1;1;0;0;4;3;3;1
1;1;0;0;3;4;2;1
1;1;0;0;4;3;2;1
1;1;0;0;5;2;3;1
1;1;0;0;4;3;4;1
1;1;0;0;4;3;2;1
1;1;0;0;4;3;2;1
1;1;0;0;3;3;2;1
1;1;0;0;3;3;3;1
1;1;0;0;5;2;3;1
1;1;0;0;5;2;3;1
1;1;0;0;5;2;2;1
1;1;0;0;3;3;2;1
1;1;0;0;5;2;3;1
1;1;0;0;5;2;2;1
1;0;0;0;3;4;2;1
1;1;0;0;7;2;4;1
1;1;0;0;7;2;2;1
1;1;0;0;5;2;4;1
1;1;0;0;5;3;4;1
1;1;0;0;5;3;2;1
1;1;0;0;5;3;4;1
1;0;0;0;3;5;3;1
0;1;0;0;6;2;4;1
1;1;0;0;6;2;3;1
1;1;0;0;6;2;4;1

让我们以上面提供的数据集中的前两个观察为例:

V1;V2;V3;V4;V5;V6;V7;V8
1;1;0;0;6;2;3;3
0;1;0;1;3;5;2;1

该算法首先从数据集中随机选择k个原型,因此我们假设第二个观察结果是初始原型之一。据我了解,该算法最初会从提供的数据集中创建k个随机观测值,从而创建一个称为“ protos”的数据框,因此第二个观测值将是“ proto”数据框的一行。

用于聚类观察的组合距离将为d = d1 + lambda * d2。 Lambda也可以是每个变量的权重向量。 d是所提供的数据集中的观测值与最初使用k个随机观测值创建的“原型”矩阵之间的距离。

在这种情况下,考虑到出现的前两个观测,则观测值(yi)与原型(yk)之间的计算距离如下:

Euclidian for the numeric variables (V5 to V8):
 d1=sum[(yij-ykj)^2]^0,5
 where,
    k=1 to k clusters
    i=1 to n observations
    j=5 to 8 th variable
 d1=[[(6-3)^2]+[(2-5)^2]+[(3-2)^2]+[(3-1)^2]]^0,5
 d1=[9+9+1+4]^0,5
 d1=4.796

Jaccard, for the set of binary variables (V1 to V4):
 d2=[(b+c)/(a+b+c)]
 where,
    a=1
    b=1
    c=1
    are correspondences counts between the n observations and the k prototypes, for variables 1 to 4.
 d2=[(1+1)/(1+1+1)]
 d2=2/3
 d2=0.667

因此,该特定观测值与该星团的初始原型之间的总距离为:

d=d1+d2
d=4.796+0.667
d=5.463

据我所知,结果然后逐行存储在名为“ d”的矩阵中,其大小为[行数=观察数,列数=聚类数k]。

我希望能正确计算出欧几里得和雅克卡德的距离,修改kproto函数,并保持原始函数提供的步骤和结果。

注意:最终函数应可用于任何数量的观测值,变量和原型,而不仅适用于我的特定数据集。

我还尝试将kproto(clustMixType软件包)和dist.binary(ade4软件包)中的代码混合使用,但是它们以不同的方式工作。

#K-Prototypes algorithm
kproto.default <- function(x, k, lambda = NULL, iter.max = 100, nstart = 1, na.rm = TRUE, keep.data = TRUE, verbose = TRUE, ...){

  # initial error checks
  if(!is.data.frame(x)) stop("x should be a data frame!")
  if(ncol(x) < 2) stop("For clustering x should contain at least two variables!")
  if(iter.max < 1 | nstart < 1) stop("iter.max and nstart must not be specified < 1!")
  if(!is.null(lambda)){
    if(any(lambda < 0)) stop("lambda must be specified >= 0!")
    if(!any(lambda > 0)) stop("lambda must be specified > 0 for at least one variable!")
    }
  # check for numeric and factor variables
  numvars <- sapply(x, is.numeric)
  anynum <- any(numvars)
  catvars <- sapply(x, is.factor)
  anyfact <- any(catvars)
  if(!anynum) stop("\n No numeric variables in x! Try using kmodes() from package klaR...\n\n")
  if(!anyfact) stop("\n No factor variables in x! Try using kmeans()...\n\n")

  # treatment of missings
  NAcount <- apply(x, 2, function(z) sum(is.na(z)))
  if(verbose){
    cat("# NAs in variables:\n")
    print(NAcount)
  }
  if(any(NAcount == nrow(x))) stop(paste("Variable(s) have only NAs please remove them:",names(NAcount)[NAcount == nrow(x)],"!"))
  if(na.rm) {
    miss <- apply(x, 1, function(z) any(is.na(z)))
    if(verbose){
      cat(sum(miss), "observation(s) with NAs.\n")
      if(sum(miss) > 0) message("Observations with NAs are removed.\n")
      cat("\n")
    } 
    x <- x[!miss,]
    } # remove missings

  if(!na.rm){
    allNAs <- apply(x,1,function(z) all(is.na(z)))
    if(sum(allNAs) > 0){
      if(verbose) cat(sum(allNAs), "observation(s) where all variables NA.\n")
      warning("No meaningful cluster assignment possible for observations where all variables NA.\n")
      if(verbose) cat("\n")

    }
  }

  if(nrow(x) == 1) stop("Only one observation clustering not meaningful.")

  k_input <- k # store input k for nstart > 1 as clusters can be merged 

  # initialize prototypes
  if(!is.data.frame(k)){
    if (length(k) == 1){
      if(as.integer(k) != k){k <- as.integer(k); warning(paste("k has been set to", k,"!"))}
      if(nrow(x) < k) stop("Data frame has less observations than clusters!")
      ids <- sample(nrow(x), k)
      protos <- x[ids,]
    }
    if (length(k) > 1){
      if(nrow(x) < length(k)) stop("Data frame has less observations than clusters!")
      ids <- k
      k <- length(ids)
      if(length(unique(ids)) != length(ids)) stop("If k is specified as a vector it should contain different indices!")
      if(any(ids<1)|any(ids>nrow(x))) stop("If k is specified as a vector all elements must be valid indices of x!")
      #check for integer
      protos <- x[ids,]
    }
    rm(ids)
  }
  if(is.data.frame(k)){
    if(nrow(x) < nrow(k)) stop("Data frame has less observations than clusters!")
    if(length(names(k)) != length(names(x))) stop("k and x have different numbers of columns!")
    if(any(names(k) != names(x))) stop("k and x have different column names!")
    if(anynum) {if( any(sapply(k, is.numeric) != numvars)) stop("Numeric variables of k and x do not match!")}
    if(anyfact) {if( any(sapply(k, is.factor) != catvars)) stop("Factor variables of k and x do not match!")}
    protos <- k
    k <- nrow(protos)
  }
  if(k < 1) stop("Number of clusters k must not be smaller than 1!")

  # automatic calculation of lambda
  if(length(lambda) > 1) {if(length(lambda) != sum(c(numvars,catvars))) stop("If lambda is a vector, its length should be the sum of numeric and factor variables in the data frame!")}
  if(is.null(lambda)){
    if(anynum & anyfact){
      vnum <- mean(sapply(x[,numvars, drop = FALSE], var, na.rm = TRUE))
      vcat <- mean(sapply(x[,catvars, drop = FALSE], function(z) return(1-sum((table(z)/sum(!is.na(z)))^2))))
      if (vnum == 0){
        if(verbose) warning("All numerical variables have zero variance.")
        anynum <- FALSE
      } 
      if (vcat == 0){
        if(verbose) warning("All categorical variables have zero variance.")
        anyfact <- FALSE
      } 
      if(anynum & anyfact){
        lambda <- vnum/vcat
        if(verbose) cat("Estimated lambda:", lambda, "\n\n")
      }else{
        lambda <- 1
      }
    }
  }

  # initialize clusters
  clusters  <- numeric(nrow(x)) 
  tot.dists <- NULL
  moved   <- NULL
  iter <- 1

  # check for any equal prototypes and reduce cluster number in case of occurence
  if(k > 1){
    keep.protos <- rep(TRUE,k)
    for(l in 1:(k-1)){
      for(m in (l+1):k){
        d1 <- sum((protos[l,numvars, drop = FALSE]-protos[m,numvars, drop = FALSE])^2) # euclidean for numerics
        d2 <- sum(protos[l,catvars, drop = FALSE] != protos[m,catvars, drop = FALSE]) # wtd simple matching for categorics 
        if((d1+d2) == 0) keep.protos[m] <- FALSE 
      }
    }
    if(!all(keep.protos)){
      protos <- protos[keep.protos,]
      k <- sum(keep.protos)
      if(verbose) message("Equal prototypes merged. Cluster number reduced to:", k, "\n\n")      
    }
  }

  # special case only one cluster
  if(k == 1){clusters <- rep(1, nrow(x)); size  <- table(clusters); iter <- iter.max} # REM: named vector size is needed later...

  # start iterations for standard case (i.e. k > 1)
  while(iter < iter.max){

    # compute distances 
    nrows <- nrow(x)
    dists <- matrix(NA, nrow=nrows, ncol = k)
    for(i in 1:k){
      #a0 <- proc.time()[3]      
      #d1 <- apply(x[,numvars],1, function(z) sum((z-protos[i,numvars])^2)) # euclidean for numerics
      d1 <- (x[,numvars, drop = FALSE] - matrix(rep(as.numeric(protos[i, numvars, drop = FALSE]), nrows), nrow=nrows, byrow=T))^2
      if(length(lambda) == 1) d1 <- rowSums(d1, na.rm = TRUE)
      if(length(lambda) > 1) d1 <- as.matrix(d1) %*% lambda[numvars]
      #a1 <- proc.time()[3]      
      #d2 <- lambda * apply(x[,catvars],1, function(z) sum((z != protos[i,catvars]))) # wtd simple matching for categorics 
      d2 <- sapply(which(catvars), function(j) return(x[,j] != rep(protos[i,j], nrows)) )
      d2[is.na(d2)] <- FALSE
      if(length(lambda) == 1) d2 <- lambda * rowSums(d2)
      if(length(lambda) > 1) d2 <- as.matrix(d2) %*% lambda[catvars]
      #a2 <- proc.time()[3]      
      dists[,i] <- d1 + d2
      #cat(a1-a0, a2-a1, "\n")
    }

    # assign clusters 
    old.clusters  <- clusters
    # clusters      <- apply(dists, 1, function(z) which.min(z))
    clusters      <- apply(dists, 1, function(z) {a <- which.min(z); if (length(a)>1) a <- sample(a,1); return(a)}) # sample in case of multiple minima
    size          <- table(clusters)  
    min.dists     <- apply(cbind(clusters, dists), 1, function(z) z[z[1]+1])
    within        <- as.numeric(by(min.dists, clusters, sum))
    tot.within    <- sum(within)
    # prevent from empty classes
    #tot.within    <- numeric(k)
    #totw.list     <- by(min.dists, clusters, sum) 
    #tot.within[names(totw.list)] <- as.numeric(totw.list)

    # ...check for empty clusters and eventually reduce number of prototypes    
    if (length(size) < k){
      k <- length(size)
      protos <- protos[1:length(size),]  
      if(verbose) cat("Empty clusters occur. Cluster number reduced to:", k, "\n\n")
    }

    # trace
    tot.dists <- c(tot.dists, sum(tot.within))      
    moved <- c(moved, sum(clusters != old.clusters))

    # compute new prototypes
    remids <- as.integer(names(size))
    for(i in remids){
      protos[which(remids == i), numvars] <- sapply(x[clusters==i, numvars, drop = FALSE], mean, na.rm = TRUE)
      protos[which(remids == i), catvars] <- sapply(x[clusters==i, catvars, drop = FALSE], function(z) levels(z)[which.max(table(z))])
    }

    if(k == 1){clusters <- rep(1, length(clusters)); size <- table(clusters); iter <- iter.max; break}

    # check for any equal prototypes and reduce cluster number in case of occurence
    if(iter == (iter.max-1)){ # REM: for last iteration equal prototypes are allowed. otherwise less prototypes than assigned clusters.
      keep.protos <- rep(TRUE,k)
      for(l in 1:(k-1)){
        for(m in (l+1):k){
          d1 <- sum((protos[l,numvars, drop = FALSE]-protos[m,numvars, drop = FALSE])^2) # euclidean for numerics
          d2 <- sum(protos[l,catvars, drop = FALSE] != protos[m,catvars, drop = FALSE]) # wtd simple matching for categorics 
          if((d1+d2) == 0) keep.protos[m] <- FALSE 
        }
      }
      if(!all(keep.protos)){
        protos <- protos[keep.protos,]
        k <- sum(keep.protos)
        if(verbose) cat("Equal prototypes merged. Cluster number reduced to:", k, "\n\n")      
      }
    }

    # add stopping rules
    if(moved[length(moved)] ==  0) break

    if(k == 1){clusters <- rep(1, length(clusters)); size <- table(clusters); iter <- iter.max; break}

    #cat("iter", iter, "moved", moved[length(moved)], "tot.dists",tot.dists[length(tot.dists)],"\n" )      
    iter <- iter+1
  }


  ### Final update of prototypes and dists
  if(iter == iter.max){ # otherwise there have been no moves anymore and prototypes correspond to cluster assignments 
    # compute new prototypes
    remids <- as.integer(names(size))
    for(i in remids){
      protos[which(remids == i), numvars] <- sapply(x[clusters==i, numvars, drop = FALSE], mean, na.rm = TRUE)
      protos[which(remids == i), catvars] <- sapply(x[clusters==i, catvars, drop = FALSE], function(z) levels(z)[which.max(table(z))])
    }

    # compute distances 
    nrows <- nrow(x)
    dists <- matrix(NA, nrow=nrows, ncol = k)
    for(i in 1:k){
      d1 <- (x[,numvars, drop = FALSE] - matrix(rep(as.numeric(protos[i, numvars, drop = FALSE]), nrows), nrow=nrows, byrow=T))^2
      if(length(lambda) == 1) d1 <- rowSums(d1, na.rm = TRUE)
      if(length(lambda) > 1) d1 <- as.matrix(d1) %*% lambda[numvars]
      d2 <- sapply(which(catvars), function(j) return(x[,j] != rep(protos[i,j], nrows)) )
      d2[is.na(d2)] <- FALSE
      if(length(lambda) == 1) d2 <- lambda * rowSums(d2)
      if(length(lambda) > 1) d2 <- as.matrix(d2) %*% lambda[catvars]
      dists[,i] <- d1 + d2
    }

    size          <- table(clusters)  
    min.dists     <- apply(cbind(clusters, dists), 1, function(z) z[z[1]+1])
    within        <- as.numeric(by(min.dists, clusters, sum))
    tot.within    <- sum(within)
  }


  names(clusters) <- row.names(dists) <- row.names(x)
  rownames(protos) <- NULL
  # create result: 
  res <- list(cluster = clusters,  
              centers = protos, 
              lambda = lambda, 
              size = size,
              withinss = within,
              tot.withinss = tot.within,   
              dists = dists, 
              iter = iter, 
              trace = list(tot.dists = tot.dists, moved = moved))

  # loop: if nstart > 1:
  if(nstart > 1)
    for(j in 2:nstart){
      res.new <- kproto(x=x, k=k_input, lambda = lambda,  iter.max = iter.max, nstart=1, verbose=verbose)
      if(res.new$tot.withinss < res$tot.withinss) res <- res.new
    }  

  if(keep.data) res$data = x
  class(res) <- "kproto"
  return(res)
}

1 个答案:

答案 0 :(得分:0)

我设法修改了功能以接受各种相似性度量,并在有人需要的情况下将R文件上传到http://dx.doi.org/10.17632/63nyn9tjcd.1