Question

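I am trying to use k-means clustering to select the most diverse markers in my population. For example, to select 100 lines, I cluster the whole population into 100 clusters and then pick, from each cluster, the marker closest to the centroid.

The problem with my solution is that it takes too much time (my function probably needs optimizing), especially when the number of markers exceeds 100000.

So I would really appreciate it if anyone could show me a new way to select markers that maximize the diversity in my population, and/or help me optimize my function so that it runs faster.

Thank you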

# example:

library(BLR)
data(wheat)
dim(X)
mdf<-mostdiff(t(X), 100,1,nstart=1000)

This is the latest version of the function I use:

mostdiff <- function(markers, nClust, nMrkPerClust, nstart=1000) {
    transposedMarkers <- as.array(markers)
    mrkClust <- kmeans(transposedMarkers, nClust, nstart=nstart)
    save(mrkClust, file="markerCluster.Rdata")

    # within clusters, pick the markers that are closest to the cluster centroid
    # turn the vector of which markers belong to which clusters into a list nClust long
    # each element of the list is a vector of the markers in that cluster

    clustersToList <- function(nClust, clusters) {
        vecOfCluster <- function(whichClust, clusters) {
            return(which(whichClust == clusters))
        }
        # lapply always returns a list, even when all clusters have the same size
        return(lapply(1:nClust, vecOfCluster, clusters))
    }

    pickCloseToCenter <- function(vecOfCluster, whichClust, transposedMarkers, centers, pickHowMany) {
        clustSize <- length(vecOfCluster)
        # if there are fewer than three markers, the center is equally distant from all so don't bother
        if (clustSize < 3) return(vecOfCluster[1:min(pickHowMany, clustSize)])

        # figure out the distance (squared) between each marker in the cluster and the cluster center
        distToCenter <- function(marker, center){
            diff <- center - marker
            return(sum(diff*diff))
        }

        dists <- apply(transposedMarkers[vecOfCluster,], 1, distToCenter, center=centers[whichClust,])
        return(vecOfCluster[order(dists)[1:min(pickHowMany, clustSize)]])
    }

    # Assumed completion: the code as posted ends after defining the helpers
    # and never calls them. A plausible final step is sketched here.
    clusterList <- clustersToList(nClust, mrkClust$cluster)
    picked <- lapply(1:nClust, function(i) {
        pickCloseToCenter(clusterList[[i]], i, transposedMarkers, mrkClust$centers, nMrkPerClust)
    })
    return(unlist(picked))
}

Answer 1

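You could try something like the code below, but I think the slowest part of your code is actually kmeans itself. For large data sets, you might consider reducing the nstart parameter or, depending on the shape of your data, working on a subset.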

library(plyr)

markers <- data.frame(x=rnorm(1e6), y=rnorm(1e6), z=rnorm(1e6))

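mostdiff <- function(markers, nClust=100, iter.max=1e5) {
    ncols <- ncol(markers)

    km <- kmeans(markers, nClust, iter.max=iter.max)

    markers$cluster <- km$cluster
    # squared distance from each row to the centre of its own cluster
    # (note the trailing comma: index rows of km$centers, not single elements)
    markers$d <- rowSums(apply(
        markers[,1:ncols] - km$centers[markers$cluster,], 2, function(x) x * x
    ))

    # keep, for each cluster, the row(s) with the smallest distance
    result <- subset(
        merge(
            ddply(markers, ~cluster, summarise, d=min(d)),
            markers,
            all.x=TRUE, all.y=FALSE
        ),
        select=-c(d, cluster)
    )

    return(result)
}

mostdiff(markers, 100)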
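
The "closest row to each centroid" step can also be done in base R, without plyr or a merge. This is only a minimal sketch on made-up toy data (the matrix size, the 20 clusters and the variable names are assumptions for illustration, not part of the original answer):

```r
set.seed(1)
markers <- matrix(rnorm(2000 * 3), ncol = 3)   # toy data: 2000 rows, 3 columns
km <- kmeans(markers, centers = 20, nstart = 5, iter.max = 100)

# squared distance from every row to the centre of the cluster it belongs to
d <- rowSums((markers - km$centers[km$cluster, , drop = FALSE])^2)

# for each cluster, the index of the row with the smallest distance
closest <- as.integer(tapply(seq_along(d), km$cluster,
                             function(idx) idx[which.min(d[idx])]))

representatives <- markers[closest, , drop = FALSE]
```

Here `closest` holds one row index per cluster, in cluster order, so it can be used to index the original data directly.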

Answer 2

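If you are looking for outliers in your population, and not necessarily for the "markers" that identify them, I would suggest the Mahalanobis distance. It is generally the go-to tool for outlier identification.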

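k <- 1000 # Number of outliers from the population we want
n <- nrow(x) # x: numeric matrix, one row per individual
ma.dist <- mahalanobis(x, colMeans(x), cov(x))
ix <- order(ma.dist) # row indices sorted by increasing distance
mdf <- x[ix[(n - k + 1):n], ] # the k rows farthest from the centre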

Answer 3

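If kmeans is the most expensive part, you could apply the k-means algorithm to a random subset of your population. As long as the random subset is still large compared to the number of centroids you choose, you will get mostly the same results. Alternatively, you could run several kmeans on several subsets and merge the results.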
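
The random-subset idea might look roughly like this. It is a sketch under assumptions: the toy data, `subsetSize` and the other variable names are made up for illustration, and the step that maps each centroid back to the nearest marker in the full data set is one possible way to finish, not something spelled out above.

```r
set.seed(42)
markers <- matrix(rnorm(20000 * 5), ncol = 5)  # toy data: 20000 markers, 5 features
nClust <- 50
subsetSize <- 2000                             # hypothetical; large relative to nClust

# run k-means on a random subset of the rows only
subsetRows <- sample(nrow(markers), subsetSize)
km <- kmeans(markers[subsetRows, ], centers = nClust, nstart = 5, iter.max = 100)

# for every centroid, find the closest marker in the FULL data set
tMarkers <- t(markers)                         # one marker per column
closest <- apply(km$centers, 1, function(center) {
    which.min(colSums((tMarkers - center)^2))
})

selected <- unique(closest)                    # row indices of the chosen markers
```

Only the clustering runs on the subset; the final lookup is a single pass over all markers per centroid, which is cheap compared to k-means with many restarts.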

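Another option is to try a k-medoids algorithm, which gives you centroids that are themselves members of the population, so the second step of finding the member of each cluster closest to its centroid is no longer needed. It may be slower than k-means, though.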
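
As one way to try k-medoids on a larger data set, the cluster package provides clara(), which runs the medoid search on several random samples, so it happens to combine both suggestions. This is a sketch on toy data with made-up sizes; clara() is a substitute picked here for illustration, not something the answer names.

```r
library(cluster)

set.seed(7)
markers <- matrix(rnorm(5000 * 4), ncol = 4)   # toy data: 5000 markers, 4 features

# k-medoids on 20 random samples of 500 rows each
cl <- clara(markers, k = 25, samples = 20, sampsize = 500)

selected <- cl$i.med                           # row indices of the medoids
representatives <- markers[selected, , drop = FALSE]
```

Because the medoids are actual rows of the data, `cl$i.med` is already the selection and no distance-to-centroid step is required.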

Answer 4

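In case anybody else is trying to do the same thing, here is the answer based on damienfrancois's suggestion. Besides working on the raw data, pam (k-medoids) also lets us supply our own distance matrix, which matters a lot when the marker data contain many missing values.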

library(BLR)

data(wheat)

library(cluster)

pam_out<-pam(t(X),100)

selec.markers<-as.data.frame(colnames(X)[pam_out$id.med])
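
The custom distance matrix mentioned above is not shown in the code, so here is a minimal sketch of how it could be passed to pam. The genotype matrix `geno`, its coding and the share of missing values are invented for illustration; dist() is just one possible choice of distance (it skips missing pairs and rescales the sum).

```r
library(cluster)

set.seed(3)
# toy genotypes: 60 lines (rows) x 200 markers (columns)
geno <- matrix(sample(c(-1, 0, 1), 60 * 200, replace = TRUE), nrow = 60)
colnames(geno) <- paste0("m", seq_len(ncol(geno)))
geno[sample(length(geno), 600)] <- NA          # about 5% missing values

# distance between markers, computed on the available pairs only
d <- dist(t(geno))

# pam treats an object of class "dist" as a dissimilarity matrix
pam_out <- pam(d, k = 10)

selected <- colnames(geno)[pam_out$id.med]     # names of the medoid markers
```

Any other dissimilarity that copes with missing data could be swapped in for dist(), as long as it is handed to pam as a "dist" object or with diss=TRUE.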

What is the best way to select the most dissimilar individuals from a population?
