Elbow method returns multiple values when finding the number of clusters

Time: 2016-06-11 16:03:30

Tags: r machine-learning parallel-processing cluster-analysis

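I am trying to use the Elbow method to find the number of clusters in my data, which is stored in a data frame named "data.clustering". It has two features: age and gender. Its head looks like this: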

> head(data.clustering)
age gender
2 2 1
3 6 2
4 2 1
5 2 1
6 6 2
7 6 1

Here is my code for finding the k value (number of clusters) for the "data.clustering" data frame:

# include library
require(stats)
library(GMD)
library(ggplot2)
library(parallel)
# include function
source('~/Workspaces/Projects/RProject/MovielensCluster/readData.R');
###
elbow.k <- function(mydata){
## determine a "good" k using elbow
dist.obj <- dist(mydata);
hclust.obj <- hclust(dist.obj);
css.obj <- css.hclust(dist.obj,hclust.obj);
elbow.obj <- elbow.batch(css.obj);
# print(elbow.obj)
k <- elbow.obj$k
return(k)
}
# include file
filePath <- "dataset/u.user";
data.original <- readtext.tocsv(filePath);
data.convert <- readtext.convert(filePath);
data.clustering <- data.convert[,c(-1,-4)];
# find k value
no_cores <- detectCores();
cl<-makeCluster(no_cores);
clusterEvalQ(cl, library(GMD));
clusterExport(cl, list("data.clustering", "data.original", "elbow.k", "clustering.kmeans"));
start.time <- Sys.time();
k.clusters <- parSapply(cl, c(1:3), function(x) elbow.k(data.clustering));
end.time <- Sys.time();
cat('Time to find k using Elbow method is',(end.time - start.time),'seconds with k value:', k.clusters);

As you can see, the elbow.k function returns only a single k value. However, after I run the snippet above, the result contains three k values, all identical:

Time to find k using Elbow method is 38.39039 seconds with k value: 10 10 10

The result I expect is:

Time to find k using Elbow method is 38.39039 seconds with k value: 10

Can anyone help me solve this?

1 answer:

Answer 0: (score: 1)

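I think your code works fine. parSapply calls the function once for every element of the vector you pass in, so c(1:3) produces three calls to elbow.k and therefore three (identical) results. You just need to change the line

k.clusters <- parSapply(cl, c(1:3), function(x) elbow.k(data.clustering)); 

to

k.clusters <- parSapply(cl, 1, function(x) elbow.k(data.clustering));

With the second version, the number of k values returned matches what you expect. It is just a simple mistake in your call, not in the elbow.k function itself.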
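
To illustrate the point, here is a minimal sketch of how the length of the input vector determines the number of results from parSapply. The function fake.elbow.k is a hypothetical stand-in for elbow.k (it ignores its input and always returns 10), and the two-worker cluster is an arbitrary choice for the example, so GMD and the original data are not needed to run it:

```r
library(parallel)

# Hypothetical stand-in for elbow.k(): ignores its input and always returns 10.
fake.elbow.k <- function(mydata) 10

# Two workers is an arbitrary choice for this sketch.
cl <- makeCluster(2)
clusterExport(cl, "fake.elbow.k", envir = environment())

# One call per element of the input vector, so three results come back.
k.three <- parSapply(cl, 1:3, function(x) fake.elbow.k(NULL))

# A single-element input produces a single result.
k.one <- parSapply(cl, 1, function(x) fake.elbow.k(NULL))

stopCluster(cl)

print(k.three)
print(k.one)
```

Note that if elbow.k gives the same answer every time for the same data (as the output 10 10 10 suggests), running it on a cluster for a single input probably brings no speed-up. Calling elbow.k(data.clustering) directly should give the same single k value.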