Using an already created kmeans cluster model on a new data set in R

Time: 2016-01-19 15:32:15

Tags: r cluster-computing

I have built a clustering model in R (kmeans):

fit <- kmeans(sales_DP_DCT_agg_tr_bel_mod, 4)

Now I want to use this model to segment a completely new data set. How can I:

  1. Store the model
  2. Run the model on a new data set?

1 Answer:

Answer 0: (score: 1)

Let's assume you are using iris as the data set.

data = iris[,1:4] ## Don't want the categorical feature
model = kmeans(data, 3)

This is what the output looks like:

> model
K-means clustering with 3 clusters of sizes 96, 33, 21

Cluster means:
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1     6.314583    2.895833     4.973958   1.7031250
2     5.175758    3.624242     1.472727   0.2727273
3     4.738095    2.904762     1.790476   0.3523810

Clustering vector:
  [1] 2 3 3 3 2 2 2 2 3 3 2 2 3 3 2 2 2 2 2 2 2 2 2 2 3 3 2 2 2 3 3 2 2 2 3 2 2 2 3 2 2 3 3 2 2 3 2 3 2 2 1 1 1 1 1 1 1 3 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [76] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Within cluster sum of squares by cluster:
[1] 118.651875   6.432121  17.669524
 (between_SS / total_SS =  79.0 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"    "size"         "iter"         "ifault"

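Note that you can access the centroids with model$centers. To classify an incoming sample, all you need to do is find the centroid it is closest to. You can define a Euclidean distance function as follows: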

eucDist <- function(x, y) sqrt(sum( (x-y)^2 ))

Then the classification function, which returns the index of the nearest centroid:

classifyNewSample <- function(newData, centroids = model$centers) {
  dists = apply(centroids, 1, function(y) eucDist(y,newData))
  order(dists)[1]
}

> classifyNewSample(c(7,3,6,2))
[1] 1
> classifyNewSample(c(6,2.7,4.3,1.4))
[1] 1

As for model persistence, see ?save.
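A minimal sketch of storing and reloading the fitted model, assuming it fits in memory and a local file is acceptable. The file names and the seed are hypothetical choices made for illustration; `save()`/`load()` is the route mentioned above, and `saveRDS()`/`readRDS()` is a base R alternative that lets you pick the object name on reload.

```r
# Refit the model from the answer; the seed is an assumption added for reproducibility.
set.seed(42)
model <- kmeans(iris[, 1:4], 3)

# Hypothetical file locations; tempdir() keeps the sketch self-contained.
rdata_path <- file.path(tempdir(), "kmeans_model.RData")
rds_path   <- file.path(tempdir(), "kmeans_model.rds")

# Option 1: save()/load() restores the object under its original name.
save(model, file = rdata_path)
rm(model)
load(rdata_path)              # `model` exists again

# Option 2: saveRDS()/readRDS() lets you choose the name on reload.
saveRDS(model, rds_path)
restored <- readRDS(rds_path)

# Both copies carry the same centroids.
all.equal(model$centers, restored$centers)
```

In a later session, only the `load()` or `readRDS()` call is needed before classifying new samples against the stored centroids.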

Edit:

Applying the prediction function to a new matrix:

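## Generate a random 50x4 matrix with values drawn from 0, 0.1, ..., 10:
n_rows <- 50
n_cols <- 4   # renamed from `c` to avoid shadowing the base function c()
m0 <- matrix(0, n_rows, n_cols)
new_data <- apply(m0, c(1, 2), function(x) sample(seq(0, 10, 0.1), 1))
## Classify each row (one sample per row) against the fitted centroids:
new_labels <- apply(new_data, 1, classifyNewSample)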

> new_labels
[1] 1 2 3 3 2 1 3 1 3 1 2 3 3 1 1 3 1 1 1 3 1 1 1 1 1 1 3 1 1 3 3 1 1 3 2 1 3 2 3 1 2 1 2 1 1 2 1 3 2 1
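The row-by-row classification above can also be written so that a whole matrix or data frame is labelled in one call. The following is a sketch only: `assign_clusters` is a hypothetical helper name, the seed is an assumption, and it presumes the new data has the same columns, in the same order and on the same scale, as the data the model was fitted on.

```r
# Assign each row of `new_data` to its nearest centroid (Euclidean distance).
# `assign_clusters` is a hypothetical helper name, not part of base R.
assign_clusters <- function(new_data, centroids) {
  new_data <- as.matrix(new_data)
  # Squared distance from every row to centroid k, one column per centroid.
  d2 <- vapply(
    seq_len(nrow(centroids)),
    function(k) rowSums(sweep(new_data, 2, centroids[k, ])^2),
    numeric(nrow(new_data))
  )
  d2 <- matrix(d2, nrow = nrow(new_data))  # keep matrix shape for a single row
  max.col(-d2, ties.method = "first")      # column index of the smallest distance
}

set.seed(42)  # assumed seed, only so the sketch is repeatable
model <- kmeans(iris[, 1:4], 3)

# Random 50x4 matrix, as in the answer's edit.
new_data <- matrix(sample(seq(0, 10, 0.1), 50 * 4, replace = TRUE), 50, 4)
new_labels <- assign_clusters(new_data, model$centers)
table(new_labels)
```

Because squared distance preserves the ordering of Euclidean distance, the square root can be skipped when only the nearest centroid is needed.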