How to predict cluster labels for new data in R using a DBSCAN object and a Gower distance matrix

Time: 2017-04-26 02:59:15

Tags: r matrix distance predict dbscan

I am having trouble predicting cluster labels for test data based on a dbscan clustering model fit on training data. I used a Gower distance matrix when creating the model:

> gowerdist_train <- daisy(analdata_train,
                   metric = "gower",
                   stand = FALSE,
                   type = list(asymm = c(5,6)))

Using this Gower distance matrix, the dbscan clustering model was created with:

> sb <- dbscan(gowerdist_train, eps = .23, minPts = 50)
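For anyone who wants to reproduce this setup, here is a self-contained sketch on toy data (the data frame, the asymmetric-binary column index, and the eps/minPts values are all made-up stand-ins for analdata_train and the question's parameters):

```r
library(cluster)   # daisy()
library(dbscan)    # dbscan()

set.seed(42)
# hypothetical stand-in for analdata_train: two numeric columns and one
# asymmetric binary column (column 3), mimicking type = list(asymm = ...)
toy_train <- data.frame(x    = c(rnorm(20, 0), rnorm(20, 5)),
                        y    = c(rnorm(20, 0), rnorm(20, 5)),
                        flag = sample(c(0, 1), 40, replace = TRUE))

gd <- daisy(toy_train, metric = "gower", stand = FALSE, type = list(asymm = 3))

# dbscan() accepts a dist object directly, so the daisy() output can be fed in
model <- dbscan(as.dist(gd), eps = 0.2, minPts = 5)
table(model$cluster)   # 0 = noise, 1..k = clusters
```

Note that because the model only ever saw a dissimilarity object, it has no raw coordinates stored; this is what makes predicting on new data awkward later.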

I then tried to use predict to label the test data set using the dbscan object above:

> predict(sb, newdata = analdata_test, data = analdata_train)

But I received the following error:

  Error in frNN(rbind(data, newdata), eps = object$eps, sort = TRUE, ...) :
    x has to be a numeric matrix

I can guess where this error might come from: it is probably because a Gower distance matrix has not yet been created for the test data. My question is, should I create a single Gower distance matrix for ALL the data (analdata_train + analdata_test) and feed that into predict? And how would the algorithm know which distances are from the test data to the train data, so that it can do the labelling?

In that case, would the newdata argument be the new Gower distance matrix containing ALL (train + test) data, and the data argument in predict the training distance matrix, gowerdist_train?

I am not quite sure how the predict algorithm would distinguish the test from the train data sets in the newly created gowerdist_all matrix.

The two matrices (the new gowerdist for all data and gowerdist_train) obviously do not have the same dimensions. Also, creating a Gower distance matrix for the test data alone makes no sense to me, because the distances have to be relative to the train data, not to the test data itself.
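For contrast, predict() in the dbscan package does work when the model was fit on raw numeric data, because frNN() can then recompute (Euclidean) distances between the new points and the training points; a precomputed dissimilarity object cannot provide that. A minimal sketch on built-in data (the eps/minPts values are arbitrary choices for illustration):

```r
library(dbscan)

tr <- as.matrix(iris[1:100, 1:4])    # "training" rows
te <- as.matrix(iris[101:110, 1:4])  # "new" rows

m <- dbscan(tr, eps = 0.5, minPts = 5)

# this works: frNN() recomputes distances from the raw numeric matrices,
# which is exactly what it cannot do from a Gower dissimilarity object
labels_new <- predict(m, newdata = te, data = tr)
labels_new   # one label per new row; 0 means noise
```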

EDIT:

I tried using a Gower distance matrix for all data (train + test) as my newdata, and received this error when it was used in predict:

> gowerdist_all <- daisy(rbind(analdata_train, analdata_test),
                         metric = "gower",
                         stand = FALSE,
                         type = list(asymm = c(5,6)))
> test_sb_label <- predict(sb, newdata = gowerdist_all, data = gowerdist_train)
  Error in 1:nrow(data) : argument of length 0
  In addition: Warning message:
  In rbind(data, newdata) : number of columns of result is not a multiple of vector length (arg 1)

So, my proposed solution does not work.

1 Answer:

Answer 0 (score: 1)

I decided to write code that uses the kNN algorithm in dbscan to predict cluster labels using the Gower distance matrix. The code is not pretty and definitely not programmatically efficient, but it works. Happy to take any suggestions that could improve it.

The pseudocode is:

1) Calculate a new Gower distance matrix for all data, including test + train.
2) Use the above distance matrix in the kNN function (dbscan package) to determine the k nearest neighbours of each test data point.
3) Determine the cluster labels of all those nearest points for each test point. Some of them will have no cluster label because they are themselves test points.
4) Create a count matrix that counts the frequency of each cluster among the k nearest points of each test point.
5) Use a very simple likelihood calculation to choose the cluster for each test point based on its neighbours' clusters (the maximum frequency). This part also accounts for neighbouring test points: a cluster is chosen for a test point only if it still has the maximum frequency after the number of neighbouring test points is added to each of the other clusters. Otherwise, no cluster is decided for that test point, and it waits for the next iteration, when hopefully more of its neighbouring test points will have had their cluster labels decided from their own neighbours.
6) Repeat the steps above (steps 2-5) until all the clusters are decided.
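Step 5 is the subtle part. Here is a rough, self-contained paraphrase of that decision rule for a single test point (the counts are made-up numbers, and the full code below implements the comparison slightly differently, via which.max over a count matrix):

```r
# neighbour clusters of one test point, counted as:
# 4 still-unlabelled test points (NA), 10 in cluster "1", 5 in cluster "2"
counts <- c("NA" = 4, "1" = 10, "2" = 5)

best <- which.max(counts[-1]) + 1        # best real cluster, ignoring the NA column

# pessimistic check: give every rival cluster all of the undecided neighbours
rivals_with_na <- counts[-c(1, best)] + counts["NA"]

# accept the label only if the best cluster still wins in that worst case;
# otherwise leave the point undecided and retry on the next iteration
decided <- counts[best] >= max(rivals_with_na)
decided  # TRUE: even if all 4 undecided neighbours joined "2", "1" still wins
```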

**Note: this algorithm does not always converge. (Once you do the math it is obvious why.) So, in the code, I break out of the algorithm when the number of unclustered test points has not changed for a while. Then I repeat steps 2-6 with a new knn (change the number of nearest neighbours and run the code again). This ensures more points are involved in deciding each round. I have tried both larger and smaller knn values and both work. It would be good to know which one is better. So far, I have not had to run the code more than twice to decide the clusters of all the test data points.

Here is the code:

#calculate gower distance for all data (test + train)
gowerdist_test <- daisy(all_data[rangeofdataforgowerdist],
                        metric = "gower",
                        stand = FALSE,
                        type = list(asymm = listofasymmvars),
                        weights = Weights)
summary(gowerdist_test) 

Then the code below is used to label the clusters of the test data.

#library(dbscan)
# find the k nearest neighbours for each point and order them by distance
iteration_MAX <- 50
iteration_current <- 0
maxUnclusterRepeatNum <- 10
repeatedUnclustNum <- 0
unclusteredNum <- sum(is.na(all_data$Cluster))
previousUnclustereNum <- sum(is.na(all_data$Cluster))
nn_k = 30 # number of nearest neighbours

while (anyNA(all_data$Cluster) & iteration_current < iteration_MAX) 
{
  if (repeatedUnclustNum >= maxUnclusterRepeatNum) {
    print(paste("Max number of repetitions (", maxUnclusterRepeatNum ,") for the same unclustered data has been reached. Clustering terminated unsuccessfully."))
    invisible(gc())
    break;
  }

      nn_test <- kNN(gowerdist_test, k = nn_k, sort = TRUE)

    # for the TEST points in all data, find the closest TRAIN points and decide statistically which cluster they could belong to, based on the clusters of the nearest TRAIN points
    test_matrix <- nn_test$id[1: nrow(analdata_test),] #create matrix of test data knn id's
    numClusts <- nlevels(as.factor(sb_train$cluster))
    NameClusts <- as.character(levels(as.factor(sb_train$cluster)))
    count_clusters <- matrix(0, nrow = nrow(analdata_test), ncol = numClusts + 1)  #create a count matrix that would count number of clusters + NA
    colnames(count_clusters) <- c("NA", NameClusts) #name each column of the count matrix to cluster numbers

    # get the cluster number of each k nearest neighbour of each test point
    for (i in 1:nrow(analdata_test)) 
      for (j in 1:nn_k)
      {  
        test_matrix[i,j] <- all_data[nn_test$id[i,j], "Cluster"]
      }
    # populate the count matrix for the total clusters of the neighbours for each test point
    for (i in 1:nrow(analdata_test))
      for (j in 1:nn_k)
      {  
       if (!is.na(test_matrix[i,j])) 
           count_clusters[i, c(as.character(test_matrix[i,j]))] <- count_clusters[i, c(as.character(test_matrix[i,j]))] + 1
       else 
          count_clusters[i, c("NA")] <- count_clusters[i, c("NA")] + 1
      }
    # add NA's (TEST points) to the other clusters for comparison
    count_clusters_withNA <- count_clusters
    for (i in 2:ncol(count_clusters))
      {  
      count_clusters_withNA[,i] <- t(rowSums(count_clusters[,c(1,i)]))
    }

    # This block of code decides the maximum cluster count for each row, considering the number of other test points (NA clusters) in the neighbourhood
    max_col_countclusters <- apply(count_clusters,1,which.max) #get the column that corresponds to the maximum value of each row
    for (i in 1:length(max_col_countclusters)) #insert the maximum value of each row in its associated column in count_clusters_withNA
      count_clusters_withNA[i, max_col_countclusters[i]] <- count_clusters[i, max_col_countclusters[i]]
    max_col_countclusters_withNA <- apply(count_clusters_withNA,1,which.max) #get the column that corresponds to the maximum value of each row with NA added 
    compareCountClust <- max_col_countclusters_withNA == max_col_countclusters  #compare the two count matrices
    all_data$Cluster[1:nrow(analdata_test)] <- ifelse(compareCountClust, NameClusts[max_col_countclusters - 1], all_data$Cluster) #you subtract one because of additional NA column


    iteration_current <- iteration_current + 1

    unclusteredNum <- sum(is.na(all_data$Cluster))
    if (previousUnclustereNum == unclusteredNum)
      repeatedUnclustNum <- repeatedUnclustNum + 1
    else {
      repeatedUnclustNum <- 0
      previousUnclustereNum <- unclusteredNum
    }

    print(paste("Iteration: ", iteration_current, " - Number of remaining unclustered:", sum(is.na(all_data$Cluster))))
    if (unclusteredNum == 0)
      print("Cluster labeling successfully Completed.")

    invisible(gc())
}

I guess you could use this with any other type of clustering algorithm; it doesn't matter, as long as you determine the cluster labels of the train data in your all_data before running the code. Hope this helps. Not the most efficient or rigorous code, so happy to see suggestions on how to improve it.
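As a possible simplification (my own assumption, not what the iterative code above does): if the vote-and-wait scheme is not needed, each test point can simply inherit the cluster of its single nearest TRAIN point, read off the test-rows-vs-train-columns block of the full Gower matrix. A self-contained sketch with toy stand-ins for analdata_train/analdata_test (the data and eps/minPts are invented for illustration):

```r
library(cluster)   # daisy()
library(dbscan)    # dbscan()

set.seed(1)
# toy stand-ins for analdata_train / analdata_test (numeric only, for simplicity)
train <- data.frame(x = c(rnorm(15, 0), rnorm(15, 5)),
                    y = c(rnorm(15, 0), rnorm(15, 5)))
test  <- data.frame(x = rnorm(5, 0), y = rnorm(5, 0))

d_train <- daisy(train, metric = "gower")
model   <- dbscan(as.dist(d_train), eps = 0.15, minPts = 3)

# Gower matrix for train + test, then the test-rows-vs-train-columns block
d_all   <- as.matrix(daisy(rbind(train, test), metric = "gower"))
n_train <- nrow(train)
block   <- d_all[(n_train + 1):nrow(d_all), 1:n_train, drop = FALSE]

# each test point inherits the label of its nearest train point (0 = noise)
test_labels <- model$cluster[apply(block, 1, which.min)]
test_labels
```

This is just 1-NN label transfer, so unlike the iterative scheme it never lets test points reinforce each other; whether that matters depends on how many test points sit between clusters.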

*Note: I used t-SNE to compare the clustering of the train vs. the test data, and it looks remarkably clean. So, it appears to be working.