聚类:如何提取最有特色的功能?

时间:2014-01-17 11:53:27

标签: r cluster-analysis text-mining

我有一组文档,我试图根据他们的词汇表进行聚类(即首先使用DocumentTermMatrix命令创建一个语料库然后是一个稀疏矩阵,依此类推)。为了改进群集并更好地理解哪些特征/单词使特定文档落入特定群集,我想知道每个群集的最显着特征是什么。

Lantz的 Machine Learning with R 一书中有一个例子,如果你碰巧知道它 - 他根据他们所盯住的兴趣聚集青少年社交媒体档案,并最终得到像这样的表格显示“每个集群......具有最能将其与其他集群区分开来的特征”:

cluster 1  | cluster 2 | cluster 3 ....
swimming   | band      | sports  ... 
dance      | music     | kissed ....

现在,我的功能并不是那么有用,但是我仍然希望能够构建类似的东西
但是,这本书没有解释表格是如何构建的。我已经尽力去创造性地谷歌了,也许答案是对集群意味着一些明显的计算,但作为R的新手以及统计数据,我无法弄明白。非常感谢任何帮助,包括我之前可能错过的先前问题或其他资源的链接!

感谢。

1 个答案:

答案 0 :(得分:2)

我前段时间遇到过类似的问题..

这是我做的:

require("tm")
require("skmeans")
require("slam")

# clus: a skmeans object
# dtm: a Document Term Matrix
# first: eg. 10 most frequent words per cluster
# unique: if FALSE all words of the DTM will be used
#         if TRUE only cluster specific words will be used 



# result: List with words and frequency of words 
#         If unique = TRUE, only cluster specific words will be considered.
#         Words which occur in more than one cluster will be ignored.



mfrq_words_per_cluster <- function(clus, dtm, first = 10, unique = TRUE){
  if(!any(class(clus) == "skmeans")) return("clus must be an skmeans object")

  dtm <- as.simple_triplet_matrix(dtm)
  indM <- table(names(clus$cluster), clus$cluster) == 1 # generate bool matrix

  hfun <- function(ind, dtm){ # help function, summing up words
    if(is.null(dtm[ind, ]))  dtm[ind, ] else  col_sums(dtm[ind, ])
  }
  frqM <- apply(indM, 2, hfun, dtm = dtm)

  if(unique){
    # eliminate word which occur in several clusters
    frqM <- frqM[rowSums(frqM > 0) == 1, ] 
  }
  # export to list, order and take first x elements 
  res <- lapply(1:ncol(frqM), function(i, mat, first)
                head(sort(mat[, i], decreasing = TRUE), first),
                mat = frqM, first = first)

  names(res) <- paste0("CLUSTER_", 1:ncol(frqM))
  return(res)
}

一个小例子:

data("crude")
dtm <- DocumentTermMatrix(crude, control =
                          list(removePunctuation = TRUE,
                               removeNumbers = TRUE,
                               stopwords = TRUE))

rownames(dtm) <- paste0("Doc_", 1:20)
clus <- skmeans(dtm, 3)


mfrq_words_per_cluster(clus, dtm)
mfrq_words_per_cluster(clus, dtm, unique = FALSE)

HTH