Question

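I want to cluster phrases with similar meaning and plot a dendrogram. I would also like to display a list of the grouped phrases. So far I can only produce a dendrogram labelled with index numbers rather than the phrases themselves. I also have several hundred phrases and would like to show them as a grouped list, sorted with the largest group first.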

strings.to.cluster <- c("how do i find the bus times", 
                    "where do i find the bus time tables", 
                    "where is the bus times",
                    "is there a bus time table", 
                    "where is the bus time table", 
                    "what is the meaning of life", 
                    "the quick brown fox", 
                    "how do i find the bus times", 
                    "where is the bus times")
library(tm)
library(Matrix)
x <- TermDocumentMatrix( Corpus( VectorSource( strings.to.cluster ) ) )
y <- sparseMatrix( i=x$i, j=x$j, x=x$v, dimnames = dimnames(x) )  
plot( hclust(dist(t(y))) )

Answer 1

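If you use the tm package and sparseMatrix, you are turning your strings into words. The matrix then describes words, not sentences. Check what happens if you do not transpose the matrix and use plot(hclust(dist(y))): you will see that the labels are the words, but you never get the sentences.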
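To make the orientation point concrete, here is a minimal base R sketch that builds a small word-by-phrase count matrix by hand. It is only a stand-in for what tm produces, not tm's implementation, and the names `phrases`, `vocab` and `tdm` are made up for this illustration.

```r
phrases <- c("where is the bus times",
             "where is the bus time table",
             "the quick brown fox")

# Split each phrase into words and count every vocabulary word per phrase.
words <- strsplit(phrases, " ", fixed = TRUE)
vocab <- sort(unique(unlist(words)))
tdm <- vapply(words,
              function(w) as.integer(table(factor(w, levels = vocab))),
              integer(length(vocab)))
dimnames(tdm) <- list(vocab, phrases)  # rows are words, columns are phrases

# dist() compares rows, so the orientation decides what gets clustered.
word.tree   <- hclust(dist(tdm))     # leaves are words
phrase.tree <- hclust(dist(t(tdm)))  # leaves are phrases

word.tree$labels
phrase.tree$labels
```

Because the column names here are the phrases themselves, the transposed version carries them through as leaf labels. In the question's code the documents are only named by position, which is why index numbers show up.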

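Using the stringdist package we can calculate the distance between all of the sentences and then pass that distance matrix to hclust. With the option useNames = "strings", the strings are added as labels to the distance matrix, and these are then used as the labels in the hclust object.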

cl <- hclust(stringdist::stringdistmatrix(strings.to.cluster, method = "cosine", useNames = "strings"))
plot(cl)
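For the grouped list the question asks about, one possible approach is to cut the tree with cutree() and split the phrases by group. This is a sketch rather than part of the original answer. It assumes the stringdist package is installed, and the number of groups `k = 3` is an arbitrary choice for this small sample that would need tuning for several hundred phrases.

```r
strings.to.cluster <- c("how do i find the bus times",
                        "where do i find the bus time tables",
                        "where is the bus times",
                        "is there a bus time table",
                        "where is the bus time table",
                        "what is the meaning of life",
                        "the quick brown fox",
                        "how do i find the bus times",
                        "where is the bus times")

# Same clustering as above (assumes stringdist is installed).
d  <- stringdist::stringdistmatrix(strings.to.cluster, method = "cosine",
                                   useNames = "strings")
cl <- hclust(d)

# Cut the tree into k groups; k = 3 is an assumption, not a recommendation.
k <- 3
membership <- cutree(cl, k = k)

# Collect the phrases of each group, then put the largest group first.
groups <- split(strings.to.cluster, membership)
groups <- groups[order(lengths(groups), decreasing = TRUE)]

print(groups)
```

cutree() also accepts a height, as in cutree(cl, h = ...), which may be more convenient than fixing the number of groups when the number of phrases is large.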

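If you are more interested in clusters of individual words, you may want to look at the functions available in the quanteda package. In that case, be sure to read up on topic modelling as well.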

Clustering phrases with similar meaning in R
