Question

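I have a corpus built from two text documents, and a DocumentTermMatrix created from it, in which I want to find correlations between words. Whichever word I pick, findAssocs returns a correlation of 1 for every word in the corpus. Why is that?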

Here is an excerpt of my code:

library(tm)
library(SnowballC)
doc <- Corpus(DirSource("C:/Users/biat/Documents/customersatis"))

toSpace <- content_transformer(function(x,pattern) {return (gsub(pattern, " ", x))})

doc <- tm_map(doc, toSpace, "-")
doc <- tm_map(doc, toSpace, ":")
doc <- tm_map(doc, removePunctuation)
doc <- tm_map(doc,content_transformer(tolower))
doc <- tm_map(doc,removeNumbers)
doc <- tm_map(doc,removeWords,stopwords("swedish"))
doc <- tm_map(doc,stripWhitespace)
doc <- tm_map(doc, PlainTextDocument)
doc <- tm_map(doc, stemDocument, "swedish")

dtm <- DocumentTermMatrix(doc)
findAssocs(dtm,"active",0.1)

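When I run this, the result implies that the term "active" is correlated with all 560 other words, as shown below, which is not actually the case.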

$active
  admin    actions    all   analysis arrends   
      1          1      1          1       1 .........    
   ...................................................        

............................ website  workshops  
                                   1          1

Answer 1

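As scoa noted, you probably have only two documents, with both terms occurring in them, which results in a correlation of 1. With two documents, each term's count vector has just two entries, so the Pearson correlation between any two terms whose counts differ across the documents can only be 1 or -1.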
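To see why two documents force this result, here is a minimal base-R sketch. The count matrix is made-up toy data (not the asker's corpus), and the term names are only borrowed from the output above for illustration. It computes plain Pearson correlations between term columns, first with two documents and then with a hypothetical third one added.

```r
# Toy term counts (hypothetical data): rows are documents, columns are terms.
counts <- matrix(c(3, 1, 5,
                   0, 4, 2),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2"),
                                 c("active", "admin", "website")))

# With two documents, each term vector has two points, so every
# pairwise correlation is exactly 1 or -1 (up to floating-point error).
cor_two_docs <- cor(counts)
print(cor_two_docs)

# Adding a third (hypothetical) document lets correlations move off +/-1.
counts3 <- rbind(counts, doc3 = c(1, 1, 0))
cor_three_docs <- cor(counts3)
print(cor_three_docs)
```

Since findAssocs only reports terms whose correlation is at or above the given limit, the terms that correlate at -1 are dropped and the remaining ones all show 1.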

Try collapsing the documents before turning them into a corpus:

# Join the pieces with a space so words at the boundaries are not glued together
text <- paste(unlist(text), collapse = " ")

findAssocs (from tm) returns all the correlations as a list.
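As a rough illustration of that return value, the sketch below builds a small in-memory corpus with tm and calls findAssocs on it. The four short documents are invented for the example and are not the asker's data; they are only there so that more than two documents contribute to each correlation.

```r
library(tm)

# Four made-up documents (hypothetical text, for illustration only).
texts <- c("active admin website",
           "active admin",
           "website workshops",
           "active workshops analysis")

corpus <- VCorpus(VectorSource(texts))
dtm <- DocumentTermMatrix(corpus)

# findAssocs returns a named list with one element per queried term;
# each element is a named numeric vector of correlations >= the limit.
assocs <- findAssocs(dtm, "active", 0.1)
str(assocs)
print(assocs$active)
```

With more than two documents, the reported values are no longer forced to 1, so the list becomes informative about which terms actually co-occur.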
