Question

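I've just started working with the tm package in R.

This is probably a simple question, but I'm trying to use the findAssocs function to explore word associations in my customer enquiry insight documents, and I can't seem to get findAssocs to work correctly.

When I use the following: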

findAssocs(dtm, words, corlimit = 0.30)
 $population
  numeric(0)

 $migration
 numeric(0)

What does this mean? words is a character vector of 667 words. Surely there must be some correlations among them?

Answer 1

Consider the following example:

library(tm)
corp <- VCorpus(VectorSource(
          c("hello world", "hello another World ", "and hello yet another world")))
tdm <- TermDocumentMatrix(corp)
inspect(tdm)
#          Docs
# Terms     1 2 3
#   and     0 0 1
#   another 0 1 1
#   hello   1 1 1
#   world   1 1 1
#   yet     0 0 1

Now consider

findAssocs(x=tdm, terms=c("hello", "yet"), corlimit=.4)
# $hello
# numeric(0)
# 
# $yet
#     and another 
#     1.0     0.5

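As I understand it, findAssocs looks at the correlations of hello with everything except hello and yet, and of yet with everything except hello and yet. The correlation coefficient of yet and and is 1.0, which is above the lower limit of 0.4. yet also appears in 50% of all documents containing another, which is also above our 0.4 limit.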
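To make those numbers concrete, here is a minimal sketch that recomputes the correlations for yet with base R's cor. It assumes the counts printed by inspect(tdm) above and retypes them into a plain matrix (the names m, others and assoc are only illustrative); it is not findAssocs' actual implementation, just the same arithmetic done by hand.

```r
# Term-document counts copied from the inspect(tdm) output above.
# With a real tm object, as.matrix(tdm) should give an equivalent matrix.
m <- matrix(c(0, 0, 1,
              0, 1, 1,
              1, 1, 1,
              1, 1, 1,
              0, 0, 1),
            nrow = 5, byrow = TRUE,
            dimnames = list(Terms = c("and", "another", "hello", "world", "yet"),
                            Docs  = c("1", "2", "3")))

# Correlate the counts of "yet" with the counts of every other term.
# suppressWarnings hides the zero-standard-deviation warning for hello/world.
others <- setdiff(rownames(m), "yet")
assoc  <- suppressWarnings(
  sapply(others, function(term) cor(m["yet", ], m[term, ]))
)

round(assoc, 2)
# and = 1, another = 0.5, hello = NA, world = NA
```

Only and and another clear corlimit = 0.4, which matches the $yet entry returned by findAssocs above.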

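Here is another example that shows this:

findAssocs(x=tdm, terms=c("yet", "another"), corlimit=0)
# $yet
# and
#   1
#
# $another
# and
# 0.5

Note that hello (and world) do not yield any results because they occur in every document. This means their term frequency has zero variance, and cor under the hood yields NA (as in cor(rep(1,3), 1:3), which gives NA plus a zero-standard-deviation warning).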
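The zero-variance case can be checked in isolation. The sketch below assumes the counts for hello and yet from the inspect(tdm) output and uses only base R; the variable names are illustrative.

```r
# Counts across the three documents, copied from the inspect(tdm) output.
hello <- c(1, 1, 1)  # present once in every document
yet   <- c(0, 0, 1)

# A term with the same count in every document has no spread at all.
sd(hello)  # 0

# cor() cannot divide by a zero standard deviation, so it returns NA
# (and would emit a warning, suppressed here).
r <- suppressWarnings(cor(hello, yet))
is.na(r)   # TRUE
```

An NA never passes the corlimit comparison, so such a term ends up with numeric(0). The same result also appears when a term simply has no correlation at or above corlimit, so lowering corlimit is a reasonable first thing to try on real data.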

Word associations - findAssocs and numeric(0)
