Question

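I created a document-term matrix in R with the tm package, but some of the words in my corpus get lost somewhere in the process.

I'll explain with an example. Let's say I have this small corpus: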

library(tm)
crps <- " more hours to my next class bout to go home and go night night"
crps <- VCorpus(VectorSource(crps))

When I use DocumentTermMatrix() from the tm package, it returns the following:

dm <- DocumentTermMatrix(crps)
dm_matrix <- as.matrix(dm)
dm_matrix
# Terms
# Docs and bout class home hours more next night
# 1   1    1     1    1     1    1    1     2

However, what I want (and expected) is:

# Docs and bout class home hours more next night my  go to
#  1   1    1     1    1     1    1    1     2   1   2  1

Why does DocumentTermMatrix() skip "my", "go" and "to"? Is there a way to control and fix this behavior?

Answer 1

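By default, DocumentTermMatrix() discards words shorter than three characters. As a result, the words to, my and go are not considered when the document-term matrix is built.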
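To make the effect of that length filter concrete, here is a minimal base-R sketch. It is only an illustration and does not reproduce how tm tokenizes internally; the names txt, words, kept, dropped and counts_kept are made up for this example.

```r
# Illustration only: plain whitespace splitting, not tm's tokenizer.
txt <- " more hours to my next class bout to go home and go night night"

# Split the text into words on runs of whitespace.
words <- strsplit(trimws(txt), "\\s+")[[1]]

# Mimic a minimum word length of 3, as in wordLengths = c(3, Inf).
kept    <- words[nchar(words) >= 3]
dropped <- unique(words[nchar(words) < 3])

# Frequencies of the words that survive the filter.
counts_kept <- table(kept)

dropped      # the short words that the filter removes
counts_kept  # the counts that remain
```

The short words in dropped are the ones missing from the matrix in the question.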

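On the help page ?DocumentTermMatrix you can see an optional argument named control. This argument has default values for many settings (see the help page ?termFreq for details). One of these defaults is a minimum word length of three characters, i.e. wordLengths = c(3, Inf). You can change this option to keep all words regardless of their length: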

dm <- DocumentTermMatrix(crps, control = list(wordLengths = c(1, Inf)))

inspect(dm)
# <<DocumentTermMatrix (documents: 1, terms: 11)>>
# Non-/sparse entries: 11/0
# Sparsity           : 0%
# Maximal term length: 5
# Weighting          : term frequency (tf)
#
#    Terms
# Docs and bout class go home hours more my next night to
#    1   1    1     1  2    1     1    1  1    1     2  2
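
If you want to check which terms the length filter was removing, one option is to build the matrix both ways and compare the term lists. This is a sketch that assumes the tm package is installed; dm_default, dm_all and recovered are illustrative names, not part of tm.

```r
library(tm)

# Same one-document corpus as in the question.
crps <- VCorpus(VectorSource(
  " more hours to my next class bout to go home and go night night"
))

# Default control settings versus a minimum word length of 1.
dm_default <- DocumentTermMatrix(crps)
dm_all     <- DocumentTermMatrix(crps, control = list(wordLengths = c(1, Inf)))

# Terms present only when short words are allowed.
recovered <- setdiff(Terms(dm_all), Terms(dm_default))
recovered
```

Here recovered should contain exactly the short words that were missing from the first matrix.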
