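Computing a word co-occurrence matrix in R

Time: 2016-11-07 11:24:32

Tags: r text-mining

I want to compute a word co-occurrence matrix in R. I have the following data frame of sentences: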

dat <- as.data.frame("The boy is tall.", stringsAsFactors = FALSE)
dat[2,1] <- c("The girl is short.")
dat[3,1] <- c("The tall boy and the short girl are friends.")

which gives me

The boy is tall.
The girl is short.
The tall boy and the short girl are friends.

What I want to do is first list all the unique words across the three sentences, i.e.

The
boy
is
tall
girl
short
and
are
friends

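Then I want to create a word co-occurrence matrix that counts how many times two words appear together in the same sentence. It would look like this:

          The  boy  is  tall  girl  short  and  are  friends
The         0    2   2     2     2      2    1    1        1
boy         2    0   1     2     1      1    1    1        1
is          2    1   0     1     1      1    0    0        0
tall        2    2   1     0     1      1    1    1        1
etc.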

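The same goes for all the other words, and a word cannot co-occur with itself. Note that in sentence 3 the word "the" appears twice; the solution should count co-occurrences with "the" only once for that sentence.

Does anyone know how I could do this? I am working with a data frame of about 3000 sentences.

1 answer:

Answer 0 (score: 4)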

library(tm)
# Build the example data: one sentence per row
dat      <- as.data.frame("The boy is tall.", stringsAsFactors = FALSE)
dat[2,1] <- c("The girl is short.")
dat[3,1] <- c("The tall boy and the short girl are friends.")

# VectorSource takes a plain character vector; recent tm versions require
# doc_id/text columns for DataframeSource
ds  <- Corpus(VectorSource(dat[, 1]))
dtm <- DocumentTermMatrix(ds, control=list(wordLengths=c(1,Inf)))

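X         <- as.matrix(dtm)  # dense document-term matrix of word counts
X[X > 0]  <- 1               # count each word at most once per sentence
out       <- crossprod(X)    # Same as: t(X) %*% X
diag(out) <- 0               # remove self co-occurrences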
out
        Terms
Terms    boy friend girl short tall the
  boy      0      1    1     1    2   2
  friend   1      0    1     1    1   1
  girl     1      1    0     2    1   2
  short    1      1    2     0    1   2
  tall     2      1    1     1    0   2
  the      2      1    2     2    2   0
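For reference, the same idea can be sketched in base R without tm. This is a minimal sketch, not the code that produced the output above; the names `sentences`, `tokens`, `vocab` and `cooc` are illustrative assumptions. It builds a binary sentence-by-word matrix, so a word repeated within one sentence is counted only once, and then takes the cross product.

```r
sentences <- c("The boy is tall.",
               "The girl is short.",
               "The tall boy and the short girl are friends.")

# Lower-case, strip punctuation, split on whitespace, and keep each word
# at most once per sentence.
tokens <- lapply(strsplit(gsub("[[:punct:]]", "", tolower(sentences)), "\\s+"),
                 unique)

# All distinct words, in order of first appearance.
vocab <- unique(unlist(tokens))

# Binary incidence matrix: one row per sentence, one column per word,
# 1 if the word occurs in the sentence and 0 otherwise.
X <- t(vapply(tokens,
              function(w) as.numeric(vocab %in% w),
              numeric(length(vocab))))
colnames(X) <- vocab

cooc <- crossprod(X)   # cooc[i, j] = number of sentences containing both words
diag(cooc) <- 0        # a word does not co-occur with itself
cooc
```

Because the entries of `X` are 0 or 1, each entry of `t(X) %*% X` adds 1 for every sentence in which both words occur, which is the count asked for in the question.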

You may also want to remove stop words such as "the", i.e.

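ds <- tm_map(ds, content_transformer(tolower))  # so that "The" also matches the stop word "the"
ds <- tm_map(ds, stripWhitespace)
ds <- tm_map(ds, removePunctuation)
ds <- tm_map(ds, stemDocument)
ds <- tm_map(ds, removeWords, c("the", stopwords("english")))
ds <- tm_map(ds, removeWords, c("the", stopwords("spanish")))  # only relevant if the text contains Spanish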
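
With around 3000 sentences, as in the question, a dense document-term matrix can grow large. One possible alternative, sketched here as an assumption rather than as part of the original answer, is to keep the incidence matrix sparse using the Matrix package. The variable names are illustrative, and the tokenisation is a simple base-R stand-in for the tm preprocessing.

```r
library(Matrix)

sentences <- c("The boy is tall.",
               "The girl is short.",
               "The tall boy and the short girl are friends.")

# Distinct lower-cased words per sentence, punctuation removed.
tokens <- lapply(strsplit(gsub("[[:punct:]]", "", tolower(sentences)), "\\s+"),
                 unique)
vocab  <- sort(unique(unlist(tokens)))

# Sparse binary incidence matrix: row = sentence, column = word.
X <- sparseMatrix(
  i        = rep(seq_along(tokens), lengths(tokens)),  # sentence index per token
  j        = match(unlist(tokens), vocab),             # word index per token
  x        = rep(1, length(unlist(tokens))),
  dims     = c(length(tokens), length(vocab)),
  dimnames = list(NULL, vocab)
)

cooc <- crossprod(X)   # sparse word-by-word co-occurrence counts
diag(cooc) <- 0        # drop self co-occurrences
cooc
```

Only the nonzero entries are stored, so memory use scales with the number of distinct word pairs that actually co-occur rather than with the square of the vocabulary size.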