Question

I am trying to extract 3-grams from the Nirvana lyrics below, and I am using the ngramrr package for this.

require(ngramrr)
require(tm)
require(magrittr)

nirvana <- c("hello hello hello how low", "hello hello hello how low",
             "hello hello hello how low", "hello hello hello",
             "with the lights out", "it's less dangerous", "here we are now", "entertain us",
             "i feel stupid", "and contagious", "here we are now", "entertain us",
             "a mulatto", "an albino", "a mosquito", "my libido", "yeah", "hey yay")

ngramrr(nirvana[1], ngmax = 3)

Corpus(VectorSource(nirvana))

I get this result:

[1] "hello"             "hello"             "hello"             "how"               "low"               "hello hello"       "hello hello"      
 [8] "hello how"         "how low"           "hello hello hello" "hello hello how"   "hello how low"

I would like to know how to build a TermDocumentMatrix whose terms are the list of trigrams.

Thanks

Answer 1

My comment above is almost a complete answer, but with the quanteda package it goes like this:

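library(quanteda) # tokens(), tokens_ngrams(), dfm(), convert()
library(magrittr) # %>%

nirvana %>% tokens() %>% # generate tokens
  tokens_ngrams(n = 1:3, concatenator = " ") %>% # 1- to 3-grams (older quanteda: tokens(ngrams = 1:3)); use n = 3 for trigrams only
  dfm() %>% # generate dfm
  convert(to = "tm") %>% # convert to tm's document-term-matrix
  t() # transpose it to term-document-matrix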
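
If you would rather stay within tm, the same idea can be sketched with a custom tokenizer passed through the control list. This is only a minimal sketch under some assumptions: trigram_tokenizer is a hypothetical helper name (not part of tm or ngramrr), the lyrics vector is shortened for brevity, and a VCorpus is used because the SimpleCorpus that Corpus() may return in recent tm versions ignores custom tokenizers.

```r
library(tm)  # VCorpus(), VectorSource(), TermDocumentMatrix(); attaches NLP for ngrams() and words()

# Shortened version of the lyrics vector from the question
nirvana <- c("hello hello hello how low", "with the lights out",
             "here we are now", "entertain us", "yeah")

# Hypothetical helper (not part of tm): split one document into 3-word terms.
# vapply() returns character(0) for documents with fewer than three words.
trigram_tokenizer <- function(x) {
  vapply(ngrams(words(x), 3L), paste, character(1), collapse = " ")
}

# VCorpus rather than Corpus, so that the custom tokenizer is honoured
corpus <- VCorpus(VectorSource(nirvana))
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = trigram_tokenizer))

inspect(tdm)
```

Documents shorter than three words ("entertain us", "yeah") contribute no terms, so their columns should simply be empty. Swapping 3L for 1:3 in ngrams() should give unigrams through trigrams, as in the quanteda version.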
